nips nips2005 nips2005-5 knowledge-graph by maker-knowledge-mining

5 nips-2005-A Computational Model of Eye Movements during Object Class Detection


Source: pdf

Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky

Abstract: We present a computational model of human eye movements in an object class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a computational model of human eye movements in an object class detection task. [sent-10, score-1.371]

2 The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. [sent-11, score-1.665]

3 We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). [sent-12, score-1.518]

4 We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance. [sent-13, score-1.046]

5 Introduction Object detection is one of our most common visual operations. [sent-15, score-0.335]

6 Whether we are driving [1], making a cup of tea [2], or looking for a tool on a workbench [3], hundreds of times each day our visual system is being asked to detect, localize, or acquire, through movements of gaze, objects and patterns in the world. [sent-16, score-0.572]

7 In the human behavioral literature, this topic has been extensively studied in the context of visual search. [sent-17, score-0.496]

8 In a typical search task, observers are asked to indicate, usually by button press, whether a specific target is present or absent in a visual display (see [4] for a review). [sent-18, score-0.527]

9 A bedrock finding in this literature is that, for targets that cannot be defined by a single visual feature, target detection times increase linearly with the number of nontargets, a form of clutter or "set size" effect. [sent-20, score-0.609]

10 Moreover, the slope of the function relating detection speed to set size is steeper (by roughly a factor of two) when the target is absent from the scene compared to when it is present. [sent-21, score-0.499]

11 Search theorists have interpreted these findings as evidence for visual attention moving serially from one object to the next, with the human detection operation typically limited to those objects fixated by this "spotlight" of attention [5]. [sent-22, score-1.225]

12 Object class detection has also been extensively studied in the computer vision community, with faces and cars being the two most well researched object classes [6, 7, 8, 9]. [sent-23, score-0.664]

13 The related but simpler task of object class recognition (target recognition without localization) has also been the focus of exciting recent work [10, 11, 12]. [sent-24, score-0.471]

14 Scenes are typically realistic and highly cluttered, with object appearance varying greatly due to illumination, view, and scale changes. [sent-26, score-0.298]

15 The task addressed in this paper falls between the class detection and recognition problems. [sent-27, score-0.348]

16 Like object class detection, we will be detecting and localizing class-defined targets; unlike object class detection, the test images will be composed of at most 20 objects appearing on a simple background. [sent-28, score-1.218]

17 Both the behavioral and computer vision literatures have strengths and weaknesses when it comes to understanding human object class detection. [sent-29, score-0.864]

18 Moreover, this literature has focused almost entirely on object-specific detection, cases in which the observer knows precisely how the target will appear in the test display (see [15] for a discussion of target non-specific search using featurally complex objects). [sent-31, score-0.504]

19 The current study draws upon the strengths of both of these literatures to produce the first joint behavioral-computational study of human object class detection. [sent-33, score-0.718]

20 First, we use an eyetracker to quantify human behavior in terms of the number of fixations made during an object class detection task. [sent-34, score-0.985]

21 Then we introduce a computational model that not only performs the detection task at a level comparable to that of the human observers, but also generates a sequence of simulated eye movements similar in pattern to those made by humans performing the identical detection task. [sent-35, score-1.343]

22 Experimental methods An effort was made to keep the human and model experiments methodologically similar. [sent-37, score-0.376]

23 Both experiments used training, validation (practice trials in the human experiment), and testing phases, and identical images were presented to the model and human subjects in all three of these phases. [sent-38, score-0.867]

24 The target class consisted of 378 teddy bears scanned from [16]. [sent-39, score-0.493]

25 Nontargets consisted of 2,975 objects selected from the Hemera Photo Objects Collection. [sent-40, score-0.224]

26 Samples of the bear and nontarget objects are shown in Figure 1. [sent-41, score-0.387]

27 Figure 1: Representative teddy bears (left) and nontarget objects (right). [sent-43, score-0.51]

28 In the case of the human experiment, each of these objects was shown centered on a white background and displayed for 1 second. [sent-45, score-0.482]

29 No objects were repeated between training and testing, and no objects were repeated within either the training or the testing phase. [sent-47, score-0.442]

30 Test images depicted 6, 13, or 20 color objects randomly positioned on a white background. [sent-48, score-0.327]

31 Human subjects were instructed to indicate, by pressing a button, whether a teddy bear appeared among the displayed objects. [sent-50, score-0.297]

32 Each test trial in the human experiment began with the subject fixating gaze at the center of the display, and eye position was monitored throughout each trial using an eyetracker. [sent-52, score-0.816]

33 Model of eye movements during object class detection. Figure 2: The flow of processing through our model. [sent-55, score-1.028]

34 Retina transform With each change in gaze position (set initially to the center of the image), our model transforms the input image so as to reflect the acuity limitations imposed by the human retina. [sent-60, score-0.61]

35 We used the method described in [19, 20], which was shown to provide a close approximation to human acuity limitations, to implement this dynamic retina transform. [sent-61, score-0.338]
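
The paper implements this transform with the published method of [19, 20]; the sketch below is only a rough, assumed stand-in that blurs the image progressively with eccentricity from the current gaze point (the function name, parameters, and banded blur scheme are illustrative, not the authors' code).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retina_transform(image, gaze_xy, fovea_radius=64, max_sigma=8.0, levels=5):
    """Crude foveation sketch: blur grows with eccentricity from the gaze point.

    `image` is a 2-D grayscale array; `gaze_xy` is the (x, y) fixation in pixels.
    The real model uses the acuity-limited transform of [19, 20], not this code.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(xs - gaze_xy[0], ys - gaze_xy[1])          # eccentricity (pixels)
    # Assign each pixel to one of `levels` blur bands; band 0 is the sharp fovea.
    band = np.clip((ecc - fovea_radius) / fovea_radius, 0, levels - 1).astype(int)
    out = image.astype(float).copy()
    for k, sigma in enumerate(np.linspace(0.0, max_sigma, levels)):
        if sigma > 0:
            out[band == k] = gaussian_filter(image.astype(float), sigma)[band == k]
    return out
```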

36 Create target map Each point on the target map ranges in value between 0 and 1 and indicates the likelihood that a target is located at that point. [sent-64, score-0.623]

37 2), then compare the features surrounding these points to features of the target object class extracted during training. [sent-67, score-0.756]

38 Two types of discriminative features were used in this study: color features and texture features. [sent-68, score-0.395]

39 Color features Color has long been used as a feature for instance-level object recognition [21]. [sent-72, score-0.486]

40 In our study we explore the potential use of color as a discriminative feature for an object class. [sent-73, score-0.449]

41 Given a test image, It, and its color feature, Ht, we compute the distances between Ht and the color features of the training set {Hi, i = 1, 2, ...}. [sent-79, score-0.376]
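
As a concrete illustration of a color feature of this kind, the snippet below builds a normalized hue histogram for an image region and scores it by its smallest distance to the training histograms {Hi}. The choice of hue, the bin count, and the chi-square distance are assumptions made for the sketch; the extracted sentences do not specify these details.

```python
import numpy as np
import cv2  # OpenCV, assumed available

def hue_histogram(bgr_patch, bins=32):
    """Normalized hue histogram of an image patch (one possible color feature H)."""
    hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
    hist, _ = np.histogram(hsv[..., 0], bins=bins, range=(0, 180))
    return hist / max(hist.sum(), 1)

def color_distance(test_hist, training_hists):
    """Smallest chi-square distance between H_t and the training set {H_i}."""
    eps = 1e-9
    return min(0.5 * np.sum((test_hist - h) ** 2 / (test_hist + h + eps))
               for h in training_hists)
```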

42 Texture features Local texture features were extracted from the gray-level images during both training and testing. [sent-87, score-0.353]

43 Following the method described in [11, 12], we used AdaBoost during training to select a small set of SIFT features from among all the SIFT features computed for each sample in the training set. [sent-91, score-0.292]

44 Eventually, T features were chosen having the best ability to discriminate the target object class from the nontargets. [sent-99, score-0.651]

45 Each of these selected features forms a weak classifier, hk, consisting of three components: a feature vector fk, a distance threshold θk, and an output label uk. [sent-100, score-0.237]
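
The following sketch shows how such an AdaBoost-selected weak classifier could be evaluated on the SIFT descriptors extracted from an image region, with the usual weighted vote over the T selected features. The class layout, the Euclidean nearest-descriptor test, and the alpha weights are assumptions in the spirit of [11, 12], not the authors' implementation.

```python
import numpy as np

class WeakClassifier:
    """h_k = (f_k, theta_k, u_k): a selected SIFT feature, a distance threshold,
    and the label (+1 target / -1 nontarget) emitted when the threshold test succeeds."""
    def __init__(self, f_k, theta_k, u_k, alpha_k=1.0):
        self.f_k, self.theta_k, self.u_k, self.alpha_k = f_k, theta_k, u_k, alpha_k

    def __call__(self, descriptors):
        # Distance from f_k to the closest SIFT descriptor found in the region.
        d = np.min(np.linalg.norm(descriptors - self.f_k, axis=1))
        return self.u_k if d < self.theta_k else -self.u_k

def texture_score(descriptors, weak_classifiers):
    """AdaBoost-style weighted vote of the T selected weak classifiers."""
    return sum(h.alpha_k * h(descriptors) for h in weak_classifiers)
```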

46 Validation A validation set, consisting of the practice trials viewed by the human observers, was used to set parameters in the model. [sent-108, score-0.396]

47 Wtexture = c / (c + t) (4). The final combined output was used to generate the values in the target map and, ultimately, to guide the model’s simulated eye movements. [sent-114, score-0.607]
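
One plausible reading of Eq. (4) is that the texture and color evidence at each target-map point are blended with weights Wtexture = c/(c + t) and 1 - Wtexture, where c and t are quantities the model estimates (their definitions are not included among the extracted sentences above). The snippet below only illustrates that reading and is not taken from the paper.

```python
def combined_value(color_value, texture_value, c, t):
    """Blend color and texture evidence at one target-map point (assumed form)."""
    w_texture = c / (c + t)          # Eq. (4)
    return w_texture * texture_value + (1.0 - w_texture) * color_value
```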

48 Recognition We define the highest-valued point on the target map as the hotspot. [sent-117, score-0.227]

49 If the hotspot value exceeds the high target-present threshold, then the object will be recognized as an instance of the target class. [sent-119, score-0.639]

50 If the hotspot value falls below the target-absent threshold, then the object will be classified as not belonging to the target class. [sent-120, score-0.639]
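
The decision logic described in these two sentences amounts to a three-way test on the hotspot value; a minimal sketch follows. The threshold names are placeholders, presumably set on the validation set along with the model's other parameters.

```python
def detection_decision(target_map, present_thresh, absent_thresh):
    """Return 'present', 'absent', or 'continue' from the current target map."""
    hotspot_value = target_map.max()
    if hotspot_value >= present_thresh:
        return "present"    # recognized as an instance of the target class
    if hotspot_value <= absent_thresh:
        return "absent"     # classified as not belonging to the target class
    return "continue"       # hand control to the eye movement stage
```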

51 This constraint was introduced so as to avoid extremely high false positive rates stemming from the creation of false targets in the blurred periphery of the retina-transformed image. [sent-123, score-0.242]

52 Eye movement If neither the target-present nor the target-absent thresholds are satisfied, processing passes to the eye movement stage of our model. [sent-126, score-0.639]

53 If the simulated fovea is not on the hotspot, the model will make an eye movement to move gaze steadily toward the hotspot location. [sent-127, score-0.946]

54 Fixation in our model is defined as the centroid of activity on the target map, a computation consistent with a neuronal population code. [sent-128, score-0.262]

55 Eventually, this thresholding operation will cause the centroid of the target map to pass an eye movement threshold, resulting in a gaze shift to the new centroid location. [sent-130, score-0.977]

56 See [18] for details regarding the eye movement generation process. [sent-131, score-0.464]
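
A minimal sketch of the centroid-of-activity idea is given below: the next fixation is placed at the weighted centroid of supra-threshold target-map activity, and raising the activity threshold over time pulls that centroid toward the hotspot. The threshold schedule and stopping rule follow [18] in the actual model and are omitted here.

```python
import numpy as np

def next_fixation(target_map, activity_thresh):
    """Weighted centroid of supra-threshold target-map activity (population code)."""
    mask = target_map >= activity_thresh
    if not mask.any():                        # fall back to the hotspot itself
        mask = target_map == target_map.max()
    ys, xs = np.nonzero(mask)
    w = target_map[ys, xs]
    if w.sum() == 0:                          # degenerate map: unweighted centroid
        w = None
    return np.array([np.average(xs, weights=w), np.average(ys, weights=w)])  # (x, y)
```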

57 If the simulated fovea does acquire the hotspot and the target-present threshold is still not met, the model will assume that a nontarget was fixated and this object will be "zapped". [sent-132, score-0.768]

58 Zapping consists of applying a negative Gaussian filter to the hotspot location, thereby preventing attention and gaze from returning to this object (see [24] for a previous computational implementation of a conceptually related operation). [sent-133, score-0.707]
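
A sketch of this inhibition-of-return step, assuming the target map is a 2-D array with values in [0, 1]; the width and depth of the negative Gaussian are illustrative parameters (the paper later notes that too small a zap window caused refixations).

```python
import numpy as np

def zap(target_map, hotspot_xy, sigma=20.0, depth=1.0):
    """Suppress a rejected object by subtracting a Gaussian bump at the hotspot."""
    h, w = target_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    bump = depth * np.exp(-((xs - hotspot_xy[0]) ** 2 + (ys - hotspot_xy[1]) ** 2)
                          / (2.0 * sigma ** 2))
    return np.clip(target_map - bump, 0.0, 1.0)
```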

59 Experimental results Model and human behavior were compared on a variety of measures, including error rates, number of fixations, cumulative probability of fixating the target, and scanpath ratio (a measure of how directly gaze moved to the target). [sent-135, score-0.733]

60 For each measure, the model and human data were in reasonable agreement. [sent-136, score-0.343]

61 Table 1: Error rates for model and human subjects. [sent-137, score-0.379]

62 Table 1 shows the error rates for the human subjects and the model, grouped by misses and false positives. [sent-142, score-0.549]

63 Note that the data from all eight of the human subjects are shown, resulting in the greater number of total trials. [sent-143, score-0.398]

64 First, despite the very high level of accuracy exhibited by the human subjects in this task, our model was able to [...]. Table 2: Average number of fixations by model and human. [sent-145, score-0.48]

65 Second, and consistent with the behavioral search literature, miss rates were larger than false positive rates for both the humans and model. [sent-163, score-0.352]

66 To the extent that our model offers an accurate account of human object detection behavior, it should be able to predict the average number of fixations made by human subjects in the detection task. [sent-164, score-1.516]

67 Data are grouped by target-present (p), target-absent (a), and the number of objects in the scene (6, 13, 20). [sent-166, score-0.221]

68 In all conditions, the model and human subjects made comparable numbers of fixations. [sent-167, score-0.472]

69 Also consistent with the behavioral literature, the average number of fixations made by human subjects in our task increased with the number of objects in the scenes, and the rate of this increase was greater in the target-absent data compared to the target-present data. [sent-168, score-0.692]

70 The fact that our model is able to capture an interaction between set size and target presence in terms of the number of fixations needed for detection lends support to our method. [sent-170, score-0.466]

71 Plotted are the cumulative probabilities of fixating the target as a function of the number of objects fixated during the search task. [sent-173, score-0.473]
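
This measure can be computed directly from the eye movement records; the sketch below assumes each target-present trial records how many objects were fixated up to and including the target (np.inf if the target was never fixated), and is ordinary analysis code rather than anything taken from the paper.

```python
import numpy as np

def cumulative_prob_target_fixated(objects_to_target, max_objects=10):
    """P(target fixated by the time k objects have been fixated), k = 1..max_objects."""
    n = np.asarray(objects_to_target, dtype=float)
    ks = np.arange(1, max_objects + 1)
    return ks, np.array([(n <= k).mean() for k in ks])
```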

72 When the scene contained only 6 or 13 objects, the model and the humans fixated roughly the same number of nontargets before finally shifting gaze to the target. [sent-174, score-0.417]

73 When the scene was more cluttered (20 objects), the model fixated an average of 1 additional nontarget relative to the human subjects, a difference likely indicating a liberal bias in our human subjects under these search conditions. [sent-175, score-1.034]

74 Overall, these analyses suggest that our model was not only making the same number of fixations as humans, but it was also fixating the same number of nontargets during search as our human subjects. [sent-176, score-0.524]

75 Table 3: Comparison of model and human scanpath distance (columns: #Objects, Human, Model, MODEL). [sent-177, score-0.54]

76 Human gaze does not jump randomly from one item to another during search, but instead moves in a more orderly way toward the target. [sent-186, score-0.231]

77 The ultimate test of our model would be to reproduce this orderly movement of gaze. [sent-187, score-0.222]

78-79 Scanpath ratio was computed as the ratio between the summed distance traveled by the eye and the distance between the target and the center of the image (i.e., the minimum distance that the eye would need to travel to fixate the target). [sent-191, score-0.316] [sent-193, score-0.381]
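
A straightforward way to compute this ratio from a trial's fixation sequence is sketched below; the variable names are assumptions, and the paper's exact computation (e.g., handling of the final saccade) may differ in detail.

```python
import numpy as np

def scanpath_ratio(fixations, target_xy, start_xy):
    """Summed saccade length divided by the straight start-to-target distance.

    `fixations` is the ordered list of (x, y) fixation points made after the
    initial central fixation `start_xy`; a ratio near 1 means gaze moved
    almost directly to the target.
    """
    pts = np.asarray([start_xy] + list(fixations), dtype=float)
    traveled = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    shortest = np.linalg.norm(np.asarray(target_xy, float) - np.asarray(start_xy, float))
    return traveled / shortest
```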

80 As indicated in Table 3, the model and human data are in close agreement in the 6- and 13-object scenes, but not in the 20-object scenes. [sent-194, score-0.384]

81 Upon closer inspection of the data, we found several cases in which the model made multiple fixations between two nontarget objects, a very unnatural behavior arising from too small a setting for our Gaussian "zap" window. [sent-195, score-0.254]

82 When these 6 trials were removed, the model data (MODEL) and the human data were in closer agreement. [sent-196, score-0.374]

83 Model data are shown in thick red lines, and human data in thin green lines. [sent-198, score-0.302]

84 Figure 4 shows representative scanpaths from the model and one human subject for two search scenes. [sent-199, score-0.487]

85 Although the scanpaths do not align perfectly, there is a qualitative agreement between the human and model in the path followed by gaze to the target. [sent-200, score-0.605]

86 Very often, we need to search for dogs, or chairs, or pens, without any clear idea of the visual features comprising these objects. [sent-203, score-0.292]

87 Despite the prevalence of these tasks, the problem of object class detection has attracted surprisingly little research within the behavioral community [15], and has been applied to a relatively narrow range of objects within the computer vision literature [6, 7, 8, 9]. [sent-204, score-0.982]

88 First, we provide a detailed eye movement analysis of human behavior in an object class detection task. [sent-206, score-1.416]

89 Second, we incorporate state-of-the-art computer vision object detection methods into a biologically plausible model of eye movement control, then validate this model by comparing its behavior to the behavior of our human observers. [sent-207, score-1.535]

90 Computational models capable of describing human eye movement behavior are extremely rare [25]; the fact that the current model was able to do so for multiple eye movement measures lends strength to our approach. [sent-208, score-1.356]

91 Moreover, our model was able to detect targets nearly as well as the human observers while maintaining a low false positive rate, a difficult standard to achieve in a generic detection model. [sent-209, score-0.762]

92 Such agreement between human and model suggests that simple color and texture features may be used to guide human attention and eye movement in an object class detection task. [sent-210, score-2.094]

93 Future computational work will explore the generality of our object class detection method to tasks with visually complex backgrounds, and future human work will attempt to use neuroimaging techniques to localize object class representations in the brain. [sent-211, score-1.352]

94 In what ways do eye movements contribute to everyday activities. [sent-226, score-0.429]

95 A statistical method for 3d object detection applied to faces and cars. [sent-250, score-0.52]

96 Rapid object detection using a boosted cascade of simple features. [sent-256, score-0.52]

97 Weak hypotheses and boosting for generic object detection and recognition. [sent-279, score-0.555]

98 Efficient visual search by category: Specifying the features that mark the difference between artifacts and animal in preattentive vision. [sent-307, score-0.292]

99 ), Neurobiology of attention, chapter Specifying the components of attention in a visual search task, pages 395–400. [sent-327, score-0.275]

100 Algorithms for defining visual regions-of-interest: comparison with eye fixations. [sent-371, score-0.445]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('eye', 0.332), ('human', 0.302), ('object', 0.298), ('detection', 0.222), ('xations', 0.214), ('gaze', 0.182), ('objects', 0.18), ('hotspot', 0.172), ('target', 0.169), ('scanpath', 0.148), ('movement', 0.132), ('nontarget', 0.129), ('xated', 0.129), ('teddy', 0.123), ('xating', 0.123), ('color', 0.115), ('sift', 0.114), ('visual', 0.113), ('nontargets', 0.107), ('zelinsky', 0.107), ('features', 0.105), ('ht', 0.098), ('movements', 0.097), ('subjects', 0.096), ('adaboost', 0.082), ('behavioral', 0.081), ('false', 0.079), ('class', 0.079), ('bears', 0.078), ('bear', 0.078), ('search', 0.074), ('observers', 0.07), ('texture', 0.07), ('vision', 0.065), ('hayhoe', 0.064), ('validation', 0.063), ('map', 0.058), ('literature', 0.057), ('attention', 0.055), ('scenes', 0.055), ('brook', 0.055), ('stony', 0.055), ('centroid', 0.052), ('behavior', 0.051), ('cumulative', 0.05), ('hue', 0.049), ('land', 0.049), ('orderly', 0.049), ('image', 0.049), ('classi', 0.049), ('distance', 0.049), ('cluttered', 0.049), ('targets', 0.048), ('simulated', 0.048), ('weak', 0.047), ('recognition', 0.047), ('humans', 0.046), ('consisted', 0.044), ('hi', 0.043), ('thresholds', 0.043), ('backgrounds', 0.043), ('training', 0.041), ('threshold', 0.041), ('model', 0.041), ('visually', 0.041), ('scene', 0.041), ('agreement', 0.041), ('rao', 0.039), ('fovea', 0.039), ('literatures', 0.039), ('scanpaths', 0.039), ('xation', 0.039), ('slope', 0.037), ('volume', 0.037), ('acuity', 0.036), ('button', 0.036), ('misses', 0.036), ('cvpr', 0.036), ('feature', 0.036), ('rates', 0.036), ('boosting', 0.035), ('ers', 0.035), ('er', 0.035), ('display', 0.035), ('lends', 0.034), ('made', 0.033), ('pages', 0.033), ('ijcv', 0.033), ('localize', 0.033), ('november', 0.033), ('spie', 0.033), ('validated', 0.033), ('histogram', 0.032), ('images', 0.032), ('trials', 0.031), ('normalized', 0.031), ('representative', 0.031), ('absent', 0.03), ('appearing', 0.03), ('itti', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection

Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky

Abstract: We present a computational model of human eye movements in an object class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.

2 0.3472614 193 nips-2005-The Role of Top-down and Bottom-up Processes in Guiding Eye Movements during Visual Search

Author: Gregory Zelinsky, Wei Zhang, Bing Yu, Xin Chen, Dimitris Samaras

Abstract: To investigate how top-down (TD) and bottom-up (BU) information is weighted in the guidance of human search behavior, we manipulated the proportions of BU and TD components in a saliency-based model. The model is biologically plausible and implements an artificial retina and a neuronal population code. The BU component is based on featurecontrast. The TD component is defined by a feature-template match to a stored target representation. We compared the model’s behavior at different mixtures of TD and BU components to the eye movement behavior of human observers performing the identical search task. We found that a purely TD model provides a much closer match to human behavior than any mixture model using BU information. Only when biological constraints are removed (e.g., eliminating the retina) did a BU/TD mixture model begin to approximate human behavior.

3 0.21024296 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

Author: Nicolas Loeff, Himanshu Arora, Alexander Sorokin, David Forsyth

Abstract: We describe a novel method for learning templates for recognition and localization of objects drawn from categories. A generative model represents the configuration of multiple object parts with respect to an object coordinate system; these parts in turn generate image features. The complexity of the model in the number of features is low, meaning our model is much more efficient to train than comparative methods. Moreover, a variational approximation is introduced that allows learning to be orders of magnitude faster than previous approaches while incorporating many more features. This results in both accuracy and localization improvements. Our model has been carefully tested on standard datasets; we compare with a number of recent template models. In particular, we demonstrate state-of-the-art results for detection and localization. 1

4 0.19884619 131 nips-2005-Multiple Instance Boosting for Object Detection

Author: Cha Zhang, John C. Platt, Paul A. Viola

Abstract: A good image object detection algorithm is accurate, fast, and does not require exact locations of objects in a training set. We can create such an object detector by taking the architecture of the Viola-Jones detector cascade and training it with a new variant of boosting that we call MILBoost. MILBoost uses cost functions from the Multiple Instance Learning literature combined with the AnyBoost framework. We adapt the feature selection criterion of MILBoost to optimize the performance of the Viola-Jones cascade. Experiments show that the detection rate is up to 1.6 times better using MILBoost. This increased detection rate shows the advantage of simultaneously learning the locations and scales of the objects in the training set along with the parameters of the classifier. 1

5 0.16289821 169 nips-2005-Saliency Based on Information Maximization

Author: Neil Bruce, John Tsotsos

Abstract: A model of bottom-up overt attention is proposed based on the principle of maximizing information sampled from a scene. The proposed operation is based on Shannon's self-information measure and is achieved in a neural circuit, which is demonstrated as having close ties with the circuitry existent in the primate visual cortex. It is further shown that the proposed saliency measure may be extended to address issues that currently elude explanation in the domain of saliency based models. Resu lts on natural images are compared with experimental eye tracking data revealing the efficacy of the model in predicting the deployment of overt attention as compared with existing efforts.

6 0.15723568 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

7 0.14256732 34 nips-2005-Bayesian Surprise Attracts Human Attention

8 0.13296761 151 nips-2005-Pattern Recognition from One Example by Chopping

9 0.13088614 98 nips-2005-Infinite latent feature models and the Indian buffet process

10 0.12963137 149 nips-2005-Optimal cue selection strategy

11 0.12810238 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise

12 0.12170541 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

13 0.12111335 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection

14 0.09580192 136 nips-2005-Noise and the two-thirds power Law

15 0.094604403 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects

16 0.08939448 110 nips-2005-Learning Depth from Single Monocular Images

17 0.089274354 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

18 0.088015884 141 nips-2005-Norepinephrine and Neural Interrupts

19 0.086248629 79 nips-2005-Fusion of Similarity Data in Clustering

20 0.085018732 203 nips-2005-Visual Encoding with Jittering Eyes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.249), (1, -0.061), (2, 0.029), (3, 0.393), (4, -0.031), (5, 0.244), (6, 0.109), (7, 0.141), (8, -0.269), (9, -0.073), (10, 0.124), (11, 0.015), (12, -0.081), (13, 0.182), (14, 0.01), (15, 0.007), (16, -0.215), (17, -0.079), (18, 0.136), (19, 0.033), (20, -0.056), (21, -0.049), (22, -0.007), (23, -0.031), (24, -0.002), (25, 0.022), (26, -0.028), (27, 0.047), (28, -0.013), (29, -0.07), (30, 0.067), (31, -0.03), (32, 0.015), (33, -0.028), (34, 0.063), (35, -0.033), (36, 0.045), (37, 0.079), (38, -0.027), (39, -0.003), (40, 0.011), (41, 0.03), (42, -0.039), (43, -0.08), (44, -0.085), (45, 0.002), (46, 0.031), (47, -0.035), (48, 0.0), (49, -0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97444171 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection

Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky

Abstract: We present a computational model of human eye movements in an object class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.

2 0.87393147 193 nips-2005-The Role of Top-down and Bottom-up Processes in Guiding Eye Movements during Visual Search

Author: Gregory Zelinsky, Wei Zhang, Bing Yu, Xin Chen, Dimitris Samaras

Abstract: To investigate how top-down (TD) and bottom-up (BU) information is weighted in the guidance of human search behavior, we manipulated the proportions of BU and TD components in a saliency-based model. The model is biologically plausible and implements an artificial retina and a neuronal population code. The BU component is based on featurecontrast. The TD component is defined by a feature-template match to a stored target representation. We compared the model’s behavior at different mixtures of TD and BU components to the eye movement behavior of human observers performing the identical search task. We found that a purely TD model provides a much closer match to human behavior than any mixture model using BU information. Only when biological constraints are removed (e.g., eliminating the retina) did a BU/TD mixture model begin to approximate human behavior.

3 0.65707022 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

Author: Nicolas Loeff, Himanshu Arora, Alexander Sorokin, David Forsyth

Abstract: We describe a novel method for learning templates for recognition and localization of objects drawn from categories. A generative model represents the configuration of multiple object parts with respect to an object coordinate system; these parts in turn generate image features. The complexity of the model in the number of features is low, meaning our model is much more efficient to train than comparative methods. Moreover, a variational approximation is introduced that allows learning to be orders of magnitude faster than previous approaches while incorporating many more features. This results in both accuracy and localization improvements. Our model has been carefully tested on standard datasets; we compare with a number of recent template models. In particular, we demonstrate state-of-the-art results for detection and localization. 1

4 0.64490604 169 nips-2005-Saliency Based on Information Maximization

Author: Neil Bruce, John Tsotsos

Abstract: A model of bottom-up overt attention is proposed based on the principle of maximizing information sampled from a scene. The proposed operation is based on Shannon's self-information measure and is achieved in a neural circuit, which is demonstrated as having close ties with the circuitry existent in the primate visual cortex. It is further shown that the proposed saliency measure may be extended to address issues that currently elude explanation in the domain of saliency based models. Resu lts on natural images are compared with experimental eye tracking data revealing the efficacy of the model in predicting the deployment of overt attention as compared with existing efforts.

5 0.60452271 34 nips-2005-Bayesian Surprise Attracts Human Attention

Author: Laurent Itti, Pierre F. Baldi

Abstract: The concept of surprise is central to sensory processing, adaptation, learning, and attention. Yet, no widely-accepted mathematical theory currently exists to quantitatively characterize surprise elicited by a stimulus or event, for observers that range from single neurons to complex natural or engineered systems. We describe a formal Bayesian definition of surprise that is the only consistent formulation under minimal axiomatic assumptions. Surprise quantifies how data affects a natural or artificial observer, by measuring the difference between posterior and prior beliefs of the observer. Using this framework we measure the extent to which humans direct their gaze towards surprising items while watching television and video games. We find that subjects are strongly attracted towards surprising locations, with 72% of all human gaze shifts directed towards locations more surprising than the average, a figure which rises to 84% when considering only gaze targets simultaneously selected by all subjects. The resulting theory of surprise is applicable across different spatio-temporal scales, modalities, and levels of abstraction.

6 0.58906764 131 nips-2005-Multiple Instance Boosting for Object Detection

7 0.56891078 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

8 0.54526913 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection

9 0.52913147 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise

10 0.50187808 151 nips-2005-Pattern Recognition from One Example by Chopping

11 0.48653439 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

12 0.43460569 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

13 0.43313754 203 nips-2005-Visual Encoding with Jittering Eyes

14 0.42418647 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects

15 0.40540519 149 nips-2005-Optimal cue selection strategy

16 0.39459226 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning

17 0.38346216 98 nips-2005-Infinite latent feature models and the Indian buffet process

18 0.38032389 35 nips-2005-Bayesian model learning in human visual perception

19 0.37384725 156 nips-2005-Prediction and Change Detection

20 0.31842035 79 nips-2005-Fusion of Similarity Data in Clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.069), (10, 0.042), (25, 0.012), (27, 0.024), (31, 0.029), (34, 0.061), (39, 0.442), (55, 0.015), (69, 0.051), (73, 0.041), (88, 0.076), (91, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91298079 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection

Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky

Abstract: We present a computational model of human eye movements in an object class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.

2 0.78389943 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity

Author: Robert A. Legenstein, Wolfgang Maass

Abstract: We investigate under what conditions a neuron can learn by experimentally supported rules for spike timing dependent plasticity (STDP) to predict the arrival times of strong “teacher inputs” to the same neuron. It turns out that in contrast to the famous Perceptron Convergence Theorem, which predicts convergence of the perceptron learning rule for a simplified neuron model whenever a stable solution exists, no equally strong convergence guarantee can be given for spiking neurons with STDP. But we derive a criterion on the statistical dependency structure of input spike trains which characterizes exactly when learning with STDP will converge on average for a simple model of a spiking neuron. This criterion is reminiscent of the linear separability criterion of the Perceptron Convergence Theorem, but it applies here to the rows of a correlation matrix related to the spike inputs. In addition we show through computer simulations for more realistic neuron models that the resulting analytically predicted positive learning results not only hold for the common interpretation of STDP where STDP changes the weights of synapses, but also for a more realistic interpretation suggested by experimental data where STDP modulates the initial release probability of dynamic synapses.
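For concreteness, the sketch below implements a standard pair-based STDP rule with exponential learning windows. It is a generic illustration, not the convergence criterion or the dynamic-synapse variant analyzed in the paper, and the amplitudes, time constants, and spike times are arbitrary illustrative values.

```python
# Generic pair-based STDP: pre-before-post potentiates, post-before-pre depresses.
import numpy as np

def stdp_weight_change(pre_times, post_times, a_plus=0.01, a_minus=0.012,
                       tau_plus=20.0, tau_minus=20.0):
    """Net weight change from all pre/post spike pairings (times in ms)."""
    dw = 0.0
    for t_pre in pre_times:
        for t_post in post_times:
            dt = t_post - t_pre
            if dt > 0:                              # pre precedes post: potentiation
                dw += a_plus * np.exp(-dt / tau_plus)
            elif dt < 0:                            # post precedes pre: depression
                dw -= a_minus * np.exp(dt / tau_minus)
    return dw

# A presynaptic spike that reliably precedes the "teacher"-driven postsynaptic
# spike by 5 ms is strengthened on average.
print(stdp_weight_change(pre_times=[10.0], post_times=[15.0]))
```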

3 0.74942905 193 nips-2005-The Role of Top-down and Bottom-up Processes in Guiding Eye Movements during Visual Search

Author: Gregory Zelinsky, Wei Zhang, Bing Yu, Xin Chen, Dimitris Samaras

Abstract: To investigate how top-down (TD) and bottom-up (BU) information is weighted in the guidance of human search behavior, we manipulated the proportions of BU and TD components in a saliency-based model. The model is biologically plausible and implements an artificial retina and a neuronal population code. The BU component is based on feature contrast. The TD component is defined by a feature-template match to a stored target representation. We compared the model’s behavior at different mixtures of TD and BU components to the eye movement behavior of human observers performing the identical search task. We found that a purely TD model provides a much closer match to human behavior than any mixture model using BU information. Only when biological constraints are removed (e.g., eliminating the retina) did a BU/TD mixture model begin to approximate human behavior.
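A hedged sketch of the mixture idea follows: a single guidance map formed as a weighted sum of normalized TD and BU maps, with one weight controlling the TD/BU proportion. The maps, weight, and grid size are invented for illustration, and the model's retina and population-code stages are not reproduced here.

```python
# Toy TD/BU mixture: guidance = w_td * TD + (1 - w_td) * BU on normalized maps.
import numpy as np

def mixed_saliency(td_map, bu_map, w_td=1.0):
    """Combine a TD feature-template match map with a BU feature-contrast map;
    w_td = 1 corresponds to a purely top-down model."""
    td = td_map / (td_map.max() + 1e-12)   # normalize each map to [0, 1]
    bu = bu_map / (bu_map.max() + 1e-12)
    return w_td * td + (1.0 - w_td) * bu

rng = np.random.default_rng(0)
td, bu = rng.random((64, 64)), rng.random((64, 64))
guidance = mixed_saliency(td, bu, w_td=0.7)
next_fixation = np.unravel_index(np.argmax(guidance), guidance.shape)
print(next_fixation)  # candidate location of the next simulated fixation
```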

4 0.69994551 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio

Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures.
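The sketch below illustrates the general ingredients in code: the hinge loss, a crude exp(-hinge) pseudo-likelihood reading of it, and a cost-weighted hinge for asymmetric losses. It is a generic illustration with made-up costs, not the interval-valued score-to-probability mapping proposed in the paper.

```python
# Hinge loss, a pseudo-likelihood view of it, and a cost-weighted variant.
import numpy as np

def hinge_loss(score, y):
    """Standard hinge loss for label y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score)

def asymmetric_hinge(score, y, c_pos=5.0, c_neg=1.0):
    """Cost-weighted hinge: here errors on the positive class cost more."""
    cost = c_pos if y == +1 else c_neg
    return cost * hinge_loss(score, y)

def pseudo_likelihood(score):
    """exp(-hinge) reading: an unnormalized belief that y = +1 given the score."""
    return np.exp(-hinge_loss(score, +1))

for s in (-2.0, 0.0, 2.0):
    print(s, hinge_loss(s, +1), asymmetric_hinge(s, +1), pseudo_likelihood(s))
```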

5 0.54922634 149 nips-2005-Optimal cue selection strategy

Author: Vidhya Navalpakkam, Laurent Itti

Abstract: Survival in the natural world demands the selection of relevant visual cues to rapidly and reliably guide attention towards prey and predators in cluttered environments. We investigate whether our visual system selects cues that guide search in an optimal manner. We formally obtain the optimal cue selection strategy by maximizing the signal to noise ratio (SNR) between a search target and surrounding distractors. This optimal strategy successfully accounts for several phenomena in visual search behavior, including the effect of target-distractor discriminability, uncertainty in target’s features, distractor heterogeneity, and linear separability. Furthermore, the theory generates a new prediction, which we verify through psychophysical experiments with human subjects. Our results provide direct experimental evidence that humans select visual cues so as to maximize SNR between the targets and surrounding clutter.
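As a rough illustration of the selection rule, the sketch below scores each feature channel by one common SNR definition (squared mean difference over summed variance) and picks the best cue. The feature names and response distributions are invented, and the paper's exact SNR formulation may differ.

```python
# Toy cue selection: pick the feature with the highest target-vs-distractor SNR.
import numpy as np

def snr(target_responses, distractor_responses):
    """Discriminability of one feature channel."""
    mt, md = np.mean(target_responses), np.mean(distractor_responses)
    vt, vd = np.var(target_responses), np.var(distractor_responses)
    return (mt - md) ** 2 / (vt + vd + 1e-12)

def best_cue(target_by_feature, distractor_by_feature):
    """Both arguments: dict mapping feature name -> array of responses."""
    scores = {f: snr(target_by_feature[f], distractor_by_feature[f])
              for f in target_by_feature}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(1)
cue, scores = best_cue(
    {"color": rng.normal(2.0, 0.5, 100), "orientation": rng.normal(0.2, 1.0, 100)},
    {"color": rng.normal(0.0, 0.5, 100), "orientation": rng.normal(0.0, 1.0, 100)})
print(cue, scores)  # "color" wins: larger mean separation, lower variance
```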

6 0.49897259 203 nips-2005-Visual Encoding with Jittering Eyes

7 0.45715609 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

8 0.45080242 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

9 0.44652906 151 nips-2005-Pattern Recognition from One Example by Chopping

10 0.44363245 34 nips-2005-Bayesian Surprise Attracts Human Attention

11 0.42450017 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

12 0.42437926 169 nips-2005-Saliency Based on Information Maximization

13 0.41835961 30 nips-2005-Assessing Approximations for Gaussian Process Classification

14 0.41754937 98 nips-2005-Infinite latent feature models and the Indian buffet process

15 0.4137519 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence

16 0.40163755 28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions

17 0.40147957 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise

18 0.39546019 35 nips-2005-Bayesian model learning in human visual perception

19 0.39042851 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects

20 0.38984227 181 nips-2005-Spiking Inputs to a Winner-take-all Network