nips nips2004 nips2004-53 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dashan Gao, Nuno Vasconcelos
Abstract: Saliency mechanisms play an important role when visual recognition must be performed in cluttered scenes. We propose a computational definition of saliency that deviates from existing models by equating saliency to discrimination. In particular, the salient attributes of a given visual class are defined as the features that enable best discrimination between that class and all other classes of recognition interest. It is shown that this definition leads to saliency algorithms of low complexity, that are scalable to large recognition problems, and is compatible with existing models of early biological vision. Experimental results demonstrating success in the context of challenging recognition problems are also presented. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Discriminant Saliency for Visual Recognition from Cluttered Scenes. Dashan Gao, Nuno Vasconcelos. Department of Electrical and Computer Engineering, University of California, San Diego. Abstract: Saliency mechanisms play an important role when visual recognition must be performed in cluttered scenes. [sent-1, score-0.179]
2 We propose a computational definition of saliency that deviates from existing models by equating saliency to discrimination. [sent-2, score-1.807]
3 In particular, the salient attributes of a given visual class are defined as the features that enable best discrimination between that class and all other classes of recognition interest. [sent-3, score-0.591]
4 It is shown that this definition leads to saliency algorithms of low complexity that are scalable to large recognition problems, and that it is compatible with existing models of early biological vision. [sent-4, score-1.109]
5 Experimental results demonstrating success in the context of challenging recognition problems are also presented. [sent-5, score-0.077]
6 1 Introduction The formulation of recognition as a problem of statistical classification has enabled significant progress in the area, over the last decades. [sent-6, score-0.077]
7 classifiers that achieve high recognition rates in the absence of a few factors that stress their robustness (e. [sent-10, score-0.099]
8 Recent advances have also shown that real-time recognition is possible on low-end hardware [1]. [sent-14, score-0.077]
9 Given all this progress, it appears that one of the fundamental barriers remaining in the path to a vision of scalable recognition systems, capable of dealing with large numbers of visual classes, is an issue that has not traditionally been considered as problematic: training complexity. [sent-15, score-0.246]
10 Typically these training sets are large (in the order of thousands of examples per class) and require a combination of 1) painstaking manual labor of image inspection and segmentation of good examples (e. [sent-17, score-0.096]
11 X is the person sitting at the end of the room, the one with gray hair”), current face detectors require training faces to be cropped into … (Figure 1: (a)(b)(c) various challenging examples for current saliency detectors; (d) the DCT basis functions.) [sent-23, score-1.05]
12 From left to right, top to bottom, detectors of: edges, bars, corners, t-junctions, spots, flow patches, and checkerboards. [sent-26, score-0.083]
13 One property of biological vision that plays an important role in this ability to learn from highly cluttered examples is the existence of saliency mechanisms. [sent-28, score-1.033]
14 Instead, salient locations simply pop out as a result of the operation of pre-recognition saliency mechanisms. [sent-30, score-1.137]
15 While saliency has been the subject of significant study in computer vision, most formulations do not pose saliency itself as a major goal of recognition. [sent-31, score-1.815]
16 Instead saliency is usually an auxiliary pre-filtering step, whose goal is to reduce computation by eliminating image locations that can be universally classified as non-salient, according to some definition of saliency which is completely divorced from the particular recognition problem at hand. [sent-32, score-1.958]
17 In this work, we propose an alternative definition of saliency, which we denote by discriminant saliency, and which is intrinsically grounded in the recognition problem. [sent-33, score-0.17]
18 This new formulation is based on the intuition that, for recognition, the salient features of a visual class are those that best distinguish it from all other visual classes of recognition interest. [sent-34, score-0.54]
19 We present experimental results demonstrating success on image databases containing complex scenes and substantial amounts of clutter. [sent-36, score-0.087]
20 2 Saliency detection The extraction of salient points from images has been a subject of research for at least a few decades. [sent-37, score-0.296]
21 Broadly speaking, saliency detectors can be divided into three major classes. [sent-38, score-0.99]
22 The first, and most popular, treats the problem as one of the detection of specific visual attributes. [sent-39, score-0.12]
23 These are usually edges or corners (also called “interest points”) [2] and their detection has roots in the structure-from-motion literature, but there have also been proposals for other low-level visual attributes such as contours [3]. [sent-40, score-0.301]
24 A major limitation of these saliency detectors is that they do not generalize well. [sent-41, score-0.99]
25 For example, a corner detector will always produce a stronger response in a region that is strongly textured than in a smooth region, even though textured surfaces are not necessarily more salient than smooth ones. [sent-42, score-0.413]
26 While a corner detector would respond strongly to the highly textured regions of leaves and tree branches, it is not clear that these are more salient than the smooth apple. [sent-44, score-0.384]
27 One idea that has recently gained some popularity is to define saliency as image complexity. [sent-47, score-0.956]
28 The main advantage of these data-driven definitions of saliency is a significantly greater flexibility, as they could detect any of the low-level attributes above (corners, contours, smooth edges, etc.) [sent-50, score-0.958]
29 It is not clear, however, that saliency can always be equated with complexity. [sent-52, score-0.891]
30 On the contrary, the much less complex image regions containing the bird or the egg appear to be significantly more salient. [sent-54, score-0.129]
31 Finally, a third formulation is to start from models of biological vision, and derive saliency detection algorithms from these models [7]. [sent-55, score-0.983]
32 This formulation has the appeal of its roots in what are the only known fully functioning vision systems, and it has been shown to lead to interesting saliency behavior [7]. [sent-56, score-0.984]
33 3 Discriminant saliency The basic intuition for discriminant saliency is somewhat of a “statement of the obvious”: the salient attributes of a given visual concept are the attributes that most distinguish it from all other visual concepts that may be of possible interest. [sent-59, score-2.312]
34 While close to obvious, this definition is a major departure from all existing definitions in the vision literature. [sent-60, score-0.107]
35 tracking or structure-from-motion where many of the current saliency detectors have roots [2]), it is an intrinsic component of the recognition problem: the set of visual classes to be recognized. [sent-64, score-1.169]
36 It therefore makes saliency contingent upon the existence of a collection of classes and, consequently, impossible to compute from an isolated image. [sent-65, score-0.928]
37 It also means that, for a given object, different visual attributes will be salient in different recognition contexts. [sent-66, score-0.393]
38 For example, while contours and shape will be most salient to distinguish a red apple from a red car, color and texture will be most salient when the same apple is compared to an orange. [sent-67, score-0.606]
39 Second, it makes distinguishing between classes the goal of saliency. [sent-69, score-0.891]
40 This implies that the optimality criterion for the design of salient features is discrimination, and therefore very different from traditional criteria such as repetitiveness under image transformations [8]. [sent-70, score-0.398]
41 Robustness in terms of these criteria (which, once again, are well justified for tracking but do not address the essence of the recognition problem) can be learned if needed to achieve discriminant goals [9]. [sent-71, score-0.196]
42 Due to this equivalence between saliency and discrimination, the principle of discriminant saliency can be easily translated into an optimality criterion for the design of saliency algorithms. [sent-72, score-2.887]
43 In particular, it is naturally formulated as an optimal feature selection problem: optimal features for saliency are the most discriminant features for the one-vs-all classification problem that opposes the class of interest to all remaining classes. [sent-73, score-1.197]
44 Or, in other words, the most salient features are the ones that best separate the class of interest from all others. [sent-74, score-0.333]
45 Given the well known equivalence between features and image filters, this can also be seen as a problem of designing optimal filters for discrimination. [sent-75, score-0.123]
46 It is therefore important to adopt feature selection techniques that are computationally efficient, preferably reusing computation from the design of one classifier to the next. [sent-78, score-0.087]
47 The design of such feature selection methods is a non-trivial problem, which we have been actively pursuing in the context of research in feature selection itself [11]. [sent-79, score-0.14]
48 Our experience of applying algorithms in this family to the saliency detection problem is that even those strongly biased towards efficiency can consistently select good saliency detection filters. [sent-81, score-1.939]
49 This is illustrated by all the results presented in this paper, where we have adopted the maximization of marginal diversity (MMD) [10] as the guiding principle for feature selection. [sent-82, score-0.128]
50 Given a classification problem with class labels Y, prior class probabilities P_Y(i), and a set of n features X = (X_1, . . . , X_n) such that the probability density of X_k given class i is P_{X_k|Y}(x|i), the marginal diversity (MD) of feature X_k is [sent-83, score-0.1]
51 md(X_k) = ⟨ KL[ P_{X_k|Y}(x|i) || P_{X_k}(x) ] ⟩_Y, (1) where ⟨ f(i) ⟩_Y = Σ_{i=1}^{M} P_Y(i) f(i) and KL[p||q] = ∫ p(x) log( p(x)/q(x) ) dx is the Kullback-Leibler divergence between p and q. [sent-86, score-0.116]
52 Furthermore, in the one-vs-all classification scenario, the histogram of the “all” class can be obtained by a weighted average of the class-conditional histograms of the image classes that it contains, i.e. [sent-88, score-0.224]
53 P_{X_k|Y}(x|A) = Σ_{i∈A} P_{X_k|Y}(x|i) P_Y(i), (2) where A is the set of image classes that compose the “all” class. [sent-90, score-0.102]
54 This implies that the bulk of the computation, the density estimation step, only has to be performed once for the design of all saliency detectors. [sent-91, score-0.925]
55 ) it has been known that the earliest stages of biological vision can be modeled as a multiresolution image decomposition followed by some type of non-linearity. [sent-96, score-0.177]
56 Indeed, various “biologically plausible” models of early vision are based on this principle [12]. [sent-97, score-0.127]
57 The equivalence between saliency detection and the design of optimally discriminant filters makes discriminant saliency compatible with most of these models. [sent-98, score-2.11]
58 In fact, as detailed in the experimental section, our experience is that remarkably simple mechanisms, inspired by biological vision, are sufficient to achieve good saliency results. [sent-99, score-0.942]
59 a function describing the saliency at each pixel location) is obtained by pooling the responses of the different saliency filters after half-wave rectification: S(x, y) = Σ_{i=1}^{2n} ω_i R_i²(x, y), (3) where S(x, y) is the saliency at location (x, y) and R_i(x, y), i = 1, [sent-103, score-2.673]
60 . . . , 2n, are the channels resulting from half-wave rectification of the outputs of the saliency filters F_i(x, y), i = 1, . . . , n. [sent-106, score-0.891]
61 Second, the saliency map is fed to a peak detection module that consists of a winner-take-all network. [sent-110, score-0.977]
62 Its spatial scale is set to the size of the region of support of the saliency filter with strongest response at that location. [sent-112, score-0.907]
63 The procedure is illustrated by Figure 2 (caption: Schematic of the saliency detection model), and produces [sent-114, score-0.978]
64 a list of salient locations, their saliency strengths, and scales. [sent-115, score-1.103]
65 It is important to limit the number of channels that contribute to the saliency map since, for any given class, there are usually many features which are not discriminant but have strong response at various image locations (e. [sent-116, score-1.165]
66 While the precise set of features is likely not to be crucial for the quality of the saliency results (e. [sent-122, score-0.93]
67 First, previous experience has shown that they perform quite well in large scale recognition problems [14]. [sent-125, score-0.118]
68 Second, as illustrated by Figure 1(d), the DCT basis functions contain detectors for various perceptually relevant low-level image attributes, including edges, bars, corners, t-junctions, spots, etc. [sent-126, score-0.192]
69 This can obviously only make the saliency detection process easier. [sent-127, score-0.957]
70 4 Results and discussion We start the experimental evaluation of discriminant saliency with some results from the Brodatz texture database, which illustrate various interesting properties of the approach. [sent-128, score-1.067]
71 The training database was used to determine the salient features of each class, and saliency maps were then computed on the test images. [sent-135, score-1.221]
72 The process was repeated for all texture classes, in a one-vs-all setting (class of interest against all others), with each class sequentially considered as the “one” class. [sent-136, score-0.142]
73 As illustrated by the examples shown in Figure 3, none of the challenges posed by Brodatz seem very problematic for discriminant saliency. [sent-137, score-0.135]
74 This robustness is a consequence of the fact that the saliency features are tuned to discriminate the class of interest from the rest. [sent-139, score-1.012]
75 We next show that it can lead to significantly better saliency detection performance than that achievable with the algorithms currently available in the literature. [sent-140, score-0.957]
76 Figure 3: Saliency maps (bottom row) obtained on various textures (shown in top row) from Brodatz. [sent-141, score-0.091]
77 Note: the saliency maps of the second row are best viewed on paper. [sent-143, score-0.966]
78 2 Comparison to existing methods While the results of the previous section provide interesting anecdotal evidence in support of discriminant saliency, objective conclusions can only be drawn by comparison to existing techniques. [sent-165, score-0.143]
79 Unfortunately, it is not always straightforward to compare saliency detectors objectively by simple inspection of saliency maps, since different people frequently attribute different degrees of saliency to a given image region. [sent-166, score-2.852]
80 To avoid the obvious biases inherent to a subjective evaluation of saliency maps, we tried to design an experiment that could lead to an objective comparison. [sent-170, score-0.925]
81 The goal was to quantify whether the saliency maps produced by the different techniques contained enough information for recognition. [sent-171, score-0.941]
82 If, when applied to an image, a saliency detector has an output which is highly correlated with the presence/absence of the class of interest in that image, then it should be possible to classify the image (as belonging/not belonging to the class) by classifying the saliency map itself. [sent-173, score-2.017]
83 We then built the simplest possible saliency map classifier that we could conceive of: the intensity values of the saliency map were histogrammed and fed to a support vector machine (SVM) classifier. [sent-174, score-1.822]
84 We compared the performance of the discriminant saliency detector (DSD) described above, with one representative from each of the areas of the literature discussed in section 2: the Harris saliency detector (HSD) and the scale saliency detector (SSD) of [6]. [sent-175, score-2.947]
85 To evaluate performance on a generic recognition scenario, we adopted the Caltech database, using the experimental setup proposed in [15]. [sent-176, score-0.098]
86 For the simple classifier, we reduced the luminance component of each image to a vector (by stacking all pixels into a column) and used an SVM to classify the resulting set of points. [sent-178, score-0.082]
87 All parameters were set to assure a fair comparison between the saliency detectors (e. [sent-179, score-0.974]
88 a multiscale version of Harris was employed, all detectors combined information from three scales, etc. [sent-181, score-0.083]
89 Table 1 presents the two benchmarks and the results of classifying the saliency histograms generated by the three detectors. [sent-183, score-0.932]
90 Figure 4: Original images (top row), saliency maps generated by DSD (second row), and a comparison of salient locations detected by: DSD in the third row, SSD in the fourth, and HSD at the bottom. [sent-185, score-1.205]
91 Note: the saliency maps of the second row are best viewed on paper. [sent-187, score-0.966]
92 First, both the HSD and the SSD have very poor performance, indicating that they produce saliency maps that have weak correlation with the presence/absence of the class of interest in the image to classify. [sent-192, score-1.088]
93 While the question of interest here is “is class x present in the image or not? [sent-197, score-0.147]
94 In any case, these results seem to support the claim that DSD produces saliency maps which contain most of the saliency information required for classification. [sent-201, score-1.832]
95 The issue of translating these saliency maps into a combined segmentation/recognition solution will be addressed in future research. [sent-202, score-0.941]
96 Finally, the superiority of the DSD over the other two saliency detectors considered in this experiment is also clearly supported by the inspection of the resulting salient locations. [sent-203, score-1.217]
97 Note that, while the training examples from each class are not carefully segmented (and can contain large areas of clutter), the working assumption is that each image is labeled with respect to the presence or absence in it of the class of interest. [sent-207, score-0.221]
98 separation of the pixels containing objects in the class and pixels of background) takes place. [sent-210, score-0.084]
99 Visual inspection of the saliency detection results obtained with feature sets within this range showed no substantial differences with respect to those obtained with the optimal feature set. [sent-217, score-1.046]
100 Structural saliency: the detection of globally salient structures using a locally connected network. [sent-232, score-0.278]
wordName wordTfidf (topN-words)
[('saliency', 0.891), ('salient', 0.212), ('dsd', 0.103), ('discriminant', 0.093), ('detectors', 0.083), ('recognition', 0.077), ('vision', 0.066), ('detection', 0.066), ('image', 0.065), ('pxk', 0.064), ('texture', 0.06), ('visual', 0.054), ('classi', 0.052), ('dct', 0.051), ('hsd', 0.051), ('attributes', 0.05), ('class', 0.05), ('maps', 0.05), ('detector', 0.049), ('scalable', 0.049), ('textured', 0.047), ('ssd', 0.045), ('corners', 0.041), ('contours', 0.039), ('features', 0.039), ('classes', 0.037), ('design', 0.034), ('locations', 0.034), ('apple', 0.033), ('brodatz', 0.033), ('er', 0.033), ('interest', 0.032), ('cluttered', 0.031), ('inspection', 0.031), ('constellation', 0.031), ('harris', 0.031), ('database', 0.029), ('feature', 0.029), ('lters', 0.027), ('faces', 0.027), ('roots', 0.027), ('py', 0.027), ('biological', 0.026), ('criteria', 0.026), ('bird', 0.026), ('sebe', 0.026), ('experience', 0.025), ('existing', 0.025), ('md', 0.025), ('row', 0.025), ('edges', 0.024), ('corner', 0.024), ('selection', 0.024), ('various', 0.023), ('compatible', 0.023), ('clutter', 0.023), ('optimality', 0.022), ('absence', 0.022), ('egg', 0.022), ('cropped', 0.022), ('crt', 0.022), ('histograms', 0.022), ('scenes', 0.022), ('discrimination', 0.022), ('marginal', 0.021), ('illustrated', 0.021), ('problematic', 0.021), ('adopted', 0.021), ('multiresolution', 0.02), ('vasconcelos', 0.02), ('kadir', 0.02), ('motorbikes', 0.02), ('spots', 0.02), ('object', 0.02), ('gradients', 0.02), ('principle', 0.02), ('map', 0.02), ('bars', 0.019), ('highly', 0.019), ('benchmarks', 0.019), ('hair', 0.019), ('equivalence', 0.019), ('cation', 0.019), ('images', 0.018), ('early', 0.018), ('textures', 0.018), ('areas', 0.018), ('distinguish', 0.017), ('smooth', 0.017), ('formulations', 0.017), ('mechanisms', 0.017), ('pixels', 0.017), ('segmented', 0.016), ('fk', 0.016), ('xk', 0.016), ('major', 0.016), ('scale', 0.016), ('regions', 0.016), ('diversity', 0.016), ('viewing', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
Author: Dashan Gao, Nuno Vasconcelos
Abstract: Saliency mechanisms play an important role when visual recognition must be performed in cluttered scenes. We propose a computational definition of saliency that deviates from existing models by equating saliency to discrimination. In particular, the salient attributes of a given visual class are defined as the features that enable best discrimination between that class and all other classes of recognition interest. It is shown that this definition leads to saliency algorithms of low complexity, that are scalable to large recognition problems, and is compatible with existing models of early biological vision. Experimental results demonstrating success in the context of challenging recognition problems are also presented. 1
2 0.19601142 21 nips-2004-An Information Maximization Model of Eye Movements
Author: Laura W. Renninger, James M. Coughlan, Preeti Verghese, Jitendra Malik
Abstract: We propose a sequential information maximization model as a general strategy for programming eye movements. The model reconstructs high-resolution visual information from a sequence of fixations, taking into account the fall-off in resolution from the fovea to the periphery. From this framework we get a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that minimizes uncertainty (maximizes information) about the stimulus. By comparing our model performance to human eye movement data and to predictions from a saliency and random model, we demonstrate that our model is best at predicting fixation locations. Modeling additional biological constraints will improve the prediction of fixation sequences. Our results suggest that information maximization is a useful principle for programming eye movements. 1 In trod u ction Since the earliest recordings [1, 2], vision researchers have sought to understand the non-random yet idiosyncratic behavior of volitional eye movements. To do so, we must not only unravel the bottom-up visual processing involved in selecting a fixation location, but we must also disentangle the effects of top-down cognitive factors such as task and prior knowledge. Our ability to predict volitional eye movements provides a clear measure of our understanding of biological vision. One approach to predicting fixation locations is to propose that the eyes move to points that are “salient”. Salient regions can be found by looking for centersurround contrast in visual channels such as color, contrast and orientation, among others [3, 4]. Saliency has been shown to correlate with human fixation locations when observers “look around” an image [5, 6] but it is not clear if saliency alone can explain why some locations are chosen over others and in what order. Task as well as scene or object knowledge will play a role in constraining the fixation locations chosen [7]. Observations such as this led to the scanpath theory, which proposed that eye movement sequences are tightly linked to both the encoding and retrieval of specific object memories [8]. 1.1 Our Approach We propose that during natural, active vision, we center our fixation on the most informative points in an image in order to reduce our overall uncertainty about what we are looking at. This approach is intuitive and may be biologically plausible, as outlined by Lee & Yu [9]. The most informative point will depend on both the observer’s current knowledge of the stimulus and the task. The quality of the information gathered with each fixation will depend greatly on human visual resolution limits. This is the reason we must move our eyes in the first place, yet it is often ignored. A sequence of eye movements may then be understood within a framework of sequential information maximization. 2 Human eye movements We investigated how observers examine a novel shape when they must rely heavily on bottom-up stimulus information. Because eye movements will be affected by the task of the observer, we constructed a learn-discriminate paradigm. Observers are asked to carefully study a shape and then discriminate it from a highly similar one. 2.1 Stimuli and Design We use novel silhouettes to reduce the influence of object familiarity on the pattern of eye movements and to facilitate our computations of information in the model. Each silhouette subtends 12.5º to ensure that its entire shape cannot be characterized with a single fixation. 
During the learning phase, subjects first fixated a marker and then pressed a button to cue the appearance of the shape which appeared 10º to the left or right of fixation. Subjects maintained fixation for 300ms, allowing for a peripheral preview of the object. When the fixation marker disappeared, subjects were allowed to study the object for 1.2 seconds while their eye movements were recorded. During the discrimination phase, subjects were asked to select the shape they had just studied from a highly similar shape pair (Figure 1). Performance was near 75% correct, indicating that the task was challenging yet feasible. Subjects saw 140 shapes and given auditory feedback. release fixation, view object freely (1200ms) maintain fixation (300ms) Which shape is a match? fixate, initiate trial Figure 1. Temporal layout of a trial during the learning phase (left). Discrimination of learned shape from a highly similar one (right). 2.2 Apparatus Right eye position was measured with an SRI Dual Purkinje Image eye tracker while subjects viewed the stimulus binocularly. Head position was fixed with a bitebar. A 25 dot grid that covered the extent of the presentation field was used for calibration. The points were measured one at a time with each dot being displayed for 500ms. The stimuli were presented using the Psychtoolbox software [10]. 3 Model We wish to create a model that builds a representation of a shape silhouette given imperfect visual information, and which updates its representation as new visual information is acquired. The model will be defined statistically so as to explicitly encode uncertainty about the current knowledge of the shape silhouette. We will use this model to generate a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that will decrease the model’s uncertainty as much as possible. Similar approaches have been described in an ideal observer model for reading [11], an information maximization algorithm for tracking contours in cluttered images [12] and predicting fixation locations during object learning [13]. 3.1 Representing information The information in silhouettes clearly resides at its contour, which we represent with a collection of points and associated tangent orientations. These points and their associated orientations are called edgelets, denoted e1, e2, ... eN, where N is the total number of edgelets along the boundary. Each edgelet ei is defined as a triple ei=(xi, yi, zi) where (xi, yi) is the 2D location of the edgelet and zi is the orientation of the tangent to the boundary contour at that point. zi can assume any of Q possible values 1, 2, …, Q, representing a discretization of Q possible orientations ranging from 0 to π , and we have chosen Q=8 in our experiments. The goal of the model is to infer the most likely orientation values given the visual information provided by one or more fixations. 3.2 Updating knowledge The visual information is based on indirect measurements of the true edgelet values e1, e2, ... eN. Although our model assumes complete knowledge of the number N and locations (xi, yi) of the edgelets, it does not have direct access to the orientations zi.1 Orientation information is instead derived from measurements that summarize the local frequency of occurrence of edgelet orientations, averaged locally over a coarse scale (corresponding to the spatial scale at which resolution is limited by the human visual system). 
These coarse measurements provide indirect information about individual edgelet orientations, which may not uniquely determine the orientations. We will use a simple statistical model to estimate the distribution of individual orientation values conditioned on this information. Our measurements are defined to model the resolution limitations of the human visual system, with highest resolution at the fovea and lower resolution in the 1 Although the visual system does not have precise knowledge of location coordinates, the model is greatly simplified by assuming this knowledge. It is reasonable to expect that location uncertainty will be highly correlated with orientation uncertainty, so that the inclusion of location should not greatly affect the model's decisions of where to fixate next. periphery. Distance to the fovea is r measured as eccentricity E, the visual angle between any point and the fovea. If x = ( x, y ) is the location of a point in an image r and f = ( f x , f y ) is the fixation (i.e. foveal) location in the image then the r r eccentricity is E = x − f , measured in units of visual degrees. The effective resolution of orientation discrimination falls with increasing eccentricity as r (E ) = FPH ( E + E 2 ) where r(E) is an effective radius over which the visual system spatially pools information and FPH =0.1 and E2=0.8 [14]. Our model represents pooled information as a histogram of edge orientations within the effective radius. For each edgelet ei we define the histogram of all edgelet r orientations ej within radius ri = r(E) of ei , where E is the eccentricity of xi = ( xi , yi ) r r r relative to the current fixation f , i.e. E = xi − f . To define the histogram more precisely we will introduce the neighborhood set Ni of all indices j corresponding to r r edgelets within radius ri of ei : N i = all j s.t. xi − x j ≤ ri , with number of { } neighborhood edgelets |Ni|. The (normalized) histogram centered at edgelet ei is then defined as hiz = 1 Ni ∑δ j∈N i z,z j , which is the proportion of edgelet orientations that assume value z in the (eccentricity-dependent) neighborhood of edgelet ei.2 Figure 2. Relation between eccentricity E and radius r(E) of the neighborhood (disk) which defines the local orientation histogram (hiz ). Left and right panels show two fixations for the same object. Up to this point we have restricted ourselves to the case of a single fixation. To designate a sequence of multiple fixations we will index them byrk=1, 2, …, K (for K total fixations). The k th fixation location is denoted by f ( k ) = ( f xk , f yk ) . The quantities ri , Ni and hiz depend on fixation location and so to make this dependence (k explicit we will augment them with superscripts as ri(k ) , N i(k ) , and hiz ) . 2 δ x, y is the Kronecker delta function, defined to equal 1 if x = y and 0 if x ≠ y . Now we describe the statistical model of edgelet orientations given information obtained from multiple fixations. Ideally we would like to model the exact distribution of orientations conditioned on the histogram data: (1) ( 2) (K ) ( , where {hizk ) } represents all histogram P(zi , z 2 , ... z N | {hiz }, {hiz },K, {hiz }) r components z at every edgelet ei for fixation f (k ) . This exact distribution is intractable, so we will use a simple approximation. We assume the distribution factors over individual edgelets: N ( ( ( P(zi , z 2 , ... z N | {hiz1) }, {hiz2 ) },K, {hizK ) }) = ∏ g i(zi ) i =1 where gi(zi) is the marginal distribution of orientation zi. 
Determining these marginal distributions is still difficult even with the factorization assumption, so we K will make an additional approximation: g (z ) = 1 ∏ hiz( k ) , where Zi is a suitable i i Z i k =1 (k ) normalization factor. This approximation corresponds to treating hiz as a likelihood function over z, with independent likelihoods for each fixation k. While the approximation has some undesirable properties (such as making the marginal distribution gi(zi) more peaked if the same fixation is made repeatedly), it provides a simple mechanism for combining histogram evidence from multiple, distinct fixations. 3.3 Selecting the next fixation r ( K +1) Given the past K fixations, the next fixation f is chosen to minimize the model r ( K +1) entropy of the edgelet orientations. In other words, f is chosen to minimize r ( K +1) ( ( ( H( f ) = entropy[ P(zi , z2 , ... z N | {hiz1) }, {hiz2 ) },K, {hizK +1) })] , where the entropy of a distribution P(x) is defined as − ∑ P( x) log P ( x) . In practice, we minimize the x r entropy by evaluating it across a set of candidate locations f ( K +1) which forms a regularly sampled grid across the image.3 We note that this selection rule makes decisions that depend, in general, on the full history of previous K fixations. 4 Results Figure 3 shows an example of one observer’s eye movements superimposed over the shape (top row), the prediction from a saliency model (middle row) [3] and the prediction from the information maximization model (bottom row). The information maximization model updates its prediction after each fixation. An ideal sequence of fixations can be generated by both models. The saliency model selects fixations in order of decreasing salience. The information maximization model selects the maximally informative point after incorporating information from the previous fixations. To provide an additional benchmark, we also implemented a 3 This rule evaluates the entropy resulting from every possible next fixation before making a decision. Although this rule is suitable for our modeling purposes, it would be inefficient to implement in a biological or machine vision system. A practical decision rule would use current knowledge to estimate the expected (rather than actual) entropy. Figure 3. Example eye movement pattern, superimposed over the stimulus (top row), saliency map (middle row) and information maximization map (bottom row). model that selects fixations at random. One way to quantify the performance is to map a subject’s fixations onto the closest model predicted fixation locations, ignoring the sequence in which they were made. In this analysis, both the saliency and information maximization models are significantly better than random at predicting candidate locations (p < 0.05; t-test) for three observers (Figure 4, left). The information maximization model performs slightly but significantly better than the saliency model for two observers (lm, kr). If we match fixation locations while retaining the sequence, errors become quite large, indicating that the models cannot account for the observed behavior (Figure 4, right). Sequence Error Visual Angle (deg) Location Error R S I R S I R S I R S I R S I R S I Figure 4. Prediction error of three models: random (R), saliency (S) and information maximization (I) for three observers (pv, lm, kr). The left panel shows the error in predicting fixation locations, ignoring sequence. The right panel shows the error when sequence is retained before mapping. 
Error bars are 95% confidence intervals. The information maximization model incorporates resolution limitations, but there are further biological constraints that must be considered if we are to build a model that can fully explain human eye movement patterns. First, saccade amplitudes are typically around 2-4º and rarely exceed 15º [15]. When we move our eyes, the image of the visual world is smeared across the retina and our perception of it is actively suppressed [16]. Shorter saccade lengths may be a mechanism to reduce this cost. This biological constraint would cause a fixation to fall short of the prediction if it is distant from the current fixation (Figure 5). Figure 5. Cost of moving the eyes. Successive fixations may fall short of the maximally salient or informative point if it is very distant from the current fixation. Second, the biological system may increase its sampling efficiency by planning a series of saccades concurrently [17, 18]. Several fixations may therefore be made before sampled information begins to influence target selection. The information maximization model currently updates after each fixation. This would create a discrepancy in the prediction of the eye movement sequence (Figure 6). Figure 6. Three fixations are made to a location that is initially highly informative according to the information maximization model. By the fourth fixation, the subject finally moves to the next most informative point. 5 D i s c u s s i on Our model and the saliency model are using the same image information to determine fixation locations, thus it is not surprising that they are roughly similar in their performance of predicting human fixation locations. The main difference is how we decide to “shift attention” or program the sequence of eye movements to these locations. The saliency model uses a winner-take-all and inhibition-of-return mechanism to shift among the salient regions. We take a completely different approach by saying that observers adopt a strategy of sequential information maximization. In effect, the history of where we have been matters because our model is continually collecting information from the stimulus. We have an implicit “inhibition-of-return” because there is little to be gained by revisiting a point. Second, we attempt to take biological resolution limits into account when determining the quality of information gained with each fixation. By including additional biological constraints such as the cost of making large saccades and the natural time course of information update, we may be able to improve our prediction of eye movement sequences. We have shown that the programming of eye movements can be understood within a framework of sequential information maximization. This framework is portable to any image or task. A remaining challenge is to understand how different tasks constrain the representation of information and to what degree observers are able to utilize the information. Acknowledgments Smith-Kettlewell Eye Research Institute, NIH Ruth L. Kirschstein NRSA, ONR #N0001401-1-0890, NSF #IIS0415310, NIDRR #H133G030080, NASA #NAG 9-1461. References [1] Buswell (1935). How people look at pictures. Chicago: The University of Chicago Press. [2] Yarbus (1967). Eye movements and vision. New York: Plenum Press. [3] Itti & Koch (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489-1506. [4] Kadir & Brady (2001). Scale, saliency and image description. 
International Journal of Computer Vision, 45(2), 83-105. [5] Parkhurst, Law, and Niebur (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1), 107-123. [6] Nothdurft (2002). Attention shifts to salient targets. Vision Research, 42, 1287-1306. [7] Oliva, Torralba, Castelhano & Henderson (2003). Top-down control of visual attention in object detection. Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain. [8] Noton & Stark (1971). Scanpaths in eye movements during pattern perception. Science, 171, 308-311. [9] Lee & Yu (2000). An information-theoretic framework for understanding saccadic behaviors. Advanced in Neural Processing Systems, 12, 834-840. [10] Brainard (1997). The psychophysics toolbox. Spatial Vision, 10 (4), 433-436. [11] Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr.Chips 2002: new insights from an ideal-observer model of reading. Vision Research, 42, 2219-2234. [12] Geman & Jedynak (1996). An active testing model for tracking roads in satellite images. IEEE Trans. Pattern Analysis and Machine Intel, 18(1), 1-14. [13] Renninger & Malik (2004). Sequential information maximization can explain eye movements in an object learning task. Journal of Vision, 4(8), 744a. [14] Levi, Klein & Aitesbaomo (1985). Vernier acuity, crowding and cortical magnification. Vision Research, 25(7), 963-977. [15] Bahill, Adler & Stark (1975). Most naturally occurring human saccades have magnitudes of 15 degrees or less. Investigative Ophthalmology, 14, 468-469. [16] Burr, Morrone & Ross (1994). Selective suppression of the magnocellular visual pathway during saccadic eye movements. Nature, 371, 511-513. [17] Caspi, Beutter & Eckstein (2004). The time course of visual information accrual guiding eye movement decisions. Proceedings of the Nat’l Academy of Science, 101(35), 13086-90. [18] McPeek, Skavenski & Nakayama (2000). Concurrent processing of saccades in visual search. Vision Research, 40, 2499-2516.
3 0.10401078 99 nips-2004-Learning Hyper-Features for Visual Identification
Author: Andras D. Ferencz, Erik G. Learned-miller, Jitendra Malik
Abstract: We address the problem of identifying specific instances of a class (cars) from a set of images all belonging to that class. Although we cannot build a model for any particular instance (as we may be provided with only one “training” example of it), we can use information extracted from observing other members of the class. We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. 1
4 0.071068071 40 nips-2004-Common-Frame Model for Object Recognition
Author: Pierre Moreels, Pietro Perona
Abstract: A generative probabilistic model for objects in images is presented. An object consists of a constellation of features. Feature appearance and pose are modeled probabilistically. Scene images are generated by drawing a set of objects from a given database, with random clutter sprinkled on the remaining image surface. Occlusion is allowed. We study the case where features from the same object share a common reference frame. Moreover, parameters for shape and appearance densities are shared across features. This is to be contrasted with previous work on probabilistic ‘constellation’ models where features depend on each other, and each feature and model have different pose and appearance statistics [1, 2]. These two differences allow us to build models containing hundreds of features, as well as to train each model from a single example. Our model may also be thought of as a probabilistic revisitation of Lowe’s model [3, 4]. We propose an efficient entropy-minimization inference algorithm that constructs the best interpretation of a scene as a collection of objects and clutter. We test our ideas with experiments on two image databases. We compare with Lowe’s algorithm and demonstrate better performance, in particular in presence of large amounts of background clutter.
5 0.063550867 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
6 0.062939025 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
7 0.058751766 44 nips-2004-Conditional Random Fields for Object Recognition
8 0.057502884 192 nips-2004-The power of feature clustering: An application to object detection
9 0.05409297 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
10 0.052178886 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models
11 0.04767435 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition
12 0.047324989 68 nips-2004-Face Detection --- Efficient and Rank Deficient
13 0.044269983 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
14 0.040650152 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
15 0.039831284 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
16 0.038688563 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
17 0.037042975 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
18 0.036759835 83 nips-2004-Incremental Learning for Visual Tracking
19 0.036616951 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
20 0.036333539 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
topicId topicWeight
[(0, -0.123), (1, 0.028), (2, -0.026), (3, -0.094), (4, 0.075), (5, 0.094), (6, 0.082), (7, -0.083), (8, -0.01), (9, 0.026), (10, -0.063), (11, 0.052), (12, 0.027), (13, -0.026), (14, -0.008), (15, -0.029), (16, -0.048), (17, 0.006), (18, 0.024), (19, -0.025), (20, 0.077), (21, 0.009), (22, 0.049), (23, -0.063), (24, 0.063), (25, 0.017), (26, 0.082), (27, -0.002), (28, -0.073), (29, 0.022), (30, -0.102), (31, -0.067), (32, 0.01), (33, -0.078), (34, 0.183), (35, -0.069), (36, 0.002), (37, -0.145), (38, 0.151), (39, 0.16), (40, 0.239), (41, -0.198), (42, -0.16), (43, -0.277), (44, -0.21), (45, -0.122), (46, -0.023), (47, -0.033), (48, -0.024), (49, 0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.94026786 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
Author: Dashan Gao, Nuno Vasconcelos
Abstract: Saliency mechanisms play an important role when visual recognition must be performed in cluttered scenes. We propose a computational definition of saliency that deviates from existing models by equating saliency to discrimination. In particular, the salient attributes of a given visual class are defined as the features that enable best discrimination between that class and all other classes of recognition interest. It is shown that this definition leads to saliency algorithms of low complexity, that are scalable to large recognition problems, and is compatible with existing models of early biological vision. Experimental results demonstrating success in the context of challenging recognition problems are also presented. 1
2 0.86122233 21 nips-2004-An Information Maximization Model of Eye Movements
Author: Laura W. Renninger, James M. Coughlan, Preeti Verghese, Jitendra Malik
Abstract: We propose a sequential information maximization model as a general strategy for programming eye movements. The model reconstructs high-resolution visual information from a sequence of fixations, taking into account the fall-off in resolution from the fovea to the periphery. From this framework we get a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that minimizes uncertainty (maximizes information) about the stimulus. By comparing our model performance to human eye movement data and to predictions from a saliency and random model, we demonstrate that our model is best at predicting fixation locations. Modeling additional biological constraints will improve the prediction of fixation sequences. Our results suggest that information maximization is a useful principle for programming eye movements. 1 In trod u ction Since the earliest recordings [1, 2], vision researchers have sought to understand the non-random yet idiosyncratic behavior of volitional eye movements. To do so, we must not only unravel the bottom-up visual processing involved in selecting a fixation location, but we must also disentangle the effects of top-down cognitive factors such as task and prior knowledge. Our ability to predict volitional eye movements provides a clear measure of our understanding of biological vision. One approach to predicting fixation locations is to propose that the eyes move to points that are “salient”. Salient regions can be found by looking for centersurround contrast in visual channels such as color, contrast and orientation, among others [3, 4]. Saliency has been shown to correlate with human fixation locations when observers “look around” an image [5, 6] but it is not clear if saliency alone can explain why some locations are chosen over others and in what order. Task as well as scene or object knowledge will play a role in constraining the fixation locations chosen [7]. Observations such as this led to the scanpath theory, which proposed that eye movement sequences are tightly linked to both the encoding and retrieval of specific object memories [8]. 1.1 Our Approach We propose that during natural, active vision, we center our fixation on the most informative points in an image in order to reduce our overall uncertainty about what we are looking at. This approach is intuitive and may be biologically plausible, as outlined by Lee & Yu [9]. The most informative point will depend on both the observer’s current knowledge of the stimulus and the task. The quality of the information gathered with each fixation will depend greatly on human visual resolution limits. This is the reason we must move our eyes in the first place, yet it is often ignored. A sequence of eye movements may then be understood within a framework of sequential information maximization. 2 Human eye movements We investigated how observers examine a novel shape when they must rely heavily on bottom-up stimulus information. Because eye movements will be affected by the task of the observer, we constructed a learn-discriminate paradigm. Observers are asked to carefully study a shape and then discriminate it from a highly similar one. 2.1 Stimuli and Design We use novel silhouettes to reduce the influence of object familiarity on the pattern of eye movements and to facilitate our computations of information in the model. Each silhouette subtends 12.5º to ensure that its entire shape cannot be characterized with a single fixation. 
During the learning phase, subjects first fixated a marker and then pressed a button to cue the appearance of the shape which appeared 10º to the left or right of fixation. Subjects maintained fixation for 300ms, allowing for a peripheral preview of the object. When the fixation marker disappeared, subjects were allowed to study the object for 1.2 seconds while their eye movements were recorded. During the discrimination phase, subjects were asked to select the shape they had just studied from a highly similar shape pair (Figure 1). Performance was near 75% correct, indicating that the task was challenging yet feasible. Subjects saw 140 shapes and given auditory feedback. release fixation, view object freely (1200ms) maintain fixation (300ms) Which shape is a match? fixate, initiate trial Figure 1. Temporal layout of a trial during the learning phase (left). Discrimination of learned shape from a highly similar one (right). 2.2 Apparatus Right eye position was measured with an SRI Dual Purkinje Image eye tracker while subjects viewed the stimulus binocularly. Head position was fixed with a bitebar. A 25 dot grid that covered the extent of the presentation field was used for calibration. The points were measured one at a time with each dot being displayed for 500ms. The stimuli were presented using the Psychtoolbox software [10]. 3 Model We wish to create a model that builds a representation of a shape silhouette given imperfect visual information, and which updates its representation as new visual information is acquired. The model will be defined statistically so as to explicitly encode uncertainty about the current knowledge of the shape silhouette. We will use this model to generate a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that will decrease the model’s uncertainty as much as possible. Similar approaches have been described in an ideal observer model for reading [11], an information maximization algorithm for tracking contours in cluttered images [12] and predicting fixation locations during object learning [13]. 3.1 Representing information The information in silhouettes clearly resides at its contour, which we represent with a collection of points and associated tangent orientations. These points and their associated orientations are called edgelets, denoted e1, e2, ... eN, where N is the total number of edgelets along the boundary. Each edgelet ei is defined as a triple ei=(xi, yi, zi) where (xi, yi) is the 2D location of the edgelet and zi is the orientation of the tangent to the boundary contour at that point. zi can assume any of Q possible values 1, 2, …, Q, representing a discretization of Q possible orientations ranging from 0 to π , and we have chosen Q=8 in our experiments. The goal of the model is to infer the most likely orientation values given the visual information provided by one or more fixations. 3.2 Updating knowledge The visual information is based on indirect measurements of the true edgelet values e1, e2, ... eN. Although our model assumes complete knowledge of the number N and locations (xi, yi) of the edgelets, it does not have direct access to the orientations zi.1 Orientation information is instead derived from measurements that summarize the local frequency of occurrence of edgelet orientations, averaged locally over a coarse scale (corresponding to the spatial scale at which resolution is limited by the human visual system). 
These coarse measurements provide indirect information about individual edgelet orientations, which may not uniquely determine the orientations. We will use a simple statistical model to estimate the distribution of individual orientation values conditioned on this information. Our measurements are defined to model the resolution limitations of the human visual system, with highest resolution at the fovea and lower resolution in the 1 Although the visual system does not have precise knowledge of location coordinates, the model is greatly simplified by assuming this knowledge. It is reasonable to expect that location uncertainty will be highly correlated with orientation uncertainty, so that the inclusion of location should not greatly affect the model's decisions of where to fixate next. periphery. Distance to the fovea is r measured as eccentricity E, the visual angle between any point and the fovea. If x = ( x, y ) is the location of a point in an image r and f = ( f x , f y ) is the fixation (i.e. foveal) location in the image then the r r eccentricity is E = x − f , measured in units of visual degrees. The effective resolution of orientation discrimination falls with increasing eccentricity as r (E ) = FPH ( E + E 2 ) where r(E) is an effective radius over which the visual system spatially pools information and FPH =0.1 and E2=0.8 [14]. Our model represents pooled information as a histogram of edge orientations within the effective radius. For each edgelet ei we define the histogram of all edgelet r orientations ej within radius ri = r(E) of ei , where E is the eccentricity of xi = ( xi , yi ) r r r relative to the current fixation f , i.e. E = xi − f . To define the histogram more precisely we will introduce the neighborhood set Ni of all indices j corresponding to r r edgelets within radius ri of ei : N i = all j s.t. xi − x j ≤ ri , with number of { } neighborhood edgelets |Ni|. The (normalized) histogram centered at edgelet ei is then defined as hiz = 1 Ni ∑δ j∈N i z,z j , which is the proportion of edgelet orientations that assume value z in the (eccentricity-dependent) neighborhood of edgelet ei.2 Figure 2. Relation between eccentricity E and radius r(E) of the neighborhood (disk) which defines the local orientation histogram (hiz ). Left and right panels show two fixations for the same object. Up to this point we have restricted ourselves to the case of a single fixation. To designate a sequence of multiple fixations we will index them byrk=1, 2, …, K (for K total fixations). The k th fixation location is denoted by f ( k ) = ( f xk , f yk ) . The quantities ri , Ni and hiz depend on fixation location and so to make this dependence (k explicit we will augment them with superscripts as ri(k ) , N i(k ) , and hiz ) . 2 δ x, y is the Kronecker delta function, defined to equal 1 if x = y and 0 if x ≠ y . Now we describe the statistical model of edgelet orientations given information obtained from multiple fixations. Ideally we would like to model the exact distribution of orientations conditioned on the histogram data: (1) ( 2) (K ) ( , where {hizk ) } represents all histogram P(zi , z 2 , ... z N | {hiz }, {hiz },K, {hiz }) r components z at every edgelet ei for fixation f (k ) . This exact distribution is intractable, so we will use a simple approximation. We assume the distribution factors over individual edgelets: N ( ( ( P(zi , z 2 , ... z N | {hiz1) }, {hiz2 ) },K, {hizK ) }) = ∏ g i(zi ) i =1 where gi(zi) is the marginal distribution of orientation zi. 
Determining these marginal distributions is still difficult even with the factorization assumption, so we make an additional approximation:
$g_i(z_i) = \frac{1}{Z_i} \prod_{k=1}^{K} h_{i z_i}^{(k)}$,
where $Z_i$ is a suitable normalization factor. This approximation corresponds to treating $h_{iz}^{(k)}$ as a likelihood function over $z$, with independent likelihoods for each fixation $k$. While the approximation has some undesirable properties (such as making the marginal distribution $g_i(z_i)$ more peaked if the same fixation is made repeatedly), it provides a simple mechanism for combining histogram evidence from multiple, distinct fixations.

3.3 Selecting the next fixation
Given the past $K$ fixations, the next fixation $\vec{f}^{(K+1)}$ is chosen to minimize the model entropy of the edgelet orientations. In other words, $\vec{f}^{(K+1)}$ is chosen to minimize
$H(\vec{f}^{(K+1)}) = \mathrm{entropy}\big[\, P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K+1)}\}) \,\big]$,
where the entropy of a distribution $P(x)$ is defined as $-\sum_x P(x) \log P(x)$. In practice, we minimize the entropy by evaluating it across a set of candidate locations $\vec{f}^{(K+1)}$ that form a regularly sampled grid across the image. (Footnote 3: This rule evaluates the entropy resulting from every possible next fixation before making a decision. Although it is suitable for our modeling purposes, it would be inefficient to implement in a biological or machine vision system; a practical decision rule would use current knowledge to estimate the expected, rather than actual, entropy.) We note that this selection rule makes decisions that depend, in general, on the full history of the previous $K$ fixations.

4 Results
Figure 3 shows an example of one observer's eye movements superimposed over the shape (top row), the prediction from a saliency model [3] (middle row), and the prediction from the information maximization model (bottom row). The information maximization model updates its prediction after each fixation. An ideal sequence of fixations can be generated by both models: the saliency model selects fixations in order of decreasing salience, while the information maximization model selects the maximally informative point after incorporating information from the previous fixations. To provide an additional benchmark, we also implemented a model that selects fixations at random.

Figure 3. Example eye movement pattern, superimposed over the stimulus (top row), saliency map (middle row), and information maximization map (bottom row).

One way to quantify performance is to map a subject's fixations onto the closest model-predicted fixation locations, ignoring the sequence in which they were made. In this analysis, both the saliency and information maximization models are significantly better than random at predicting candidate locations (p < 0.05; t-test) for three observers (Figure 4, left). The information maximization model performs slightly but significantly better than the saliency model for two observers (lm, kr). If we match fixation locations while retaining the sequence, errors become quite large, indicating that neither model can account for the observed sequential behavior (Figure 4, right).

Figure 4. Prediction error (visual angle, in degrees) of three models: random (R), saliency (S), and information maximization (I) for three observers (pv, lm, kr). The left panel shows the error in predicting fixation locations, ignoring sequence. The right panel shows the error when the sequence is retained before mapping.
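Returning to the selection rule of Section 3.3, the following is a minimal sketch of how the factored marginals and the entropy-minimizing choice of the next fixation could be computed. It reuses the orientation_histograms helper from the previous sketch; the candidate grid, function names, and the small epsilon used to avoid zero products are our own assumptions rather than details from the paper.

```python
import numpy as np

def marginals(hist_stack, eps=1e-12):
    """g_i(z) proportional to prod_k h_iz^(k): combine histograms from K fixations.

    hist_stack: (K, N, Q) array, one (N, Q) histogram slice per fixation.
    Returns an (N, Q) array of normalized marginal distributions g_i.
    """
    g = np.prod(hist_stack + eps, axis=0)            # product over fixations k
    return g / g.sum(axis=1, keepdims=True)          # divide by the normalizer Z_i

def model_entropy(hist_stack):
    """Entropy of the factored model: sum over edgelets of -sum_z g_i(z) log g_i(z)."""
    g = marginals(hist_stack)
    return -np.sum(g * np.log(g + 1e-12))

def next_fixation(xy, z, past_hists, candidates):
    """Pick the candidate fixation f^(K+1) that minimizes the model entropy.

    past_hists: (K, N, Q) histograms from the K fixations already made.
    candidates: (M, 2) regularly sampled grid of candidate fixation locations.
    """
    best, best_H = None, np.inf
    for f in candidates:
        h_new = orientation_histograms(xy, z, f)                         # h^(K+1)
        H = model_entropy(np.concatenate([past_hists, h_new[None]], axis=0))
        if H < best_H:
            best, best_H = f, H
    return best
```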
Error bars in Figure 4 are 95% confidence intervals.

The information maximization model incorporates resolution limitations, but there are further biological constraints that must be considered if we are to build a model that can fully explain human eye movement patterns. First, saccade amplitudes are typically around 2–4° and rarely exceed 15° [15]. When we move our eyes, the image of the visual world is smeared across the retina and our perception of it is actively suppressed [16]. Shorter saccade lengths may be a mechanism to reduce this cost. This biological constraint would cause a fixation to fall short of the prediction if the predicted location is distant from the current fixation (Figure 5).

Figure 5. Cost of moving the eyes. Successive fixations may fall short of the maximally salient or informative point if it is very distant from the current fixation.

Second, the biological system may increase its sampling efficiency by planning a series of saccades concurrently [17, 18]. Several fixations may therefore be made before sampled information begins to influence target selection. The information maximization model currently updates after each fixation, which would create a discrepancy in the prediction of the eye movement sequence (Figure 6).

Figure 6. Three fixations are made to a location that is initially highly informative according to the information maximization model. Only by the fourth fixation does the subject move to the next most informative point.

5 Discussion
Our model and the saliency model use the same image information to determine fixation locations, so it is not surprising that they perform roughly similarly at predicting human fixation locations. The main difference lies in how the sequence of eye movements to these locations is programmed, that is, how attention is shifted. The saliency model uses a winner-take-all and inhibition-of-return mechanism to shift among the salient regions. We take a completely different approach by proposing that observers adopt a strategy of sequential information maximization. In effect, the history of where we have been matters because our model is continually collecting information from the stimulus; it has an implicit inhibition-of-return because there is little to be gained by revisiting a point. Second, we attempt to take biological resolution limits into account when determining the quality of information gained with each fixation. By including additional biological constraints, such as the cost of making large saccades and the natural time course of information update, we may be able to improve our prediction of eye movement sequences. We have shown that the programming of eye movements can be understood within a framework of sequential information maximization. This framework is portable to any image or task. A remaining challenge is to understand how different tasks constrain the representation of information, and to what degree observers are able to utilize that information.

Acknowledgments
Smith-Kettlewell Eye Research Institute, NIH Ruth L. Kirschstein NRSA, ONR #N0001401-1-0890, NSF #IIS0415310, NIDRR #H133G030080, NASA #NAG 9-1461.

References
[1] Buswell (1935). How people look at pictures. Chicago: The University of Chicago Press.
[2] Yarbus (1967). Eye movements and vision. New York: Plenum Press.
[3] Itti & Koch (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489-1506.
[4] Kadir & Brady (2001). Scale, saliency and image description.
International Journal of Computer Vision, 45(2), 83-105.
[5] Parkhurst, Law & Niebur (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1), 107-123.
[6] Nothdurft (2002). Attention shifts to salient targets. Vision Research, 42, 1287-1306.
[7] Oliva, Torralba, Castelhano & Henderson (2003). Top-down control of visual attention in object detection. Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain.
[8] Noton & Stark (1971). Scanpaths in eye movements during pattern perception. Science, 171, 308-311.
[9] Lee & Yu (2000). An information-theoretic framework for understanding saccadic behaviors. Advances in Neural Information Processing Systems, 12, 834-840.
[10] Brainard (1997). The psychophysics toolbox. Spatial Vision, 10(4), 433-436.
[11] Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr. Chips 2002: new insights from an ideal-observer model of reading. Vision Research, 42, 2219-2234.
[12] Geman & Jedynak (1996). An active testing model for tracking roads in satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1), 1-14.
[13] Renninger & Malik (2004). Sequential information maximization can explain eye movements in an object learning task. Journal of Vision, 4(8), 744a.
[14] Levi, Klein & Aitsebaomo (1985). Vernier acuity, crowding and cortical magnification. Vision Research, 25(7), 963-977.
[15] Bahill, Adler & Stark (1975). Most naturally occurring human saccades have magnitudes of 15 degrees or less. Investigative Ophthalmology, 14, 468-469.
[16] Burr, Morrone & Ross (1994). Selective suppression of the magnocellular visual pathway during saccadic eye movements. Nature, 371, 511-513.
[17] Caspi, Beutter & Eckstein (2004). The time course of visual information accrual guiding eye movement decisions. Proceedings of the National Academy of Sciences, 101(35), 13086-13090.
[18] McPeek, Skavenski & Nakayama (2000). Concurrent processing of saccades in visual search. Vision Research, 40, 2499-2516.
3 0.36184612 99 nips-2004-Learning Hyper-Features for Visual Identification
Author: Andras D. Ferencz, Erik G. Learned-miller, Jitendra Malik
Abstract: We address the problem of identifying specific instances of a class (cars) from a set of images all belonging to that class. Although we cannot build a model for any particular instance (as we may be provided with only one “training” example of it), we can use information extracted from observing other members of the class. We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. 1
4 0.33709905 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
Author: Tanzeem Choudhury, Sumit Basu
Abstract: In this work, we quantitatively investigate the ways in which a given person influences the joint turn-taking behavior in a conversation. After collecting an auditory database of social interactions among a group of twenty-three people via wearable sensors (66 hours of data each over two weeks), we apply speech and conversation detection methods to the auditory streams. These methods automatically locate the conversations, determine their participants, and mark which participant was speaking when. We then model the joint turn-taking behavior as a Mixed-Memory Markov Model [1] that combines the statistics of the individual subjects' self-transitions and the partners' cross-transitions. The mixture parameters in this model describe how much each person's individual behavior contributes to the joint turn-taking behavior of the pair. By estimating these parameters, we thus estimate how much influence each participant has in determining the joint turn-taking behavior. We show how this measure correlates significantly with betweenness centrality [2], an independent measure of an individual's importance in a social network. This result suggests that our estimate of conversational influence is predictive of social influence. 1
5 0.30374339 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
Author: Kumar Chellapilla, Patrice Y. Simard
Abstract: Machine learning is often used to automatically solve human tasks. In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. The harder HIPs use a combination of segmentation and recognition tasks. From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. 1 In t rod u ct i on The OCR problem for high resolution printed text has virtually been solved 10 years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems? To shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. This constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely, Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept’04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. 
The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness. 2 Exa mp les o f H I Ps The HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. Solving the HIP requires identifying all characters in the correct order. The following HIPs can be sampled from the web: Mailblocks: While signing up for free email service with (www.mailblocks.com), you will find HIP challenges of the type: mailblocks MSN: While signing up for free e-mail with MSN Hotmail (www.hotmail.com), you will find HIP challenges of the type: Register.com: While requesting a whois lookup for a domain at www.register.com, you will HIP challenges of the type: Yahoo!/EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! (www.yahoo.com), you will receive HIP challenges of the type: Yahoo! (version 2): Starting in August 2004, Yahoo! introduced their second generation HIP. Three examples are presented below: Ticketmaster: While looking for concert tickets at www.ticketmaster.com, you will receive HIP challenges of the type: Google/Gmail: While signing up for free e-mail with Gmail at www.google.com, one will receive HIP challenges of the type: While solutions to Yahoo HIPs are common English words, those for ticketmaster and Google do not necessarily belong to the English dictionary. They appear to have been created using a phonetic generator [8]. 3 Usi n g ma ch i n e lea rn i n g t o b rea k H IP s Breaking HIPs is not new. Mori and Malik [7] have successfully broken the EZGimpy (92% success) and Gimpy (33% success) HIPs from CMU. Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. We have results for six different HIPs: EZ-Gimpy/Yahoo, Yahoo v2, mailblocks, register, ticketmaster, and Google. To simplify our study, we will not be using language models in our attempt to break HIPs. For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i.e., greater than 0.01% success). HIPs become harder when no language model is used. Similarly, when a HIP uses a language model to generate challenges, success rate of attacks can be significantly improved by incorporating the language model. Further, since the language model is not common to all HIPs studied, it was not used in this paper. Our generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation, or finding the characters, is simple for many HIPs which makes the process of breaking the HIP particularly easy. Gimpy uses a single constant predictable color (black) for letters even though the background color changes. We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Our recognition engine is based on neural networks [6][9]. It yielded a 0.4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs). 
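To make the generic "locate characters, then recognize" pipeline described above concrete, here is a minimal Python sketch of the segmentation step using connected components; the threshold, minimum-area value, and function names are illustrative assumptions, and the recognition step is only indicated in a comment rather than implemented.

```python
import numpy as np
from scipy import ndimage

def segment_characters(gray, threshold=128, min_area=40, n_chars=None):
    """Generic segmentation step: binarize, label connected components,
    and return per-character crops ordered left to right.

    gray: 2D uint8 array holding a grayscale HIP image.
    """
    binary = gray < threshold                      # assume dark glyphs on a light background
    labels, n_components = ndimage.label(binary)   # connected components
    slices = ndimage.find_objects(labels)
    boxes = []
    for i, s in enumerate(slices):
        if np.sum(labels[s] == i + 1) >= min_area: # discard tiny specks of clutter
            boxes.append((s, s[1].start))          # remember the left edge for ordering
    boxes.sort(key=lambda b: b[1])                 # left-to-right reading order
    crops = [gray[s] for s, _ in boxes]
    return crops[:n_chars] if n_chars else crops

# A recognition step would then resize each crop and pass it to a trained
# classifier (e.g., a convolutional network), multiplying the per-character
# confidences to score the full solution string.
```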
For each HIP, we have a segmentation step, followed by a recognition step. It should be stressed that we are not trying to solve every HIP of a given type i.e., our goal is not 100% success rate, but something efficient that can achieve much better than 0.01%. In each of the following experiments, 2500 HIPs were hand labeled and used as follows (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). For each of the five HIPs, a convolution neural network, identical to the one described in [6], was trained and tested on gray level character images centered on the guessed character positions (see below). The trained neural network became the recognizer. 3.1 M a i l b l oc k s To solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and breakup CCs that are too large into two or three adjacent CCs. Further, vertically overlapping half character size CCs are merged. The resulting rough segmentation works most of the time. Here is an example: For instance, in the example above, the NN would be trained, and tested on the following images: … The end-to-end success rate is 88.8% for segmentation, 95.9% for recognition (given correct segmentation), and (0.888)*(0.959)7 = 66.2% total. Note that most of the errors come from segmentation, even though this is where all the custom programming was invested. 3.2 Register The procedure to solve HIPs is very similar. The image was smoothed, binarized, and the largest 5 connected components were identified. Two examples are presented below: The end-to-end success rate is 95.4% for segmentation, 87.1% for recognition (given correct segmentation), and (0.954)*(0.871)5 = 47.8% total. 3.3 Y a h oo/ E Z - G i mp y Unlike the mailblocks and register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. Though some amount of text warping is present, the text color, size, and font have low variability. Three simple segmentation algorithms were designed with associated rules to identify which algorithm to use. The goal was to keep these simple yet effective: a) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. One example: b) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don’t have neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: c) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: Tests for black and white meshes were performed to determine which segmentation algorithm to use. The end-to-end success rate was 56.2% for segmentation (38.2% came from a), 11.8% from b), and 6.2% from c), 90.3% for recognition (given correct segmentation), and (0.562)*(0.903)4.8 = 34.4% total. The average length of a Yahoo HIP solution is 4.8 characters. 3.4 T i c k e t ma s t e r The procedure that solved the Yahoo HIP is fairly successful at solving some of the ticket master HIPs. These HIPs are characterized by cris-crossing lines at random angles clustered around 0, 45, 90, and 135 degrees. A multipronged attack as in the Yahoo case (section 3.3) has potential. 
In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample image, dilate first then erode, select large CCs with sizes close to HIP char sizes. One example: The dilate-erode combination causes the lines to be removed (along with any thin objects) but retains solid thick characters. This single attack is successful in achieving an end-to-end success rate of 16.6% for segmentation, the recognition rate was 82.3% (in spite of interfering lines), and (0.166)*(0.823)6.23 = 4.9% total. The average HIP solution length is 6.23 characters. 3.5 Y a h oo ve r s i on 2 The second generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even use a phonetic generator, b) it uses only black and white colors, c) uses both letters and digits, and d) uses connected lines and arcs as clutter. The HIP is somewhat similar to the MSN/Passport HIP which does not use a dictionary, uses two colors, uses letters and digits, and background and foreground arcs as clutter. Unlike the MSN/Passport HIP, several different fonts are used. A single segmentation attack was developed: Remove 6 pixel border, up-sample, dilate first then erode, select large CCs with sizes close to HIP char sizes. The attack is practically identical to that used for the ticketmaster HIP with different preprocessing stages and slightly modified parameters. Two examples: This single attack is successful in achieving an end-to-end success rate of 58.4% for segmentation, the recognition rate was 95.2%, and (0.584)*(0.952)5 = 45.7% total. The average HIP solution length is 5 characters. 3.6 G oog l e / G M a i l The Google HIP is unique in that it uses only image warp as a means of distorting the characters. Similar to the MSN/Passport and Yahoo version 2 HIPs, it is also two color. The HIP characters are arranged closed to one another (they often touch) and follow a curved baseline. The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold and separate connected components. a) b) This very simple attack gives an end-to-end success rate of 10.2% for segmentation, the recognition rate was 89.3%, giving (0.102)*(0.893)6.5 = 4.89% total probability of breaking a HIP. Average Google HIP solution length is 6.5 characters. This can be significantly improved upon by judicious use of dilate-erode attack. A direct application doesn’t do as well as it did on the ticketmaster and yahoo HIPs (because of the shear and warp of the baseline of the word). More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates. 4 Lesso n s lea rn ed f ro m b rea ki n g H IPs From the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. This observation raises the following questions: Why is segmentation a hard problem? Can we devise harder HIPs and datasets? Can we build an automatic segmentor? Can we compare classification algorithms based on how useful they are for segmentation? 4.1 T h e s e g me n t a t i on p r ob l e m As a review, segmentation is difficult for the following reasons: 1. Segmentation is computationally expensive. In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations. 2. The segmentation function is complex. 
To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. This task is intrinsically more difficult than classification because the space of input is considerably larger. Unlike the space of valid patterns, the space of non-valid patterns is typically too vast to sample. This is a problem for many learning algorithms which yield too many false positives when presented non-valid patterns. 3. Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. For example, correctly identifying which 8 characters among 20 candidates (assuming 12 false positives), has a 1 in 125,970 (20 choose 8) chances of success by random guessing. 4.2 B ui l d i n g b e t te r / h a r de r H I P s We can use what we have learned to build better HIPs. For instance the HIP below was designed to make segmentation difficult and a similar version has been deployed by MSN Passport for hotmail registrations (www.hotmail.com): The idea is that the additional arcs are themselves good candidates for false characters. The previous segmentation attacks would fail on this HIP. Furthermore, simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to. We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. Pushing this to the extreme, we can easily generate the following HIPs: Despite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms. 4.3 B ui l d i n g a n a ut o ma t i c s e g me n t or To build an automatic segmentor, we could use the following procedure. Label characters based on their correct position and train a recognizer. Apply the trained recognizer at all locations in the HIP image. Collect all candidate characters identified with high confidence by the recognizer. Compute the probability of each combination of candidates (going from left to right), and output the solution string with the highest probability. This is better illustrated with an example. Consider the following HIP (to the right). The trained neural network has these maps (warm colors indicate recognition) that show that K, Y, and so on are correctly identified. However, the maps for 7 and 9 show several false positives. In general, we would get the following color coded map for all the different candidates: HIP K Y B 7 9 With a threshold of 0.5 on the network’s outputs, the map obtained is: We note that there are several false positives for each true positive. The number of false positives per true positive character was found to be between 1 and 4, giving a 1 in C(16,8) = 12,870 to 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. These numbers can be improved upon by constraining solution strings to flow sequentially from left to right and by restricting overlap. For each combination, we compute a probability by multiplying the 8 probabilities of the classifier for each position. The combination with the highest probability is the one proposed by the classifier. 
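The candidate-combination step of the automatic segmentor described above can be read as a small search problem; the sketch below is one hedged interpretation, where the input triples, the minimum-gap overlap constraint, and the function name are our own assumptions rather than details given in the paper.

```python
from itertools import combinations
import numpy as np

def best_segmentation(candidates, n_chars, min_gap=8):
    """Score every left-to-right combination of candidate detections.

    candidates: list of (x_position, character_label, probability) triples
                produced by sweeping a trained classifier over the image.
    n_chars:    number of characters in the HIP solution.
    min_gap:    minimum horizontal spacing (pixels) used to restrict overlap.
    Returns the solution string with the highest combined probability.
    """
    candidates = sorted(candidates, key=lambda c: c[0])    # enforce left-to-right flow
    best_string, best_prob = None, 0.0
    for combo in combinations(candidates, n_chars):
        xs = [c[0] for c in combo]
        if any(b - a < min_gap for a, b in zip(xs, xs[1:])):
            continue                                       # overlapping candidates rejected
        prob = np.prod([c[2] for c in combo])              # product of per-position probabilities
        if prob > best_prob:
            best_string = "".join(c[1] for c in combo)
            best_prob = prob
    return best_string
```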
We do not have results for such an automatic segmentor at this time. It is interesting to note that with such a method a classifier that is robust to false positives would do far better than one that is not. This suggests another axis for comparing classifiers. 5 Con clu si on In this paper, we have successfully applied machine learning to the problem of solving HIPs. We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. Recognition on even unprocessed images (given segmentation is a solved) can be done automatically using neural networks. Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. We have used this observation to design new HIPs and new tests for machine learning algorithms with the hope of improving them. A c k n ow l e d ge me n t s We would like to acknowledge Chau Luu and Eric Meltzer for their help with labeling and segmenting various HIPs. We would also like to acknowledge Josh Benaloh and Cem Paya for stimulating discussions on HIP security. References [1] Baird HS (1992), “Anatomy of a versatile page reader,” IEEE Pro., v.80, pp. 1059-1065. [2] Turing AM (1950), “Computing Machinery and Intelligence,” Mind, 59:236, pp. 433-460. [3] First Workshop on Human Interactive Proofs, Palo Alto, CA, January 2002. [4] Von Ahn L, Blum M, and Langford J, The Captcha Project. http://www.captcha.net [5] Baird HS and Popat K (2002) “Human Interactive Proofs and Document Image Analysis,” Proc. IAPR 2002 Workshop on Document Analysis Systerms, Princeton, NJ. [6] Simard PY, Steinkraus D, and Platt J, (2003) “Best Practice for Convolutional Neural Networks Applied to Visual Document Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), pp. 958-962, IEEE Computer Society, Los Alamitos. [7] Mori G, Malik J (2003), “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” Proc. of the Computer Vision and Pattern Recognition (CVPR) Conference, IEEE Computer Society, vol.1, pages:I-134 - I-141, June 18-20, 2003 [8] Chew, M. and Baird, H. S. (2003), “BaffleText: a Human Interactive Proof,” Proc., 10th IS&T;/SPIE Document Recognition & Retrieval Conf., Santa Clara, CA, Jan. 22. [9] LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,’ Proceedings of the IEEE, Nov. 1998.
6 0.2798551 88 nips-2004-Intrinsically Motivated Reinforcement Learning
7 0.27220681 192 nips-2004-The power of feature clustering: An application to object detection
8 0.26811647 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
9 0.24943538 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
10 0.22403784 68 nips-2004-Face Detection --- Efficient and Rank Deficient
11 0.22348964 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
12 0.22317854 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery
13 0.22269747 109 nips-2004-Mass Meta-analysis in Talairach Space
14 0.21086857 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
15 0.19824161 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
16 0.19486174 40 nips-2004-Common-Frame Model for Object Recognition
17 0.19446932 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
18 0.19180535 193 nips-2004-Theories of Access Consciousness
19 0.18829535 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
20 0.18691275 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
topicId topicWeight
[(13, 0.092), (15, 0.147), (17, 0.016), (24, 0.025), (25, 0.012), (26, 0.045), (27, 0.013), (31, 0.031), (33, 0.179), (35, 0.028), (36, 0.234), (39, 0.018), (50, 0.038), (56, 0.011)]
simIndex simValue paperId paperTitle
1 0.97201061 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing
Author: Bernd Fischer, Volker Roth, Jonas Grossmann, Sacha Baginsky, Wilhelm Gruissem, Franz Roos, Peter Widmayer, Joachim M. Buhmann
Abstract: De novo Sequencing of peptides is a challenging task in proteome research. While there exist reliable DNA-sequencing methods, the highthroughput de novo sequencing of proteins by mass spectrometry is still an open problem. Current approaches suffer from a lack in precision to detect mass peaks in the spectrograms. In this paper we present a novel method for de novo peptide sequencing based on a hidden Markov model. Experiments effectively demonstrate that this new method significantly outperforms standard approaches in matching quality. 1
2 0.86548132 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
Author: Máté Lengyel, Peter Dayan
Abstract: Areas of the brain involved in various forms of memory exhibit patterns of neural activity quite unlike those in canonical computational models. We show how to use well-founded Bayesian probabilistic autoassociative recall to derive biologically reasonable neuronal dynamics in recurrently coupled models, together with appropriate values for parameters such as the membrane time constant and inhibition. We explicitly treat two cases. One arises from a standard Hebbian learning rule, and involves activity patterns that are coded by graded firing rates. The other arises from a spike timing dependent learning rule, and involves patterns coded by the phase of spike times relative to a coherent local field potential oscillation. Our model offers a new and more complete understanding of how neural dynamics may support autoassociation. 1
same-paper 3 0.84802246 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
Author: Dashan Gao, Nuno Vasconcelos
Abstract: Saliency mechanisms play an important role when visual recognition must be performed in cluttered scenes. We propose a computational definition of saliency that deviates from existing models by equating saliency to discrimination. In particular, the salient attributes of a given visual class are defined as the features that enable best discrimination between that class and all other classes of recognition interest. It is shown that this definition leads to saliency algorithms of low complexity, that are scalable to large recognition problems, and is compatible with existing models of early biological vision. Experimental results demonstrating success in the context of challenging recognition problems are also presented. 1
4 0.73401403 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks
Author: Nils Bertschinger, Thomas Natschläger, Robert A. Legenstein
Abstract: In this paper we analyze the relationship between the computational capabilities of randomly connected networks of threshold gates in the timeseries domain and their dynamical properties. In particular we propose a complexity measure which we find to assume its highest values near the edge of chaos, i.e. the transition from ordered to chaotic dynamics. Furthermore we show that the proposed complexity measure predicts the computational capabilities very well: only near the edge of chaos are such networks able to perform complex computations on time series. Additionally a simple synaptic scaling rule for self-organized criticality is presented and analyzed. 1
5 0.73117012 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
6 0.7304824 178 nips-2004-Support Vector Classification with Input Data Uncertainty
7 0.73011547 131 nips-2004-Non-Local Manifold Tangent Learning
8 0.73003072 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
9 0.72427583 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
10 0.72393227 127 nips-2004-Neighbourhood Components Analysis
11 0.72383499 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
12 0.72301674 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
13 0.7230041 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
14 0.72290939 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
15 0.72243267 40 nips-2004-Common-Frame Model for Object Recognition
16 0.72171277 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
17 0.72163695 102 nips-2004-Learning first-order Markov models for control
18 0.72134191 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
19 0.72119945 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
20 0.72104365 187 nips-2004-The Entire Regularization Path for the Support Vector Machine