nips nips2003 nips2003-37 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Deva Ramanan, David A. Forsyth
Abstract: This paper describes a system that can annotate a video sequence with: a description of the appearance of each actor; when the actor is in view; and a representation of the actor’s activity while in view. The system does not require a fixed background, and is automatic. The system works by (1) tracking people in 2D and then, using an annotated motion capture dataset, (2) synthesizing an annotated 3D motion sequence matching the 2D tracks. The 3D motion capture data is manually annotated off-line using a class structure that describes everyday motions and allows motion annotations to be composed — one may jump while running, for example. Descriptions computed from video of real motions show that the method is accurate.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper describes a system that can annotate a video sequence with: a description of the appearance of each actor; when the actor is in view; and a representation of the actor’s activity while in view. [sent-7, score-0.36]
2 The system works by (1) tracking people in 2D and then, using an annotated motion capture dataset, (2) synthesizing an annotated 3D motion sequence matching the 2D tracks. [sent-9, score-1.131]
3 The 3D motion capture data is manually annotated off-line using a class structure that describes everyday motions and allows motion annotations to be composed — one may jump while running, for example. [sent-10, score-1.381]
4 Introduction: It would be useful to have a system that could take large volumes of video data of people engaged in everyday activities and produce annotations of that data with statements about the activities of the actors. [sent-13, score-0.548]
5 We then synthesize 3D motion sequences matching our 2D tracks using a collection of annotated motion capture data, and then apply the annotations of the synthesized sequence to the video. [sent-17, score-1.443]
6 Because people do not change in appearance from frame to frame, a practical strategy is to cluster an appearance model for each possible person over the sequence, and then use these models to drive detection. [sent-20, score-0.327]
7 This yields a tracker that is capable of meeting all our criteria, described in greater detail in [14]; we used the tracker of that paper. [sent-21, score-0.276]
8 Leventon and Freeman show that tracks can be significantly improved by comparison with human motion [12]. [sent-22, score-0.464]
9 Describing motion is subtle, because we require a set of categories into which the motion can be classified; except in the case of specific activities, there is no known natural set of categories. [sent-23, score-0.684]
10 In our opinion, it is difficult to establish a canonical set of human motion categories, and more practical to produce a system that allows easy revision of the categories (section 2). [sent-25, score-0.408]
11 We use 3 core components: annotation, tracking, and motion synthesis. [sent-27, score-0.342]
12 Initially, a user labels a collection of 3D motion capture frames with annotations (section 2). [sent-28, score-0.938]
13 Given a new video sequence to annotate, we use a kinematic tracker to obtain 2D tracks of each figure in the sequence (section 3). [sent-29, score-0.429]
14 Figure 1: Our annotation system consists of 3 main components: annotation, tracking, and motion synthesis (the shaded nodes). [sent-30, score-2.238]
15 A user initially labels a collection of 3D motion capture frames with annotations. [sent-31, score-0.592]
16 Given a new video sequence to annotate, we use a kinematic tracker to obtain 2D tracks of each figure in the sequence. [sent-32, score-0.379]
17 We then synthesize 3D motion sequences which look like the 2D tracks by lifting tracks to 3D and matching them to our annotated motion capture library. [sent-33, score-1.104]
18 We accept the annotations associated with the synthesized 3D motion sequence as annotations for the underlying video sequence. [sent-34, score-1.185]
19 We then synthesize 3D motion sequences which look like the 2D tracks by lifting tracks to 3D and matching them to our annotated motion capture library (section 4). [sent-35, score-1.104]
20 We finally smooth the annotations associated with the synthesized 3D motion sequence (section 5), accepting them as annotations for the underlying video sequence. [sent-36, score-1.185]
21 Obtaining Annotated Data: We have annotated a body of motion data with an annotation system, described in detail in [3]; we repeat some information here for the convenience of the reader. [sent-38, score-0.951]
22 There is no reason to believe that a canonical annotation vocabulary is available for everyday motion, meaning that the system of annotation should be flexible. [sent-39, score-1.034]
23 Our annotation system attaches a bit string to each frame of motion. [sent-42, score-0.638]
24 Each bit in the string represents annotation with a particular element of the vocabulary, meaning that elements of the vocabulary can be composed arbitrarily. [sent-43, score-0.563]
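To make this representation concrete, here is a minimal Python sketch (all names hypothetical) of per-frame annotation bit strings over the 13-term vocabulary of this section; composition is just setting several bits in the same frame, e.g. jumping while running.

```python
import numpy as np

# The 13-term vocabulary used to annotate the database (section 2).
VOCAB = ["run", "walk", "wave", "jump", "turn left", "turn right", "catch",
         "reach", "carry", "backwards", "crouch", "stand", "pick up"]
IDX = {term: i for i, term in enumerate(VOCAB)}

def annotate(n_frames, frame_labels):
    """Attach a bit string to each frame; bits compose arbitrarily."""
    bits = np.zeros((n_frames, len(VOCAB)), dtype=bool)
    for frame, labels in frame_labels.items():
        for term in labels:
            bits[frame, IDX[term]] = True
    return bits

# One may jump while running: both bits are set on the same frames.
bits = annotate(100, {t: ["run", "jump"] for t in range(40, 60)})
```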
25 Actual annotation is simplified by using an approach where the user bootstraps a classifier. [sent-44, score-0.462]
26 The user annotates a series of example frames by hand by selecting a sequence from the motion collection; a classifier is then learned from these examples, and the user reviews the resulting annotations. [sent-46, score-0.636]
27 If they are not acceptable, the user revises the annotations at will, and then re-learns a classifier. [sent-47, score-0.382]
28 The classifier itself uses a radial basis function kernel and uses, as a feature vector, the joint positions for one second of motion centered at the frame being classified. [sent-49, score-0.473]
29 Since the motion is sampled in time, each joint has a discrete 3D trajectory in space for the second of motion centered at the frame. [sent-50, score-0.684]
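A minimal sketch of one plausible implementation of this classifier, assuming 60 Hz motion capture and scikit-learn's RBF-kernel SVM as the learner (the capture rate and the specific learner are assumptions; the text only specifies the kernel and the one-second feature window):

```python
import numpy as np
from sklearn.svm import SVC

FPS = 60  # assumed capture rate; "one second of motion" is all the text states

def window_feature(joints, frame, fps=FPS):
    """Feature vector: joint positions for one second of motion centered at
    `frame`. joints is an (n_frames, n_joints, 3) array; sequence ends are
    padded by repeating the boundary frame."""
    half = fps // 2
    lo, hi = max(frame - half, 0), min(frame + half, len(joints))
    pad_lo = np.repeat(joints[:1], max(0, half - frame), axis=0)
    pad_hi = np.repeat(joints[-1:], max(0, frame + half - len(joints)), axis=0)
    return np.concatenate([pad_lo, joints[lo:hi], pad_hi]).ravel()

def fit_term_classifier(joints, frames, on_off):
    """One binary RBF classifier for one vocabulary term, re-learnable
    whenever the user revises the example annotations."""
    X = np.stack([window_feature(joints, f) for f in frames])
    return SVC(kernel="rbf").fit(X, on_off)
```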
30 Our reference collection consists of a total of 7 minutes of motion capture data. [sent-53, score-0.412]
31 The vocabulary that we chose to annotate this database consisted of: run, walk, wave, jump, turn left, turn right, catch, reach, carry, backwards, crouch, stand, and pick up. [sent-54, score-0.212]
32 Some of these annotations co-occur: turn left while walking, or catch while jumping and running. [sent-55, score-0.571]
33 Our approach admits any combination of annotations, though some combinations may not be used in practice: for example, we can’t conceive of a motion that should be annotated with both stand and run. [sent-56, score-0.531]
34 We have verified that a consistent set of annotations to describe a motion set can be picked by asking people outside our research group to annotate the same database and comparing annotation results. [sent-59, score-1.215]
35 The model is built by applying detuned body segment detectors to some or all frames in a sequence. [sent-65, score-0.259]
36 These detectors respond to roughly parallel contrast energies at a set of fixed scales (one for the torso and one for other segments). [sent-66, score-0.201]
37 For the frames that are used to build the model, we cluster together segments that are sufficiently close in appearance — as encoded by a patch of pixels within the segment — and appear in multiple frames without violating upper bounds on velocity. [sent-68, score-0.454]
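The clustering step might look roughly like the greedy sketch below (an illustration only, not the actual tracker of [14]; the appearance threshold and velocity bound are hypothetical parameters):

```python
import numpy as np

def cluster_segments(detections, app_thresh, max_vel):
    """detections: list of dicts with 'frame', 'pos' (2D image position),
    and 'patch' (flattened pixel patch as a numpy array). Greedily chain
    detections that are close in appearance and obey an upper bound on
    image velocity."""
    clusters = []
    for det in sorted(detections, key=lambda d: d["frame"]):
        for c in clusters:
            last = c[-1]
            dt = det["frame"] - last["frame"]
            if dt <= 0:
                continue
            close = np.linalg.norm(det["patch"] - last["patch"]) < app_thresh
            slow = np.linalg.norm(det["pos"] - last["pos"]) / dt < max_vel
            if close and slow:
                c.append(det)
                break
        else:
            clusters.append([det])  # start a new putative segment track
    return clusters
```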
38 The torso is used as a root, because our torso detector is quite reliable. [sent-71, score-0.402]
39 One then looks for segments that lie close to the torso in multiple frames to form arm and leg segments. [sent-72, score-0.379]
40 Detecting the learned appearance model in the sequence of frames is straightforward [8]. [sent-75, score-0.252]
41 We assume that camera motion can be recovered from a video sequence and so we need only to recover the pose of the root of the body model — in our case, the torso — with respect to the camera. [sent-81, score-0.893]
42 We represent the 2D key points with respect to a 2D torso coordinate frame. [sent-84, score-0.233]
43 We analogously convert the motion capture data to 3D key points represented with respect to the 3D torso coordinate frame. [sent-85, score-0.645]
44 This means that the scaling of the body can be folded in with the camera scale, and the overall scale is estimated using corresponding limb lengths in lateral views (which can be identified because they maximize the limb lengths). [sent-87, score-0.286]
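A sketch of the 2D torso-frame representation above, assuming the tracker supplies the torso's image position and in-plane orientation (the exact frame convention is not spelled out in the text):

```python
import numpy as np

def to_torso_frame_2d(keypoints, torso_pos, torso_angle):
    """Express 2D key points (n, 2) relative to the torso: translate by the
    torso position, then rotate by the negative torso angle. The overall
    scale (folded in with the camera) would divide out here, estimated from
    limb lengths in lateral views."""
    c, s = np.cos(-torso_angle), np.sin(-torso_angle)
    R = np.array([[c, -s], [s, c]])
    return (keypoints - torso_pos) @ R.T
```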
45 Our motion capture database is too large for us to use every frame in the matching process. [sent-89, score-0.577]
46 Furthermore, many motion fragments are similar — there is an awful lot of running — so we vector quantize the 11,000 frames down to k = 300 frames by clustering with k-means and retaining only the cluster medoids. [sent-90, score-0.676]
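A sketch of the vector quantization step, using plain Euclidean k-means for brevity (the paper's actual metric, described next, also weights velocities and accelerations):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_mocap(frame_feats, k=300, seed=0):
    """Cluster the ~11,000 motion capture frames (rows of frame_feats) and
    retain only the medoid, i.e. the nearest real frame, of each cluster."""
    km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit(frame_feats)
    medoids = []
    for j in range(k):
        members = np.flatnonzero(km.labels_ == j)
        d = np.linalg.norm(frame_feats[members] - km.cluster_centers_[j], axis=1)
        medoids.append(members[d.argmin()])
    return np.asarray(medoids)  # indices of the k representative frames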
47 Our distance metric is a weighted sum of differences between 3D key point positions, velocities, and accelerations ([2] found this metric sufficient to ensure smooth motion synthesis). [sent-91, score-0.374]
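The metric might be implemented as below, with finite differences standing in for velocities and accelerations; the weights are hypothetical since the text does not report them:

```python
import numpy as np

def motion_distance(P, i, Q, j, wp=1.0, wv=1.0, wa=1.0):
    """Weighted sum of differences in 3D key point positions, velocities,
    and accelerations. P, Q: (n_frames, n_points, 3) arrays of torso-frame
    key points; i and j need a neighbor on each side for the differences."""
    vel = lambda X, t: X[t + 1] - X[t - 1]            # central difference
    acc = lambda X, t: X[t + 1] - 2 * X[t] + X[t - 1]
    return (wp * np.linalg.norm(P[i] - Q[j]) +
            wv * np.linalg.norm(vel(P, i) - vel(Q, j)) +
            wa * np.linalg.norm(acc(P, i) - acc(Q, j)))
```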
48 The motion capture data are … Figure 2: (a) the variables under discussion in camera inference; (b) a directed model; (c) an undirected model; (d) a factorial HMM; (e) a triangulated FHMM. [sent-92, score-0.501]
49 M is a representation of the figure in 3D with respect to its root coordinate frame, m is the partially observed vector of 2D key points, t is the known camera position, and T is the position of the root of the 3D figure. [sent-93, score-0.225]
50 In (b), a camera model for frame i, where the 2D keypoints depend on the camera position, the 3D figure configuration, and the root of the 3D figure. [sent-94, score-0.402]
51 In practice, we do not need to model the translations for the 3D root (which is the torso); our tracker reports the (x, y) image position for the torso, and we simply accept these reports. [sent-103, score-0.229]
52 This means that T reduces to a single scalar representing the orientation of the torso along the ground plane. [sent-104, score-0.244]
53 The relative out-of-image-plane movement of the torso (in the z direction) can be recovered from the final inferred M and T values by integration — one sums the out-of-plane velocities of the rotated motion capture frames. [sent-105, score-0.718]
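A sketch of that integration, assuming each chosen medoid stores its root velocity in the torso frame and that the inferred orientation is a rotation about the vertical axis (both assumptions; the text gives only the summation idea):

```python
import numpy as np

def recover_depth(medoid_root_vel, M_hat, T_hat):
    """Sum the out-of-image-plane (z) velocities of the matched medoids,
    each rotated into the world by its inferred torso orientation."""
    z = 0.0
    depths = []
    for m, th in zip(M_hat, T_hat):
        vx, _, vz = medoid_root_vel[m]           # torso-frame root velocity
        z += -np.sin(th) * vx + np.cos(th) * vz  # z after rotation about y
        depths.append(z)
    return np.asarray(depths)
```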
54 This means that ψview_i(Mi, Ti) is the mean squared error between the visible 2D key points mi and the corresponding 3D keypoints Mi rendered at orientation Ti. [sent-110, score-0.261]
55 To incorporate higher-order dynamic information such as velocities and accelerations, we add keypoints from the two preceding and two following frames when computing the mean squared error. [sent-112, score-0.214]
56 We quantize the torso orientation Ti into a total of c = 20 values. [sent-113, score-0.244]
57 This means that the potential ψview_i(Mi, Ti) is represented by a c × k table (recall that k is the total number of motion capture medoids used, section 4). [sent-114, score-0.412]
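A sketch of how the c × k table for one frame might be built, assuming an orthographic camera and rotation about the vertical axis (neither detail is fixed by the text):

```python
import numpy as np

def view_potential(medoids_3d, m2d, visible, c=20):
    """Rows: the c quantized torso orientations; columns: the k medoids.
    Entry (a, j) is the mean squared error between the visible 2D key
    points and medoid j's 3D key points rendered at orientation a.
    medoids_3d: (k, n_pts, 3); m2d: (n_pts, 2); visible: (n_pts,) mask."""
    k = len(medoids_3d)
    table = np.empty((c, k))
    for a in range(c):
        th = 2 * np.pi * a / c
        R = np.array([[np.cos(th), 0, np.sin(th)],
                      [0, 1, 0],
                      [-np.sin(th), 0, np.cos(th)]])
        for j in range(k):
            proj = (medoids_3d[j] @ R.T)[:, :2]  # orthographic projection
            err = proj[visible] - m2d[visible]
            table[a, j] = np.mean(np.sum(err ** 2, axis=1))
    return table
```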
58 We must also define a potential linking body configurations in time, representing the continuity cost of placing one motion after another. [sent-115, score-0.429]
59 This is a k × k table, and we set the (i, j)’th entry of this table to be the distance between the j’th medoid and the frame following the i’th medoid, using the metric used for vector quantizing the motion capture dataset (section 4). [sent-117, score-0.598]
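The continuity table can then be filled in directly; a sketch, where `dist` is the quantization metric from section 4 and medoids are assumed not to be the final frames of their clips:

```python
import numpy as np

def continuity_table(frame_feats, medoid_idx, dist):
    """Entry (i, j): distance between the j'th medoid and the frame that
    follows the i'th medoid in the original motion capture data."""
    k = len(medoid_idx)
    T = np.empty((k, k))
    for i in range(k):
        follower = frame_feats[medoid_idx[i] + 1]  # frame after medoid i
        for j in range(k):
            T[i, j] = dist(follower, frame_feats[medoid_idx[j]])
    return T
```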
60 We show smoothed annotation results for a sequence of jumping jacks (sometimes known as star jumps) from two such annotation systems. [sent-122, score-0.993]
61 At the bottom, we show signals representing annotation bits over time. [sent-125, score-0.467]
62 The automatic annotation consists of a total of 16 bits: present, front facing, plus the 13 bits from the annotation vocabulary of Sec. [sent-127, score-1.03]
63 In the first dotted line, corresponding to the image above it, the manual annotator asserts that the figure is present, frontally facing, and about to reach the extended stance. [sent-129, score-0.219]
64 The annotations for both systems are reasonable given that no corresponding categories are available (this is like describing a movement that is totally unfamiliar). [sent-131, score-0.383]
65 On the left, we freely allow ’null’ annotations (where no annotation bit is set). [sent-132, score-0.823]
66 On the right, we discourage ’null’ annotations as described in Sec. [sent-133, score-0.346]
67 For example, we expect the torso angular velocity of a turning motion frame to differ from that of a forward-walking frame. [sent-137, score-0.817]
68 π) and the actual torso angular velocity of the medoid Mi . [sent-141, score-0.329]
69 By modeling camera dependencies, we are able to fix incorrect torso orientations present in the left system (i.e., [sent-157, score-0.348]
70 the first image frame and the automatic left facing and right facing annotation bits). [sent-159, score-0.813]
71 Although the smoothing system correctly annotates the last image frame with backward, the occluded arm incorrectly triggers a wave, by the mechanism described in Sec. [sent-161, score-0.261]
72 However, this is simplicity at the cost of wasting an important constraint — the camera does not flip around the body from frame to frame. [sent-164, score-0.307]
73 In particular, in a lateral view of a figure in the stance phase of walking it is very difficult to tell which way the actor is facing without reference to other frames — where it may not be ambiguous. [sent-166, score-0.341]
74 The simplest method for reporting annotations is to produce an annotation that is some function of the MAP estimates {M̂i}. [sent-170, score-0.346]
75 We could now report either the annotation of the medoid, the annotation that appears most frequently in the cluster, the annotation of the cluster element that matches the image best, or the frequency of annotations across the cluster. [sent-173, score-1.709]
76 The fourth alternative produces results that may be useful for some kinds of decision-making, but are very difficult to interpret directly — each frame generates a posterior probability over the annotation vocabulary — and we do not discuss it further here. [sent-174, score-0.643]
77 Each of the first three tends to produce choppy annotation streams (figure 4, center). [sent-175, score-0.426]
78 The dashed vertical lines indicate annotations corresponding to the frames shown. [sent-178, score-0.49]
79 The automatic annotations are largely accurate: the figures are correctly identified, and the direction in which the figures are facing is largely correct. [sent-179, score-0.479]
80 fairly rough approximation of a smoothness constraint (because some frames in one cluster might link well to some frames in another and badly to others in that same cluster). [sent-183, score-0.334]
81 Smoothing Annotations: Recall that we have 13 terms in our annotation vocabulary, each of which can be on or off for any given frame. [sent-185, score-0.426]
82 Clearly, we cannot smooth annotation bits directly, because we might very likely create bit strings that never occur. [sent-187, score-0.518]
83 Instead, we regard each observed annotation string as a codeword. [sent-188, score-0.426]
84 Note that this model is fully observed in the 11,000 frames of the motion database; we know the true code word for each motion frame and the cluster to which the frame belongs. [sent-192, score-1.136]
85 We now apply this model to the MAP estimate of {Mi }, inferring a sequence of annotation codewords (which we can later expand back into annotation bit vectors). [sent-194, score-0.981]
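Since the model is fully observed on the motion database, its parameters are just smoothed counts, and decoding is standard Viterbi; a sketch (the smoothing constant and uniform initial prior are assumptions):

```python
import numpy as np

def learn_hmm(codes, clusters, n_codes, n_clusters, eps=1e-6):
    """Hidden state: annotation codeword per mocap frame (observed in the
    database). Observation: the medoid cluster the frame belongs to."""
    A = np.full((n_codes, n_codes), eps)      # codeword transition counts
    B = np.full((n_codes, n_clusters), eps)   # cluster-given-codeword counts
    for t in range(len(codes) - 1):
        A[codes[t], codes[t + 1]] += 1
    for c, m in zip(codes, clusters):
        B[c, m] += 1
    return (np.log(A / A.sum(1, keepdims=True)),
            np.log(B / B.sum(1, keepdims=True)))

def viterbi(obs, logA, logB):
    """MAP codeword sequence given the inferred medoid sequence `obs`."""
    n, S = len(obs), logA.shape[0]
    dp = np.empty((n, S))
    bp = np.zeros((n, S), dtype=int)
    dp[0] = logB[:, obs[0]]                   # uniform prior over codewords
    for t in range(1, n):
        scores = dp[t - 1][:, None] + logA    # scores[s_prev, s]
        bp[t] = scores.argmax(0)
        dp[t] = scores.max(0) + logB[:, obs[t]]
    path = [int(dp[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(bp[t, path[-1]]))
    return path[::-1]
```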
86 Fig. 3 shows annotation results for a 91-frame jumping jack (or star jump) sequence. [sent-203, score-0.618]
87 The top 4 lower-case annotations are hand-labeled over the entire 91-frame sequence. [sent-204, score-0.477]
88 Generally, automatic annotation is successful: the figure is detected correctly, oriented correctly (this is recovered from the torso orientation estimates Ti ), and the description of the figure’s activities is largely correct. [sent-205, score-0.762]
89 Fig. 4 compares three versions of our system on a 288-frame sequence of a figure walking back and forth. [sent-207, score-0.281]
90 Comparing the center annotations with those on the right (smoothed with our HMM) shows that annotation smoothing makes it possible to remove spurious jump, reach, and stand labels — the label dynamics are wrong. [sent-209, score-0.927]
91 We show smoothed annotations for three figures from one sequence passing a ball back and forth in Fig. [sent-210, score-0.426]
92 Each actor is correctly detected, and the system produces largely correct descriptions of the actor’s orientation and actions. [sent-212, score-0.226]
93 Quite often, the walk annotation will fire as the figure slows down to turn from face right to face left or vice versa. [sent-214, score-0.539]
94 When the figures use their arms to catch or throw, we see increased activity for the similar annotations of catch, wave, and reach. [sent-215, score-0.51]
95 When a novel motion is encountered, we want the system to respond either by (1) recognizing that it cannot annotate the sequence, or (2) annotating it with the best match possible. [sent-216, score-0.506]
96 We can implement (2) by adjusting the parameters for our smoothing HMM so that the ’null’ codeword (all annotation bits being off) is unlikely. [sent-217, score-0.5]
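One simple way to realize this, sketched in terms of the smoothing HMM above (the penalty value is hypothetical):

```python
import numpy as np

def discourage_null(logA, null_idx, penalty=5.0):
    """Subtract a log-space penalty from every transition into the 'null'
    codeword (all annotation bits off), then renormalize the rows."""
    logA = logA.copy()
    logA[:, null_idx] -= penalty
    return logA - np.log(np.exp(logA).sum(axis=1, keepdims=True))
```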
97 In Fig. 3, system (1) responds to a jumping jack sequence (a star jump, in some circles) with a combination of walking and jumping while waving. [sent-219, score-0.272]
98 In system (2), we see an additional standing annotation when the figure is near the closed stance. [sent-220, score-0.518]
99 Recognition of human body motion using phase space constraints. [sent-256, score-0.465]
100 Bayesian estimation of 3D human motion from an image sequence. [sent-290, score-0.417]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 37 nips-2003-Automatic Annotation of Everyday Movements
Author: Deva Ramanan, David A. Forsyth
Abstract: This paper describes a system that can annotate a video sequence with: a description of the appearance of each actor; when the actor is in view; and a representation of the actor’s activity while in view. The system does not require a fixed background, and is automatic. The system works by (1) tracking people in 2D and then, using an annotated motion capture dataset, (2) synthesizing an annotated 3D motion sequence matching the 2D tracks. The 3D motion capture data is manually annotated off-line using a class structure that describes everyday motions and allows motion annotations to be composed — one may jump while running, for example. Descriptions computed from video of real motions show that the method is accurate.
2 0.27598774 12 nips-2003-A Model for Learning the Semantics of Pictures
Author: Victor Lavrenko, R. Manmatha, Jiwoon Jeon
Abstract: We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vector. Given a training set of images with annotations, we compute a joint probabilistic model of image features and words which allow us to predict the probability of generating a word given the image regions. This may be used to automatically annotate and retrieve images given a word as a query. Experiments show that our model significantly outperforms the best of the previously reported results on the tasks of automatic image annotation and retrieval. 1
3 0.19168481 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
Author: Leonid Sigal, Michael Isard, Benjamin H. Sigelman, Michael J. Black
Abstract: The detection and pose estimation of people in images and video is made challenging by the variability of human appearance, the complexity of natural scenes, and the high dimensionality of articulated body models. To cope with these problems we represent the 3D human body as a graphical model in which the relationships between the body parts are represented by conditional probability distributions. We formulate the pose estimation problem as one of probabilistic inference over a graphical model where the random variables correspond to the individual limb parameters (position and orientation). Because the limbs are described by 6-dimensional vectors encoding pose in 3-space, discretization is impractical and the random variables in our model must be continuousvalued. To approximate belief propagation in such a graph we exploit a recently introduced generalization of the particle filter. This framework facilitates the automatic initialization of the body-model from low level cues and is robust to occlusion of body parts and scene clutter. 1
4 0.17593934 7 nips-2003-A Functional Architecture for Motion Pattern Processing in MSTd
Author: Scott A. Beardsley, Lucia M. Vaina
Abstract: Psychophysical studies suggest the existence of specialized detectors for component motion patterns (radial, circular, and spiral), that are consistent with the visual motion properties of cells in the dorsal medial superior temporal area (MSTd) of non-human primates. Here we use a biologically constrained model of visual motion processing in MSTd, in conjunction with psychophysical performance on two motion pattern tasks, to elucidate the computational mechanisms associated with the processing of widefield motion patterns encountered during self-motion. In both tasks discrimination thresholds varied significantly with the type of motion pattern presented, suggesting perceptual correlates to the preferred motion bias reported in MSTd. Through the model we demonstrate that while independently responding motion pattern units are capable of encoding information relevant to the visual motion tasks, equivalent psychophysical performance can only be achieved using interconnected neural populations that systematically inhibit non-responsive units. These results suggest the cyclic trends in psychophysical performance may be mediated, in part, by recurrent connections within motion pattern responsive areas whose structure is a function of the similarity in preferred motion patterns and receptive field locations between units. 1
5 0.16107982 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
Author: Amit Gruber, Yair Weiss
Abstract: The problem of “Structure From Motion” is a central problem in vision: given the 2D locations of certain points we wish to recover the camera motion and the 3D coordinates of the points. Under simplified camera models, the problem reduces to factorizing a measurement matrix into the product of two low rank matrices. Each element of the measurement matrix contains the position of a point in a particular image. When all elements are observed, the problem can be solved trivially using SVD, but in any realistic situation many elements of the matrix are missing and the ones that are observed have a different directional uncertainty. Under these conditions, most existing factorization algorithms fail while human perception is relatively unchanged. In this paper we use the well known EM algorithm for factor analysis to perform factorization. This allows us to easily handle missing data and measurement uncertainty and more importantly allows us to place a prior on the temporal trajectory of the latent variables (the camera position). We show that incorporating this prior gives a significant improvement in performance in challenging image sequences. 1
6 0.12195854 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion
7 0.092516348 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
8 0.08326783 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
9 0.075097866 10 nips-2003-A Low-Power Analog VLSI Visual Collision Detector
10 0.062473331 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
11 0.05948982 196 nips-2003-Wormholes Improve Contrastive Divergence
12 0.059460618 100 nips-2003-Laplace Propagation
13 0.056978583 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
14 0.056472555 112 nips-2003-Learning to Find Pre-Images
15 0.045048553 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
16 0.044976711 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification
17 0.043922011 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
18 0.04391522 21 nips-2003-An Autonomous Robotic System for Mapping Abandoned Mines
19 0.04323652 123 nips-2003-Markov Models for Automated ECG Interval Analysis
20 0.042094946 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
topicId topicWeight
[(0, -0.159), (1, -0.023), (2, 0.069), (3, 0.007), (4, -0.196), (5, -0.122), (6, 0.131), (7, 0.059), (8, 0.091), (9, -0.124), (10, -0.066), (11, 0.102), (12, -0.191), (13, 0.165), (14, 0.078), (15, 0.016), (16, 0.054), (17, -0.12), (18, -0.018), (19, -0.31), (20, 0.072), (21, -0.133), (22, -0.036), (23, 0.151), (24, -0.011), (25, -0.116), (26, -0.044), (27, 0.036), (28, 0.145), (29, -0.154), (30, -0.03), (31, 0.156), (32, 0.029), (33, -0.005), (34, 0.004), (35, -0.162), (36, 0.06), (37, 0.039), (38, 0.076), (39, 0.129), (40, 0.007), (41, -0.115), (42, 0.033), (43, -0.075), (44, -0.108), (45, -0.077), (46, -0.125), (47, -0.068), (48, 0.034), (49, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.97675884 37 nips-2003-Automatic Annotation of Everyday Movements
Author: Deva Ramanan, David A. Forsyth
Abstract: This paper describes a system that can annotate a video sequence with: a description of the appearance of each actor; when the actor is in view; and a representation of the actor’s activity while in view. The system does not require a fixed background, and is automatic. The system works by (1) tracking people in 2D and then, using an annotated motion capture dataset, (2) synthesizing an annotated 3D motion sequence matching the 2D tracks. The 3D motion capture data is manually annotated off-line using a class structure that describes everyday motions and allows motion annotations to be composed — one may jump while running, for example. Descriptions computed from video of real motions show that the method is accurate.
2 0.67534697 7 nips-2003-A Functional Architecture for Motion Pattern Processing in MSTd
Author: Scott A. Beardsley, Lucia M. Vaina
Abstract: Psychophysical studies suggest the existence of specialized detectors for component motion patterns (radial, circular, and spiral), that are consistent with the visual motion properties of cells in the dorsal medial superior temporal area (MSTd) of non-human primates. Here we use a biologically constrained model of visual motion processing in MSTd, in conjunction with psychophysical performance on two motion pattern tasks, to elucidate the computational mechanisms associated with the processing of widefield motion patterns encountered during self-motion. In both tasks discrimination thresholds varied significantly with the type of motion pattern presented, suggesting perceptual correlates to the preferred motion bias reported in MSTd. Through the model we demonstrate that while independently responding motion pattern units are capable of encoding information relevant to the visual motion tasks, equivalent psychophysical performance can only be achieved using interconnected neural populations that systematically inhibit non-responsive units. These results suggest the cyclic trends in psychophysical performance may be mediated, in part, by recurrent connections within motion pattern responsive areas whose structure is a function of the similarity in preferred motion patterns and receptive field locations between units. 1 In trod u ction A major challenge in computational neuroscience is to elucidate the architecture of the cortical circuits for sensory processing and their effective role in mediating behavior. In the visual motion system, biologically constrained models are playing an increasingly important role in this endeavor by providing an explanatory substrate linking perceptual performance and the visual properties of single cells. Single cell studies indicate the presence of complex interconnected structures in middle temporal and primary visual cortex whose most basic horizontal connections can impart considerable computational power to the underlying neural population [1, 2]. Combined psychophysical and computational studies support these findings Figure 1: a) Schematic of the graded motion pattern (GMP) task. Discrimination pairs of stimuli were created by perturbing the flow angle (φ) of each 'test' motion (with average dot speed, vav), by ±φp in the stimulus space spanned by radial and circular motions. b) Schematic of the shifted center-of-motion (COM) task. Discrimination pairs of stimuli were created by shifting the COM of the ‘test’ motion to the left and right of a central fixation point. For each motion pattern the COM was shifted within the illusory inner aperture and was never explicitly visible. and suggest that recurrent connections may play a significant role in encoding the visual motion properties associated with various psychophysical tasks [3, 4]. Using this methodology our goal is to elucidate the computational mechanisms associated with the processing of wide-field motion patterns encountered during self-motion. In the human visual motion system, psychophysical studies suggest the existence of specialized detectors for the motion pattern components (i.e., radial, circular and spiral motions) associated with self-motion [5, 6]. 
Neurophysiological studies reporting neurons sensitive to motion patterns in the dorsal medial superior temporal area (MSTd) support the existence of such mechanisms [7-10], and in conjunction with psychophysical studies suggest a strong link between the patterns of neural activity and motion-based perceptual performance [11, 12]. Through the combination of human psychophysical performance and biologically constrained modeling we investigate the computational role of simple recurrent connections within a population of MSTd-like units. Based on the known visual motion properties within MSTd we ask what neural structures are computationally sufficient to encode psychophysical performance on a series of motion pattern tasks. 2 M o t i o n pa t t e r n d i sc r i m i n a t i o n Using motion pattern stimuli consistent with previous studies [5, 6], we have developed a set of novel psychophysical tasks designed to facilitate a more direct comparison between human perceptual performance and the visual motion properties of cells in MSTd that have been found to underlie the discrimination of motion patterns [11, 12]. The psychophysical tasks, referred to as the graded motion pattern (GMP) and shifted center-of-motion (COM) tasks, are outlined in Fig. 1. Using a temporal two-alternative-forced-choice task we measured discrimination thresholds to global changes in the patterns of complex motion (GMP task), [13], and shifts in the center-of-motion (COM task). Stimuli were presented with central fixation using a constant stimulus paradigm and consisted of dynamic random dot displays presented in a 24o annular region (central 4o removed). In each task, the stimulus duration was randomly perturbed across presentations (440±40 msec) to control for timing-based cues, and dots moved coherently through a radial speed Figure 2: a) GMP thresholds across 8 'test' motions at two mean dot speeds for two observers. Performance varied continuously with thresholds for radial motions (φ=0, 180o) significantly lower than those for circular motions (φ=90,270o), (p<0.001; t(37)=3.39). b) COM thresholds at three mean dot speeds for two observers. As with the GMP task, performance varied continuously with thresholds for radial motions significantly lower than those for circular motions, (p<0.001; t(37)=4.47). gradient in directions consistent with the global motion pattern presented. Discrimination thresholds were obtained across eight ‘test’ motions corresponding to expansion, contraction, CW and CCW rotation, and the four intermediate spiral motions. To minimize adaptation to specific motion patterns, opposing motions (e.g., expansion/ contraction) were interleaved across paired presentations. 2.1 Results Discrimination thresholds are reported here from a subset of the observer population consisting of three experienced psychophysical observers, one of which was naïve to the purpose of the psychophysical tasks. For each condition, performance is reported as the mean and standard error averaged across 8-12 thresholds. Across observers and dot speeds GMP thresholds followed a distinct trend in the stimulus space [13], with radial motions (expansion/contraction) significantly lower than circular motions (CW/CCW rotation), (p<0.001; t(37)=3.39), (Fig. 2a). 
While thresholds for the intermediate spiral motions were not significantly different from circular motions (p=0.223, t(60)=0.74), the trends across 'test' motions were well fit within the stimulus space (SB: r>0.82, SC: r>0.77) by sinusoids whose period and phase were 196 ± 10o and -72 ± 20o respectively (Fig. 1a). When the radial speed gradient was removed by randomizing the spatial distribution of dot speeds, threshold performance increased significantly across observers (p<0.05; t(17)=1.91), particularly for circular motions (p<0.005; t(25)=3.31), (data not shown). Such performance suggests a perceptual contribution associated with the presence of the speed gradient and is particularly interesting given the fact that the speed gradient did not contribute computationally relevant information to the task. However, the speed gradient did convey information regarding the integrative structure of the global motion field and as such suggests a preference of the underlying motion mechanisms for spatially structured speed information. Similar trends in performance were observed in the COM task across observers and dot speeds. Discrimination thresholds varied continuously as a function of the 'test' motion with thresholds for radial motions significantly lower than those for circular motions, (p<0.001; t(37)=4.47) and could be well fit by a sinusoidal trend line (e.g. SB at 3 deg/s: r>0.91, period = 178 ± 10 o and phase = -70 ± 25o), (Fig. 2b). 2.2 A local or global task? The consistency of the cyclic threshold profile in stimuli that restricted the temporal integration of individual dot motions [13], and simultaneously contained all directions of motion, generally argues against a primary role for local motion mechanisms in the psychophysical tasks. While the psychophysical literature has reported a wide variety of “local” motion direction anisotropies whose properties are reminiscent of the results observed here, e.g. [14], all would predict equivalent thresholds for radial and circular motions for a set of uniformly distributed and/or spatially restricted motion direction mechanisms. Together with the computational impact of the speed gradient and psychophysical studies supporting the existence of wide-field motion pattern mechanisms [5, 6], these results suggest that the threshold differences across the GMP and COM tasks may be associated with variations in the computational properties across a series of specialized motion pattern mechanisms. 3 A computational model The similarities between the motion pattern stimuli used to quantify human perception and the visual motion properties of cells in MSTd suggests that MSTd may play a computational role in the psychophysical tasks. To examine this hypothesis, we constructed a population of MSTd-like units whose visual motion properties were consistent with the reported neurophysiology (see [13] for details). Across the population, the distribution of receptive field centers was uniform across polar angle and followed a gamma distribution Γ(5,6) across eccenticity [7]. For each unit, visual motion responses followed a gaussian tuning profile as a function of the stimulus flow angle G( φ), (σi=60±30o; [10]), and the distance of the stimulus COM from the unit’s receptive field center Gsat(xi, yi, σs=19o), Eq. 1, such that its preferred motion response was position invariant to small shifts in the COM [10] and degraded continuously for large shifts [9]. 
Within the model, simulations were categorized according to the distribution of preferred motions represented across the population (one reported in MSTd and a uniform control). The first distribution simulated an expansion bias, in which the density of preferred motions decreased symmetrically from expansion to contraction [10]. The second distribution simulated a uniform preference for all motions and was used as a control to quantify the effects of an expansion bias on psychophysical performance. Throughout the paper we refer to simulations containing these distributions as 'Expansion-biased' and 'Uniform' respectively.

3.1 Extracting perceptual estimates from the neural code

For each stimulus presentation, the ith unit's response, Ri, was calculated as the average firing rate given by the product of its motion pattern and spatial tuning profiles,

R_i = R_{max} \, G(\min[\phi - \phi_i], \sigma_{t_i}) \, G_{sat_i}(x - x_i, y - y_i, \sigma_s) + P(\lambda = 12)    (1)

where Rmax is the maximum preferred stimulus response (spikes/s), min[ ] refers to the minimum angular distance between the stimulus flow angle φ and the unit's preferred motion φi, Gsat is the unit's spatial tuning profile saturated within the central 5±3°, σti and σs are the standard deviations of the unit's motion pattern and spatial tuning profiles respectively, (xi, yi) is the spatial location of the unit's receptive field center, (x, y) is the spatial location of the stimulus COM, and P(λ=12) is the background activity simulated as an uncorrelated Poisson process.

Figure 3: Model vs. psychophysical performance for independently responding units. Model thresholds are reported as the average (±1 S.E.) across five simulated populations. a) GMP thresholds were highest for contracting motions and lowest for expanding motions across all Expansion-biased populations. b) Comparable trends in performance were observed for COM thresholds. Comparison with the Uniform control simulations in both tasks (2000 units shown here) indicates that thresholds closely followed the distribution of preferred motions simulated within the model.

The psychophysical tasks were simulated using a modified center-of-gravity approach to decode estimates of the stimulus properties, i.e. the flow angle (φ̂) and the COM location in the visual field (x̂, ŷ), from the neural population,

(\hat{x}, \hat{y}, \hat{\phi}) = \left( \frac{\sum_i x_i R_i}{\sum_i R_i}, \; \frac{\sum_i y_i R_i}{\sum_i R_i}, \; \frac{\sum_i \vec{\phi}_i R_i}{\sum_i R_i} \right)    (2)

where \vec{\phi}_i is the unit vector in the stimulus space (Fig. 1a) corresponding to the unit's preferred motion. For each set of paired stimuli, psychophysical judgments were made by comparing the estimated stimulus properties according to the discrimination criteria specified in the psychophysical tasks. As with the psychophysical experiments, discrimination thresholds were computed using a least-squares fit to percent correct performance across constant stimulus levels.
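Continuing the sketch above, Eqs. 1 and 2 translate almost line for line into code (the value of Rmax, the saturation implementation, and the Poisson draw per presentation are our assumptions):

def ang_dist(a, b):
    d = np.abs(a - b) % 360.0
    return np.minimum(d, 360.0 - d)                  # min[ ] angular distance, as in Eq. 1

def responses(phi_stim, com, Rmax=40.0):
    # Eq. 1: gaussian motion-pattern tuning x saturated spatial tuning + Poisson background
    g_phi = np.exp(-ang_dist(phi_stim, phi_pref) ** 2 / (2.0 * sig_t ** 2))
    r = np.hypot(com[0] - xc, com[1] - yc)
    g_sp = np.exp(-np.maximum(r - 5.0, 0.0) ** 2 / (2.0 * sig_s ** 2))  # flat within ~5 deg
    return Rmax * g_phi * g_sp + rng.poisson(12.0, N)                    # P(lambda = 12)

def decode(R):
    # Eq. 2: center-of-gravity estimates of the COM, and the flow angle via the
    # response-weighted sum of unit vectors for the preferred motions
    x_hat = np.sum(xc * R) / np.sum(R)
    y_hat = np.sum(yc * R) / np.sum(R)
    v = np.array([np.cos(np.deg2rad(phi_pref)), np.sin(np.deg2rad(phi_pref))]) @ R
    phi_hat = np.rad2deg(np.arctan2(v[1], v[0])) % 360.0
    return x_hat, y_hat, phi_hat

R = responses(0.0, (0.0, 0.0))   # expansion with the COM at fixation
print(decode(R))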
3.2 Simulation 1: Independent neural responses

In the first series of simulations, GMP and COM thresholds were quantified across three populations (500, 1000, and 2000 units) of independently responding units for each simulated distribution (Expansion-biased and Uniform). Across simulations, both the range in thresholds and their trends across 'test' motions were compared with human psychophysical performance to quantify the effects of population size and of an expansion-biased preferred motion distribution on model performance. Over the psychophysical range of interest (φp ± 7°), GMP thresholds for contracting motions were at chance across all Expansion-biased populations (Fig. 3a). While thresholds for expanding motions were generally consistent with those for human observers, those for circular motions remained significantly higher for all but the largest populations. Similar trends in performance were observed for the COM task (Fig. 3b). Here the range of COM thresholds was well matched with human performance for simulations containing 1000 units; however, the trends across motion patterns remained inconsistent even for the largest populations.

For simulations containing a uniform distribution of preferred motions, the threshold range was consistent with human performance on both tasks; however, the trend across motion patterns was generally flat. What variability did occur was due primarily to the discrete sampling of preferred motions across the population. Comparison of the discrimination thresholds for the Expansion-biased and Uniform populations indicates that the trend across thresholds was closely matched to the underlying distributions of preferred motions. This result is due in part to the near-equal weighting of independently responding units and can be explained to a first approximation by the proportional increase in the signal-to-noise ratio across the population as a function of the density of units responsive to a given 'test' motion.

3.3 Simulation 2: An interconnected neural structure

In a second series of simulations, we examined the computational effect of adding recurrent connections between units. If the distribution of preferred motions in MSTd is in fact biased towards expansions, as the neurophysiology suggests, it seems unlikely that independent estimates of the visual motion information would be sufficient to yield the threshold profiles observed in the psychophysical tasks. We hypothesize that a simple fixed architecture of excitatory and/or inhibitory connections is sufficient to account for the cyclic trends in discrimination thresholds. Specifically, we propose that a recurrent connection profile whose strength varies as a function of (a) the similarity between preferred motion patterns and (b) the distance between receptive field centers is computationally sufficient to recover the trends in GMP/COM performance (Fig. 4),

w_{ij} = S_R \, e^{-[(x_i - x_j)^2 + (y_i - y_j)^2] / 2\sigma_{Re}^2} - S_R \, e^{-[(x_i - x_j)^2 + (y_i - y_j)^2] / 2\sigma_{Ri}^2} - S_\phi \, e^{-(\min[\phi_i - \phi_j] - 180°)^2 / 2\sigma_I^2}    (3)

where wij is the strength of the recurrent connection between the ith and jth units, (xi, yi) and (xj, yj) denote the spatial locations of their receptive field centers, σRe (=10°) and σRi (=80°) together define the spatial extent of a difference-of-gaussians interaction between receptive field centers, and SR and Sφ scale the connection strength. To examine the effects of the spread of motion pattern-specific inhibition and of the connection strength in the model, σI, Sφ, and SR were treated as free parameters.

Figure 4: Proposed recurrent connection profile between motion pattern units. a) Across the motion pattern space, connection strength followed an inverse gaussian profile such that the ith unit (with preferred motion φi) systematically inhibited units with anti-preferred motions centered at 180°+φi. b) Across the visual field, connection strength followed a difference-of-gaussians profile as a function of the relative distance between receptive field centers, such that spatially local units were mutually excitatory (σRe=10°) and more distant units were mutually inhibitory (σRi=80°).

Figure 5: Model vs. psychophysical performance for populations containing recurrent connections (σI=80°). As the number of units increased for Expansion-biased populations, discrimination thresholds decreased to psychophysical levels and the sinusoidal trend in thresholds emerged for both the (a) GMP and (b) COM tasks. Sinusoidal trends were established for as few as 1000 units and were well fit (r>0.9) by sinusoids whose periods and phases were (193.8 ± 11.7°, -70.0 ± 22.6°) and (168.2 ± 13.7°, -118.8 ± 31.8°) for the GMP and COM tasks respectively.
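Eq. 3 can be written down directly; the sketch below, continuing the population code above, builds the weight matrix with the inhibition centered on the anti-preferred motion, as in Fig. 4. The linear relaxation dynamics and the specific parameter values are our assumptions (the values are taken from the Monte Carlo ranges quoted in the next paragraph; the excerpt does not spell out the recurrent update rule itself).

def recurrent_weights(S_R=0.03, S_phi=0.05, sig_Re=10.0, sig_Ri=80.0, sig_I=80.0):
    # Eq. 3: difference-of-gaussians in space, minus gaussian inhibition
    # centered on the anti-preferred motion (Fig. 4)
    d2 = (xc[:, None] - xc[None, :]) ** 2 + (yc[:, None] - yc[None, :]) ** 2
    dphi = ang_dist(phi_pref[:, None], phi_pref[None, :])
    w = (S_R * np.exp(-d2 / (2.0 * sig_Re ** 2))
         - S_R * np.exp(-d2 / (2.0 * sig_Ri ** 2))
         - S_phi * np.exp(-(dphi - 180.0) ** 2 / (2.0 * sig_I ** 2)))
    np.fill_diagonal(w, 0.0)                 # no self-connection (our assumption)
    return w

def settle(R_ff, W, steps=20, dt=0.2):
    # Assumed linear relaxation toward the feedforward drive plus recurrent input,
    # rectified to keep rates non-negative
    R = R_ff.astype(float)
    for _ in range(steps):
        R = np.maximum(R + dt * (-R + R_ff + W @ R), 0.0)
    return R

# Connection strengths scale inversely with population size (see below)
W = recurrent_weights(S_R=0.03 * 1000.0 / N, S_phi=0.05 * 1000.0 / N)
R_rec = settle(responses(0.0, (0.0, 0.0)), W)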
Within the parameter space used to define the recurrent connections (i.e., σI, Sφ, and SR), Monte Carlo simulations of Expansion-biased model performance (1000 units) yielded regions of high correlation with the psychophysical thresholds (r>0.7) on both tasks that were consistent across independently simulated populations. Typically these regions were well defined over a broad range, such that there was significant overlap between tasks (e.g., for the GMP task (SR=0.03), σI=[45°,120°] and Sφ=[0.03,0.3]; for the COM task (σI=80°), Sφ=[0.03,0.08] and SR=[0.005,0.04]). Fig. 5 shows averaged threshold performance for simulations of interconnected units drawn from the highly correlated regions of the (σI, Sφ, SR) parameter space. For populations not explicitly examined in the Monte Carlo simulations, the connection strengths (Sφ, SR) were scaled inversely with population size to maintain an equivalent level of recurrent activity. With the incorporation of recurrent connections, the sinusoidal trend in GMP and COM thresholds emerged for Expansion-biased populations as the number of units increased. In both tasks the cyclic threshold profiles were established for 1000 units and were well fit (r>0.9) by sinusoids whose periods and phases were consistent with human performance. Unlike the Expansion-biased populations, Uniform populations were not significantly affected by the presence of recurrent connections (Fig. 5). Both the range in thresholds and the flat trend across motion patterns were well matched to those in Section 3.2. Together these results suggest that the sinusoidal trends in GMP and COM performance may be mediated by the combined contribution of the recurrent interconnections and the bias in preferred motions across the population.

4 Discussion

Using a biologically constrained computational model in conjunction with human psychophysical performance on two motion pattern tasks, we have shown that the visual motion information encoded across an interconnected population of cells responsive to motion patterns, such as those in MSTd, is computationally sufficient to extract perceptual estimates consistent with human performance. Specifically, we have shown that the cyclic trend in psychophysical performance observed across tasks (a) cannot be reproduced using populations of independently responding units and (b) is dependent, in part, on the presence of an expanding motion bias in the distribution of preferred motions across the neural population. The model's performance suggests the presence of specific recurrent structures within motion pattern responsive areas, such as MSTd, whose strength varies as a function of the similarity between preferred motion patterns and the distance between receptive field centers. While such structures have not been explicitly examined in MSTd and other higher visual motion areas, there is anecdotal support for the presence of inhibitory connections [8].
Together, these results suggest that robust processing of the motion patterns associated with self-motion and optic flow may be mediated, in part, by recurrent structures in extrastriate visual motion areas whose distributions of preferred motions are biased strongly in favor of expanding motions.

Acknowledgments

This work was supported by National Institutes of Health grant EY-2R01-07861-13 to L.M.V.

References

[1] Malach, R., Schirman, T., Harel, M., Tootell, R., & Malonek, D., (1997), Cerebral Cortex, 7(4): 386-393.
[2] Gilbert, C. D., (1992), Neuron, 9: 1-13.
[3] Koechlin, E., Anton, J., & Burnod, Y., (1999), Biological Cybernetics, 80: 25-44.
[4] Stemmler, M., Usher, M., & Niebur, E., (1995), Science, 269: 1877-1880.
[5] Burr, D. C., Morrone, M. C., & Vaina, L. M., (1998), Vision Research, 38(12): 1731-1743.
[6] Meese, T. S. & Harris, S. J., (2002), Vision Research, 42: 1073-1080.
[7] Tanaka, K. & Saito, H. A., (1989), Journal of Neurophysiology, 62(3): 626-641.
[8] Duffy, C. J. & Wurtz, R. H., (1991), Journal of Neurophysiology, 65(6): 1346-1359.
[9] Duffy, C. J. & Wurtz, R. H., (1995), Journal of Neuroscience, 15(7): 5192-5208.
[10] Graziano, M. S., Anderson, R. A., & Snowden, R., (1994), Journal of Neuroscience, 14(1): 54-67.
[11] Celebrini, S. & Newsome, W., (1994), Journal of Neuroscience, 14(7): 4109-4124.
[12] Celebrini, S. & Newsome, W. T., (1995), Journal of Neurophysiology, 73(2): 437-448.
[13] Beardsley, S. A. & Vaina, L. M., (2001), Journal of Computational Neuroscience, 10: 255-280.
[14] Matthews, N. & Qian, N., (1999), Vision Research, 39: 2205-2211.
3 0.53334415 12 nips-2003-A Model for Learning the Semantics of Pictures
Author: Victor Lavrenko, R. Manmatha, Jiwoon Jeon
Abstract: We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vector. Given a training set of images with annotations, we compute a joint probabilistic model of image features and words which allows us to predict the probability of generating a word given the image regions. This may be used to automatically annotate and retrieve images given a word as a query. Experiments show that our model significantly outperforms the best of the previously reported results on the tasks of automatic image annotation and retrieval. 1
4 0.5151825 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
Author: Leonid Sigal, Michael Isard, Benjamin H. Sigelman, Michael J. Black
Abstract: The detection and pose estimation of people in images and video is made challenging by the variability of human appearance, the complexity of natural scenes, and the high dimensionality of articulated body models. To cope with these problems we represent the 3D human body as a graphical model in which the relationships between the body parts are represented by conditional probability distributions. We formulate the pose estimation problem as one of probabilistic inference over a graphical model where the random variables correspond to the individual limb parameters (position and orientation). Because the limbs are described by 6-dimensional vectors encoding pose in 3-space, discretization is impractical and the random variables in our model must be continuous-valued. To approximate belief propagation in such a graph we exploit a recently introduced generalization of the particle filter. This framework facilitates the automatic initialization of the body model from low-level cues and is robust to occlusion of body parts and scene clutter. 1
5 0.51272285 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
Author: Amit Gruber, Yair Weiss
Abstract: The problem of “Structure From Motion” is a central problem in vision: given the 2D locations of certain points we wish to recover the camera motion and the 3D coordinates of the points. Under simplified camera models, the problem reduces to factorizing a measurement matrix into the product of two low rank matrices. Each element of the measurement matrix contains the position of a point in a particular image. When all elements are observed, the problem can be solved trivially using SVD, but in any realistic situation many elements of the matrix are missing and the ones that are observed have a different directional uncertainty. Under these conditions, most existing factorization algorithms fail while human perception is relatively unchanged. In this paper we use the well known EM algorithm for factor analysis to perform factorization. This allows us to easily handle missing data and measurement uncertainty and more importantly allows us to place a prior on the temporal trajectory of the latent variables (the camera position). We show that incorporating this prior gives a significant improvement in performance in challenging image sequences. 1
6 0.44695324 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
7 0.3794964 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion
8 0.30427611 10 nips-2003-A Low-Power Analog VLSI Visual Collision Detector
9 0.24227567 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
10 0.22389928 21 nips-2003-An Autonomous Robotic System for Mapping Abandoned Mines
11 0.20689538 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
12 0.20534848 175 nips-2003-Sensory Modality Segregation
13 0.19749954 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution
14 0.19702576 196 nips-2003-Wormholes Improve Contrastive Divergence
15 0.19583532 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
16 0.18488079 44 nips-2003-Can We Learn to Beat the Best Stock
17 0.18334147 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
18 0.18105893 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
19 0.17952523 75 nips-2003-From Algorithmic to Subjective Randomness
20 0.17624421 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science
topicId topicWeight
[(0, 0.038), (11, 0.47), (30, 0.014), (35, 0.036), (53, 0.082), (66, 0.02), (69, 0.023), (71, 0.039), (76, 0.038), (85, 0.053), (91, 0.086)]
simIndex simValue paperId paperTitle
1 0.97633952 11 nips-2003-A Mixed-Signal VLSI for Real-Time Generation of Edge-Based Image Vectors
Author: Masakazu Yagi, Hideo Yamasaki, Tadashi Shibata
Abstract: A mixed-signal image filtering VLSI has been developed aiming at real-time generation of edge-based image vectors for robust image recognition. A four-stage asynchronous median detection architecture based on analog-digital mixed-signal circuits has been introduced to determine the threshold value of edge detection, the key processing parameter in vector generation. As a result, a fully seamless pipeline processing from threshold detection to edge feature map generation has been established. A prototype chip was designed in a 0.35-µm double-polysilicon three-metal-layer CMOS technology, and the concept was verified by the fabricated chip. The chip generates a 64-dimension feature vector from a 64x64-pixel gray scale image every 80 µsec. This is about 10^4 times faster than the software computation, making a real-time image recognition system feasible.

1 Introduction

The development of human-like image recognition systems is a key issue in information technology. However, the algorithms developed for robust image recognition so far [1]-[3] are mostly implemented as software systems running on general-purpose computers. Since the algorithms are generally complex and include a lot of floating-point operations, they are computationally too expensive for building real-time systems. The development of hardware-friendly algorithms and their direct VLSI implementation would be a promising solution for real-time response systems. Inspired by the biological principle that edge information is detected first in the visual cortex, we have developed an edge-based image representation algorithm compatible with hardware processing. In this algorithm, multiple-direction edges extracted from an original gray scale image are utilized to form a feature vector. Since the spatial distribution of principal edges is represented by a vector, it was named Projected Principal-Edge Distribution (PPED) [4],[5], formerly called Principal Axis Projection (PAP) [6],[7]. (The algorithm is explained later.) Since PPED vectors represent the human perception of similarity among images very well, robust image recognition systems have been developed using PPED vectors in conjunction with an analog soft pattern classifier [4],[8], a digital VQ (Vector Quantization) processor [9], and support vector machines [10]. The robust nature of the PPED representation is demonstrated in Fig. 1, where the system was applied to cephalometric landmark identification (identifying specific anatomical landmarks on medical radiographs), one of the most important clinical practices of expert dentists in orthodontics [6],[7]. Typical X-ray images of the kind studied by apprentice doctors were converted to PPED vectors and utilized as templates for vector matching. The system performance has been proven on 250 head film samples regarding the fundamental 26 landmarks [11]. Important to note is the successful detection of the landmark on the soft tissue boundary (the tip of the lower lip) shown in Fig. 1(c). Landmarks on soft tissues are very difficult to detect compared to landmarks on hard tissues (solid bones), because only faint images are captured on radiographs. The successful detection is due to the median algorithm that determines the threshold value for edge detection.

Fig. 1: Image recognition using PPED vectors: (a,b) cephalometric landmark identification (Sella, Nasion, Orbitale; by our system vs. by expert dentists); (c) successful landmark detection on soft tissue.
We have adopted the median value of the spatial variance of luminance within the filtering kernel (5x5 pixels), which allows us to extract all essential features in a delicate gray scale image. However, the problem is the high computational cost of determining the median value. It takes about 0.6 sec to generate one PPED vector from a 64x64-pixel image (a standard image size for recognition in our system) on a SUN workstation, making real-time processing unrealistic. About 90% of the computation time is spent on edge detection from the input image, most of it on median detection. The purpose of this work, therefore, is to develop a new-architecture median-filter VLSI subsystem for real-time PPED-vector generation. Special attention has been paid to realizing fully seamless pipeline processing from threshold detection to edge feature map generation by employing a four-stage asynchronous median detection architecture.

2 Projected Principal Edge Distribution (PPED)

The Projected Principal Edge Distribution (PPED) algorithm [5],[6] is briefly explained using Fig. 2(a). A 5x5-pixel block taken from a 64x64-pixel target image is subjected to edge detection filtering in four principal directions, i.e. horizontal, vertical, and ±45-degree directions. In the figure, horizontal edge filtering is shown as an example. (The filtering kernels used for edge detection are given in Fig. 2(b).) In order to determine the threshold value for edge detection, all the absolute-value differences between two neighboring pixels are calculated in both the vertical and horizontal directions, and the median value is taken as the threshold. By scanning the 5x5-pixel filtering kernels over the target image, four 64x64 edge-flag maps, called feature maps, are generated. In the horizontal feature map, for example, edge flags in every four rows are accumulated, and the spatial distribution of edge flags is represented by a histogram having 16 elements. Similar procedures are applied to the other three directions to form respective histograms, each having 16 elements. Finally, a 64-dimension vector is formed by series-connecting the four histograms in the order of horizontal, +45-degree, vertical, and -45-degree.

Fig. 2: PPED algorithm (a) and filtering kernels for edge detection (b).
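The PPED algorithm of Section 2 is compact enough to sketch in software before turning to the hardware. In the fragment below (Python/NumPy; our own illustration, not the authors' code), the horizontal and +45-degree kernels are transcribed from Fig. 2(b), while the vertical kernel, the edge-flag rule, and the diagonal accumulation for the ±45-degree histograms are our assumptions:

import numpy as np

# Kernels from Fig. 2(b); vertical is taken as the transpose of horizontal
# and -45deg as the vertical flip of +45deg (our reading of the figure)
H = np.array([[ 0,  0,  0,  0,  0],
              [ 1,  1,  1,  1,  1],
              [ 0,  0,  0,  0,  0],
              [-1, -1, -1, -1, -1],
              [ 0,  0,  0,  0,  0]], float)
D45 = np.array([[ 0,  0,  0,  1,  0],
                [ 0,  1,  1,  0, -1],
                [ 0,  1,  0, -1,  0],
                [ 1,  0, -1, -1,  0],
                [ 0, -1,  0,  0,  0]], float)
KERNELS = [H, D45, H.T, np.flipud(D45)]   # horizontal, +45, vertical, -45

def block_threshold(blk):
    # Median of all 40 |differences| between neighboring pixels (20 horizontal + 20 vertical)
    h = np.abs(np.diff(blk, axis=1)).ravel()
    v = np.abs(np.diff(blk, axis=0)).ravel()
    return np.median(np.concatenate([h, v]))

def pped_vector(img):
    # img: 64x64 gray scale array; returns the 64-dimension PPED vector
    n = img.shape[0]
    maps = np.zeros((4, n, n), bool)
    for y in range(2, n - 2):
        for x in range(2, n - 2):
            blk = img[y-2:y+3, x-2:x+3].astype(float)
            resp = [abs(float((k * blk).sum())) for k in KERNELS]
            d = int(np.argmax(resp))
            if resp[d] > block_threshold(blk):   # flag the dominant edge (assumed rule)
                maps[d, y, x] = True
    hists = []
    for d in range(4):
        m = maps[d]
        if d == 0:                      # horizontal: accumulate every 4 rows
            hists.append(m.reshape(16, 4, n).sum(axis=(1, 2)))
        elif d == 2:                    # vertical: accumulate every 4 columns
            hists.append(m.reshape(n, 16, 4).sum(axis=(0, 2)))
        else:                           # +/-45deg: accumulate along diagonal bands (our simplification)
            mm = m[::-1] if d == 3 else m
            band = np.add.outer(np.arange(n), np.arange(n)) // 8
            hists.append(np.bincount(band.ravel(), weights=mm.ravel(), minlength=16)[:16])
    return np.concatenate(hists)        # order: horizontal, +45, vertical, -45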
3 System Organization

The system organization of the feature map generation VLSI is illustrated in Fig. 3. The system receives one column of data (8-b x 5 pixels) at each clock and stores the data in the last column of the 5x6 image buffer. The image buffer shifts all the stored data to the right at every clock. Before the edge filtering circuit (EFC) starts detecting the four direction edges with respect to the center pixel in the 5x5 block, the threshold value calculated from all the pixel data in the 5x5 block must be ready in time for the processing. In order to keep the threshold detection and the edge filtering processing coherent, the two most recently arrived columns of data (columns 5 and 6) are given to the median filter circuit (MFC) in advance via the absolute value circuit (AVC). The AVC calculates all luminance differences between two neighboring pixels in columns 5 and 6. In this manner, fully seamless pipeline processing from threshold detection to edge feature map generation has been established. The key requirement here is that the MFC must determine the median value of the 40 luminance difference data from the 5x5-pixel block fast enough to sustain the seamless pipeline processing. For this purpose, a four-stage asynchronous median detection architecture has been developed, which is explained in the following.

Fig. 3: System organization of the feature map generation VLSI.

The well-known binary search algorithm was adopted for fast execution of median detection. The median search processing for five 4-b data is illustrated in Fig. 4 for the purpose of explanation. In the beginning, majority voting is carried out over the MSBs of all the data: the number of 1's is compared with the number of 0's, and the majority group wins. The majority group flag ("0" in this example) is stored as the MSB of the median value. In addition, the loser group is withdrawn from the following voting by changing all of its remaining bits to the loser MSB ("1" in this example). By repeating this processing, the median value is finally stored in the median value register.

Fig. 4: Hardware algorithm for median detection by binary search.

How the median value is detected from all the 40 8-b data (20 horizontal and 20 vertical luminance difference data) is illustrated in Fig. 5. All the data are stored in an array of median detection units (MDUs). At each clock, the array receives four vertical and five horizontal luminance difference data calculated from the data in columns 5 and 6 in Fig. 3. The entire data set is shifted downward at each clock. The median search is carried out for the upper four bits and the lower four bits separately in order to enhance throughput by pipelining. For this purpose, the chip is equipped with eight majority voting circuits (MVC 0~7). The upper four bits from all the data are processed by MVC 4~7 in a single clock cycle to yield the median value. In the next clock cycle, the loser information is transferred to the lower four bits within each MDU, and MVC 0~3 carry out the median search for the lower four bits from all the data in the array.

Fig. 5: Median detection architecture for all 40 luminance difference data.
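The bit-serial majority-voting search is easy to emulate in software. The sketch below (our own model of the procedure in Fig. 4, not the chip's logic) computes the median of n-bit data, with ties broken toward 0 as done by the grounded 41st MVC input described in the next paragraph:

def bitwise_median(values, nbits=8):
    # Per-bit majority voting from the MSB down; losers of each vote have all
    # remaining lower-order bits forced to their losing bit value, so they stay
    # on the far side of every later vote
    vals = list(values)
    median = 0
    for bit in range(nbits - 1, -1, -1):
        ones = sum((v >> bit) & 1 for v in vals)
        maj = 1 if 2 * ones > len(vals) else 0   # tie -> 0, like the grounded 41st input
        median = (median << 1) | maj
        mask = (1 << bit) - 1                    # the remaining, lower-order bits
        vals = [v | mask if (((v >> bit) & 1) == 1 and maj == 0) else
                v & ~mask if (((v >> bit) & 1) == 0 and maj == 1) else v
                for v in vals]
    return median

# Example: the per-block edge threshold from the 40 8-bit luminance differences
# thr = bitwise_median(abs_diffs, nbits=8)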
The majority voting circuit (MVC) is shown in Fig. 6. Output-connected CMOS inverters are employed as preamplifiers for majority detection, a technique first proposed in Ref. [12]. In the present implementation, however, two preamps receiving the input data and the inverted input data are connected to a 2-stage differential amplifier. Although this doubles the area penalty, the instability in the threshold for majority detection due to process and temperature variations is remarkably improved as compared to the single-inverter thresholding in Ref. [12]. The MVC in Fig. 6 has 41 input terminals, although 40 bits of data are input to the circuit at one time. Bit "0" is always given to the terminal IN40 to yield "0" as the majority when there is a tie in the majority voting.

Fig. 6: Majority voting circuit (MVC).

The edge filtering circuit (EFC) in Fig. 3 is composed as a four-stage pipeline of regular CMOS digital logic. In the first two stages, the four-direction edge gradients are computed, and in the succeeding two stages, the detection of the largest gradient and the thresholding are carried out to generate the four edge flags.

4 Experimental Results

The feature map generation VLSI was fabricated in a 0.35-µm double-poly three-metal-layer CMOS technology. A photomicrograph of the proof-of-concept chip is shown in Fig. 7. The measured waveforms of the MVC at operating frequencies of 10 MHz and 90 MHz are demonstrated in Fig. 8. The input condition is the worst case: 21 "1" bits and 20 "0" bits were fed to the inputs. The observed computation time is about 12 nsec, which is larger than the simulation result of 2.5 nsec. This was caused by the capacitive loading due to the probing of the test circuit. In the real circuit without external probing, we confirmed an average computation time of 4~5 nsec.

Fig. 7: Photomicrograph and specifications of the fabricated proof-of-concept chip (processing technology: 0.35-µm CMOS, 2-poly 3-metal; chip size: 4.5 mm x 4.5 mm; supply voltage: 3.3 V; operating frequency: 50 MHz).

Fig. 8: Measured waveforms of the majority voting circuit (MVC) at operating frequencies of 10 MHz (a) and 90 MHz (b) for the worst-case input data.

The feature maps generated by the chip at an operating frequency of 25 MHz are demonstrated in Fig. 9. The power dissipation was 224 mW. The difference between the flag bits detected by the chip and those obtained by computer simulation is also shown in the figure. The number of error flags was from 80 to 120 out of 16,384 flags, only 0.6% of the total. The occurrence of such error bits is anticipated, since we employed analog circuits for median detection. However, such errors do not cause any serious problems in the PPED algorithm, as demonstrated in Figs. 10 and 11. The template matching results with the top five PPED vector candidates in Sella identification are demonstrated in Fig. 11, where Manhattan distance was adopted as the dissimilarity measure. The error in the feature map generation processing yields a constant bias in the dissimilarity and does not affect the result of the maximum likelihood search.

Fig. 9: Feature maps for the Sella pattern generated by the chip (horizontal, +45-degree, vertical, and -45-degree), with the differences relative to computer simulation.
Fig. 10: PPED vector for the Sella pattern generated by the chip, together with the difference in vector components between the chip output and computer simulation; dissimilarity is measured by Manhattan distance.

Fig. 11: Comparison of template matching results (measured data vs. computer simulation) for the top five candidates in Sella recognition.

5 Conclusion

A mixed-signal median filter VLSI circuit for PPED vector generation has been presented. A four-stage asynchronous median detection architecture based on analog-digital mixed-signal circuits has been introduced. As a result, fully seamless pipeline processing from threshold detection to edge feature map generation has been established. A prototype chip was designed in a 0.35-µm CMOS technology, and the fabricated chip generates an edge-based image vector every 80 µsec, which is about 10^4 times faster than the software computation.

Acknowledgments

The VLSI chip in this study was fabricated in the chip fabrication program of the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Rohm Corporation and Toppan Printing Corporation. The work is partially supported by the Ministry of Education, Science, Sports, and Culture under a Grant-in-Aid for Scientific Research (No. 14205043) and by JST in the CREST program.

References

[1] C. Liu and H. Wechsler, "Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition", IEEE Transactions on Image Processing, Vol. 11, No. 4, Apr. 2002.
[2] C. Yen-ting, C. Kuo-sheng, and L. Ja-kuang, "Improving cephalogram analysis through feature subimage extraction", IEEE Engineering in Medicine and Biology Magazine, Vol. 18, No. 1, 1999, pp. 25-31.
[3] H. Potlapalli and R. C. Luo, "Fractal-based classification of natural textures", IEEE Transactions on Industrial Electronics, Vol. 45, No. 1, Feb. 1998.
[4] T. Yamasaki and T. Shibata, "Analog Soft-Pattern-Matching Classifier Using Floating-Gate MOS Technology," Advances in Neural Information Processing Systems 14, Vol. II, pp. 1131-1138.
[5] M. Yagi and T. Shibata, "An Image Representation Algorithm Compatible to Neural-Associative-Processor-Based Hardware Recognition Systems," IEEE Trans. Neural Networks, Vol. 14, No. 5, pp. 1144-1161, September 2003.
[6] M. Yagi, M. Adachi, and T. Shibata,
same-paper 2 0.91915858 37 nips-2003-Automatic Annotation of Everyday Movements
Author: Deva Ramanan, David A. Forsyth
Abstract: This paper describes a system that can annotate a video sequence with: a description of the appearance of each actor; when the actor is in view; and a representation of the actor’s activity while in view. The system does not require a fixed background, and is automatic. The system works by (1) tracking people in 2D and then, using an annotated motion capture dataset, (2) synthesizing an annotated 3D motion sequence matching the 2D tracks. The 3D motion capture data is manually annotated off-line using a class structure that describes everyday motions and allows motion annotations to be composed — one may jump while running, for example. Descriptions computed from video of real motions show that the method is accurate.
3 0.8849951 88 nips-2003-Image Reconstruction by Linear Programming
Author: Koji Tsuda, Gunnar Rätsch
Abstract: A common way of image denoising is to project a noisy image to the subspace of admissible images made for instance by PCA. However, a major drawback of this method is that all pixels are updated by the projection, even when only a few pixels are corrupted by noise or occlusion. We propose a new method to identify the noisy pixels by ℓ1-norm penalization and update the identified pixels only. The identification and updating of noisy pixels are formulated as one linear program which can be solved efficiently. In particular, one can apply the ν-trick to directly specify the fraction of pixels to be reconstructed. Moreover, we extend the linear program to be able to exploit prior knowledge that occlusions often appear in contiguous blocks (e.g. sunglasses on faces). The basic idea is to penalize boundary points and interior points of the occluded area differently. We are able to show the ν-property also for this extended LP, leading to a method that is easy to use. Experimental results impressively demonstrate the power of our approach.
4 0.79830784 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
Author: Jason Weston, Dengyong Zhou, André Elisseeff, William S. Noble, Christina S. Leslie
Abstract: A key issue in supervised protein classification is the representation of input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data — examples with known 3D structures, organized into structural classes — while in practice, unlabeled data is far more plentiful. In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency. 1
5 0.6371206 12 nips-2003-A Model for Learning the Semantics of Pictures
Author: Victor Lavrenko, R. Manmatha, Jiwoon Jeon
Abstract: We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vector. Given a training set of images with annotations, we compute a joint probabilistic model of image features and words which allows us to predict the probability of generating a word given the image regions. This may be used to automatically annotate and retrieve images given a word as a query. Experiments show that our model significantly outperforms the best of the previously reported results on the tasks of automatic image annotation and retrieval. 1
6 0.60356081 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution
7 0.52604491 168 nips-2003-Salient Boundary Detection using Ratio Contour
8 0.45739937 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
9 0.44029659 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
10 0.4362886 119 nips-2003-Local Phase Coherence and the Perception of Blur
11 0.43364671 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression
12 0.43259859 164 nips-2003-Ranking on Data Manifolds
13 0.42990446 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion
14 0.42748407 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
15 0.4241614 39 nips-2003-Bayesian Color Constancy with Non-Gaussian Models
16 0.41991401 112 nips-2003-Learning to Find Pre-Images
17 0.41587278 190 nips-2003-Unsupervised Color Decomposition Of Histologically Stained Tissue Samples
18 0.41548997 10 nips-2003-A Low-Power Analog VLSI Visual Collision Detector
19 0.41016012 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
20 0.40581879 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications