nips nips2005 nips2005-110 knowledge-graph by maker-knowledge-mining

110 nips-2005-Learning Depth from Single Monocular Images

Source: pdf

Author: Ashutosh Saxena, Sung H. Chung, Andrew Y. Ng

Abstract: We consider the task of depth estimation from a single monocular image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured outdoor environments which include forests, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a discriminatively-trained Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models both depths at individual points as well as the relation between depths at different points. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We consider the task of depth estimation from a single monocular image. [sent-6, score-0.644]

2 We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured outdoor environments which include forests, trees, buildings, etc. [sent-7, score-0.45]

3 Depth estimation is a challenging problem, since local features alone are insufﬁcient to estimate depth at a point, and one needs to consider the global context of the image. [sent-10, score-0.632]

4 Our model uses a discriminatively-trained Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models both depths at individual points as well as the relation between depths at different points. [sent-11, score-1.076]

5 1 Introduction Recovering 3-D depth from images is a basic problem in computer vision, and has important applications in robotics, scene understanding and 3-D reconstruction. [sent-13, score-0.522]

6 Most work on visual 3-D reconstruction has focused on binocular vision (stereopsis) [1] and on other algorithms that require multiple images, such as structure from motion [2] and depth from defocus [3]. [sent-14, score-0.562]

7 Depth estimation from a single monocular image is a difﬁcult task, and requires that we take into account the global structure of the image, as well as use prior knowledge about the scene. [sent-15, score-0.365]

8 In this paper, we apply supervised learning to the problem of estimating depth from single monocular images of unstructured outdoor environments, ones that contain forests, trees, buildings, people, buses, bushes, etc. [sent-16, score-0.814]

9 Shape from shading [7] offers another method for monocular depth reconstruction, but is difﬁcult to apply to scenes that do not have fairly uniform color and texture. [sent-21, score-0.743]

10 In work done independently of ours, Hoiem, Efros and Hebert (personal communication) also considered monocular 3-D reconstruction, but focused on generating 3-D graphical images rather than accurate metric depthmaps. [sent-22, score-0.302]

11 In this paper, we address the task of learning full depthmaps from single images of unconstrained environments. [sent-23, score-0.275]

12 Our approach is based on capturing depths and relationships between depths using an MRF. [sent-28, score-0.976]

13 Using this training set, the MRF is discriminatively trained to predict depth; thus, rather than modeling the joint distribution of image features and depths, we model only the posterior distribution of the depths given the image features. [sent-30, score-0.919]

14 Our basic model uses L2 (Gaussian) terms in the MRF interaction potentials, and captures depths and interactions between depths at multiple spatial scales. [sent-31, score-1.084]

15 Learning in this model is approximate, but exact MAP posterior inference is tractable (similar to Gaussian MRFs) via linear programming, and it gives signiﬁcantly better depthmaps than the simple Gaussian model. [sent-33, score-0.185]

16 2 Monocular Cues Humans appear to be extremely good at judging depth from single monocular images. [sent-34, score-0.644]

17 [12] This is done using monocular cues such as texture variations, texture gradients, occlusion, known object sizes, haze, defocus, etc. [sent-35, score-0.651]

18 [4, 13, 14] For example, many objects’ texture will look different at different distances from the viewer. [sent-36, score-0.209]

19 1 Haze is another depth cue, and is caused by atmospheric light scattering. [sent-38, score-0.432]

20 Most of these monocular cues are “contextual information,” in the sense that they are global properties of an image and cannot be inferred from small image patches. [sent-39, score-0.542]

21 Although local information such as the texture and color of a patch can give some information about its depth, this is usually insufﬁcient to accurately determine its absolute depth. [sent-41, score-0.61]

22 For another example, if we take a patch of a clear blue sky, it is difﬁcult to tell if this patch is inﬁnitely far away (sky), or if it is part of a blue object. [sent-42, score-0.574]

23 3 Feature Vector In our approach, we divide the image into small patches, and estimate a single depth value for each patch. [sent-44, score-0.579]

24 We use two types of features: absolute depth features—used to estimate the absolute depth at a particular patch—and relative features, which we use to estimate relative depths (magnitude of the difference in depth between two patches). [sent-45, score-2.118]

25 We chose features that capture three types of local cues: texture variations, texture gradients, and haze. [sent-46, score-0.528]

26 Texture information is mostly contained within the image intensity channel,2 so we apply Laws’ masks [15, 4] to this channel to compute the texture energy (Fig. [sent-47, score-0.426]
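As an illustration of the nine Laws' masks mentioned above, the sketch below builds 3x3 masks as outer products of three 1-D kernels. That the nine masks are exactly the pairwise products of the standard L3, E3 and S3 kernels is an assumption here (the extracted text does not list them), and the function names are hypothetical.

```python
# Assumption: the nine masks are the pairwise outer products of the standard
# 1-D Laws kernels L3 (level), E3 (edge) and S3 (spot).
L3 = [1, 2, 1]
E3 = [-1, 0, 1]
S3 = [-1, 2, -1]


def outer(u, v):
    """Outer product of two 1-D kernels, giving a 2-D mask."""
    return [[a * b for b in v] for a in u]


def laws_masks_3x3():
    """Return a dict mapping names such as 'L3E3' to 3x3 masks."""
    kernels = {"L3": L3, "E3": E3, "S3": S3}
    return {
        row_name + col_name: outer(row, col)
        for row_name, row in kernels.items()
        for col_name, col in kernels.items()
    }


masks = laws_masks_3x3()
```

Each mask would then be convolved with the intensity channel to obtain one texture-energy response map per mask.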

27 The distant patches will have larger variations in the line orientations, and nearby patches will have smaller variations in line orientations. [sent-51, score-0.488]

28 Similarly, a grass ﬁeld when viewed at different distances will have different texture gradient distributions. [sent-52, score-0.245]

29 2 We represent each image in YCbCr color space, where Y is the intensity channel, and Cb and Cr are the color channels. [sent-53, score-0.289]
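A minimal sketch of the color-space step follows. The text does not say which YCbCr variant is used, so the full-range BT.601 (JFIF-style) weights below are an assumption, and `rgb_to_ycbcr` is a hypothetical helper name.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr.

    Assumption: BT.601 luma weights with the JFIF-style 128 offset on the
    two chroma channels; the source does not name a specific variant.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b            # intensity channel
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b   # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b   # red-difference chroma
    return y, cb, cr
```

Under this convention a neutral gray pixel maps to Cb = Cr = 128, so all of its information sits in the Y channel, which is the channel the texture filters are applied to.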

30 Figure 2: The absolute depth feature vector for a patch, which includes features from its immediate neighbors and its more distant neighbors (at larger scales). [sent-58, score-0.854]

31 The relative depth features for each patch use histograms of the ﬁlter outputs. [sent-59, score-0.862]

32 estimate of texture gradient that is robust to noise, we convolve the intensity channel with six oriented edge ﬁlters (shown in Fig. [sent-60, score-0.255]

33 1 Features for absolute depth Given some patch i in the image I(x, y), we compute summary statistics for it as follows. [sent-63, score-0.944]

34 We use the output of each of the 17 (9 Laws’ masks, 2 color channels and 6 texture gradients) ﬁlters Fn (x, y), n = 1, . [sent-64, score-0.247]

35 To estimate the absolute depth at a patch, local image features centered on the patch are insufﬁcient, and one has to use more global properties of the image. [sent-69, score-1.114]

36 We attempt to capture this information by using image features extracted at multiple scales (image resolutions). [sent-70, score-0.364]

37 ) Objects at different depths exhibit very different behaviors at different resolutions, and using multiscale features allows us to capture these variations [16]. [sent-73, score-0.814]

38 3 In addition to capturing more global information, computing features at multiple spatial scales also helps account for different relative sizes of objects. [sent-74, score-0.311]

39 occlusion relationships), the features used to predict the depth of a particular patch are computed from that patch as well as the four neighboring patches. [sent-80, score-1.196]

40 at a patch includes features of its immediate neighbors, and its far neighbors (at a larger scale), and its very far neighbors (at the largest scale), as shown in Fig. [sent-82, score-0.549]

41 Thus, we also add to the features of a patch additional summary features of the column it lies in. [sent-85, score-0.608]

42 For each patch, after including features from itself and its 4 neighbors at 3 scales, and summary features for its 4 column patches, our vector of features for estimating depth at a particular patch is 19 ∗ 34 = 646 dimensional. [sent-86, score-1.24]
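The dimension count above can be checked with a short sketch. The factor 34 is read here as 17 filters times two summary statistics per filter (sum of absolute responses and sum of squared responses); that reading is an assumption consistent with 19 ∗ 34 = 646, since the sentence defining the statistics is truncated in this extract. All names are hypothetical.

```python
NUM_FILTERS = 17        # 9 Laws' masks + 2 color channels + 6 texture-gradient filters
STATS_PER_FILTER = 2    # assumption: two summary statistics per filter
NUM_SCALES = 3
PATCHES_PER_SCALE = 5   # the patch itself plus its 4 neighbors
COLUMN_PATCHES = 4      # summary features for the column the patch lies in


def patch_statistics(filter_outputs):
    """Sum of |response| and of response**2 over a patch, for each filter.

    `filter_outputs` holds one flat list of filter responses per filter.
    """
    feats = []
    for responses in filter_outputs:
        feats.append(sum(abs(v) for v in responses))
        feats.append(sum(v * v for v in responses))
    return feats


features_per_patch = NUM_FILTERS * STATS_PER_FILTER
num_patch_blocks = NUM_SCALES * PATCHES_PER_SCALE + COLUMN_PATCHES
total_dim = num_patch_blocks * features_per_patch
```

With these constants, `num_patch_blocks` is 19 and `total_dim` is 646, matching the figure quoted in the text.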

43 2 Features for relative depth We use a different feature vector to learn the dependencies between two neighboring patches. [sent-88, score-0.607]

44 Speciﬁcally, we compute a histogram (with 10 bins) of each of the 17 ﬁlter outputs |I(x, y) ∗ Fn (x, y)|, giving us a total of 170 features yi for each patch i. [sent-89, score-0.396]

45 These features are used to estimate how the depths at two different locations are related. [sent-90, score-0.654]

46 Hence, we use as our relative depth features the differences between the histograms computed from two neighboring patches y ij = yi − yj . [sent-92, score-0.873]
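The relative depth features can be sketched as follows: a 10-bin histogram of the absolute filter responses for each of the 17 filters (170 values per patch), then the element-wise difference between two neighboring patches. The fixed histogram range `[lo, hi)` and the clamping of out-of-range values are assumptions made for the sketch, and the function names are hypothetical.

```python
def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Count values into equal-width bins; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)
        counts[idx] += 1
    return counts


def patch_histograms(filter_outputs, bins=10, lo=0.0, hi=1.0):
    """Concatenate one histogram of |response| per filter into y_i."""
    feats = []
    for responses in filter_outputs:
        feats.extend(histogram([abs(v) for v in responses], bins, lo, hi))
    return feats


def relative_features(outputs_i, outputs_j, bins=10, lo=0.0, hi=1.0):
    """y_ij = y_i - y_j for two neighboring patches i and j."""
    y_i = patch_histograms(outputs_i, bins, lo, hi)
    y_j = patch_histograms(outputs_j, bins, lo, hi)
    return [a - b for a, b in zip(y_i, y_j)]
```

With 17 filters and 10 bins, `relative_features` returns a 170-dimensional vector, in line with the count given above.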

47 4 The Probabilistic Model The depth of a particular patch depends on the features of the patch, but is also related to the depths of other parts of the image. [sent-93, score-1.316]

48 For example, the depths of two adjacent patches lying in the same building will be highly correlated. [sent-94, score-0.712]

49 We will use an MRF to model the relation between the depth of a patch and the depths of its neighboring patches. [sent-95, score-1.251]

50 In addition to the interactions with the immediately neighboring patches, there are sometimes also strong interactions between the depths of patches which are not immediate neighbors. [sent-96, score-0.854]

51 For example, consider the depths of patches that lie on a large building. [sent-97, score-0.684]

52 All of these patches will be at similar depths, even if there are small discontinuities (such as a window on the wall of a building). [sent-98, score-0.196]

53 However, when viewed at the smallest scale, some adjacent patches are difﬁcult to recognize as parts of the same object. [sent-99, score-0.224]

54 Thus, we will also model interactions between depths at multiple spatial scales. [sent-100, score-0.56]

55 To capture the multiscale depth relations, let us deﬁne di (s) as follows. [sent-102, score-0.656]

56 For each of three scales s = 1, 2, 3, define $d_i(s+1) = \frac{1}{5} \sum_{j \in N_s(i) \cup \{i\}} d_j(s)$. [sent-103, score-0.265]

57 Here, $N_s(i)$ are the 4 neighbors of patch i at scale s. [sent-104, score-0.355]

58 , the depth at a higher scale is constrained to be the average of the depths at lower scales. [sent-107, score-0.955]
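The averaging rule for the multiscale depths can be sketched on a 2-D grid of patch depths. Two simplifications are assumed here: the 4 neighbors are taken to be the immediately adjacent patches at every scale, and at the image border a missing neighbor is replaced by the nearest in-grid patch so that the divisor stays 5. Neither detail is stated in the extracted text.

```python
def next_scale(depths):
    """Compute d(s+1) from d(s): the mean of each patch and its 4 neighbors.

    Assumption: border patches reuse the nearest in-grid value for any
    neighbor that falls outside the grid.
    """
    rows, cols = len(depths), len(depths[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = depths[r][c]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr = min(max(r + dr, 0), rows - 1)
                cc = min(max(c + dc, 0), cols - 1)
                total += depths[rr][cc]
            out[r][c] = total / 5.0
    return out
```

Applying `next_scale` twice to the finest-scale depths gives the depths at the three scales s = 1, 2, 3 under these assumptions.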

59 In detail, we use different parameters $(\theta_r, \sigma_{1r}, \sigma_{2r})$ for each row in the image, because the images we consider are taken from a horizontally mounted camera, and thus different rows of the image have different statistical properties. [sent-109, score-0.238]

60 4 For example, given two adjacent patches of a distinctive, unique, color and texture, we may be able to safely conclude that they are part of the same object, and thus that their depths are close, even without more global features. [sent-111, score-0.818]

61 5 For example, a blue patch might represent sky if it is in upper part of image, and might be more likely to be water if in the lower part of the image. [sent-112, score-0.352]

62 The ﬁrst term in the exponent above models depth as a function of multiscale features of a single patch i. [sent-116, score-1.008]

63 The second term in the exponent places a soft “constraint” on the depths to be smooth. [sent-117, score-0.568]

64 If the variance term $\sigma_{2rs}^2$ is a fixed constant, the effect of this term is that it tends to smooth depth estimates across nearby patches. [sent-118, score-0.502]
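The model equation itself did not survive the sentence extraction. One form consistent with the two exponent terms described above is sketched below; it is a reconstruction, not a quotation. The symbols $x_i$ (absolute depth features of patch i), $M$ (number of patches) and $Z$ (normalization constant) are assumptions introduced for the sketch, while $\theta_r$, $\sigma_{1r}$, $\sigma_{2rs}$, $d_i(s)$ and $N_s(i)$ follow the surrounding text.

```latex
P(d \mid X; \theta, \sigma) = \frac{1}{Z} \exp\left(
  - \sum_{i=1}^{M} \frac{\left(d_i(1) - x_i^{\top} \theta_r\right)^2}{2 \sigma_{1r}^2}
  - \sum_{s=1}^{3} \sum_{i=1}^{M} \sum_{j \in N_s(i)}
      \frac{\left(d_i(s) - d_j(s)\right)^2}{2 \sigma_{2rs}^2}
\right)
```

The first sum ties each depth to a linear function of its own features; the second penalizes squared depth differences between neighbors at each scale, with a weaker penalty wherever $\sigma_{2rs}^2$ is large.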

65 However, in practice the dependencies between patches are not the same everywhere, and our expected value for $(d_i - d_j)^2$ may depend on the features of the local patches. [sent-119, score-0.49]

66 Therefore, to improve accuracy we extend the model to capture the “variance” term $\sigma_{2rs}^2$ in the denominator of the second term as a linear function of the patches i and j’s relative depth features $y_{ijs}$ (discussed in Section 3. [sent-120, score-0.977]

67 This helps determine which neighboring patches are likely to have similar depths. [sent-123, score-0.269]

68 , the “smoothing” effect is much stronger if neighboring patches are similar. [sent-126, score-0.269]

69 The parameters $u_{rs}$ are chosen to fit $\sigma_{2rs}^2$ to the expected value of $(d_i(s) - d_j(s))^2$, with a constraint that $u_{rs} \geq 0$ (to keep the estimated $\sigma_{2rs}^2$ non-negative). [sent-128, score-0.186]

70 This is motivated by the observation that in some cases, depth cannot be reliably estimated from the local features. [sent-132, score-0.432]

71 In this case, one has to rely more on neighboring patches’ depths to infer a patch’s depth (as modeled by the second term in the exponent). [sent-133, score-1.028]

72 After learning the parameters, given a new test-set image we can ﬁnd the MAP estimate of the depths by maximizing Eq. [sent-134, score-0.635]
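For a Gaussian model the negative log-posterior is quadratic in the depths, so the MAP estimate is the solution of a linear system. The sketch below illustrates this on a toy 1-D chain of patches, not on the paper's full multiscale model: `m[i]` stands in for the feature-based prediction of patch i, `var1` and `var2` play the roles of the two variance terms, and Gauss-Seidel sweeps are a solver chosen here for brevity. All of these are assumptions of the sketch.

```python
def map_chain(m, var1, var2, sweeps=200):
    """MAP depths for a 1-D Gaussian chain MRF via Gauss-Seidel sweeps.

    Minimizes  sum_i (d_i - m_i)^2 / (2*var1)
             + sum_i (d_i - d_{i+1})^2 / (2*var2),
    counting each neighboring pair once (a simplification of this sketch).
    """
    d = list(m)
    n = len(d)
    for _ in range(sweeps):
        for i in range(n):
            num = m[i] / var1
            den = 1.0 / var1
            for j in (i - 1, i + 1):
                if 0 <= j < n:
                    num += d[j] / var2
                    den += 1.0 / var2
            # Closed-form minimizer over d_i with its neighbors held fixed.
            d[i] = num / den
    return d
```

A small `var2` relative to `var1` makes the result smoother; a large `var2` leaves each depth close to its own prediction `m[i]`.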

73 First, a histogram of the relative depths (di − dj ) empirically appears Laplacian, which strongly suggests that it is better modeled as one. [sent-141, score-0.661]

74 Second, the Laplacian distribution has heavier tails, and is therefore more robust to outliers in the image features and error in the trainingset depthmaps (collected with a laser scanner; see Section 5. [sent-142, score-0.507]

75 Third, the Gaussian model was generally unable to give depthmaps with sharp edges; in contrast, Laplacians tend to model sharp transitions/outliers better. [sent-144, score-0.235]
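The robustness argument can be seen in a one-variable toy problem, which is an illustration only and not the paper's model. Given several noisy readings of one depth, the value minimizing the sum of squared deviations (the Gaussian, L2 case) is the mean, while the value minimizing the sum of absolute deviations (the Laplacian, L1 case) is the median. The readings below are made up for the example.

```python
import statistics


def l2_estimate(values):
    """Minimizer of sum (d - v)^2 over d: the mean."""
    return statistics.mean(values)


def l1_estimate(values):
    """Minimizer of sum |d - v| over d: the median."""
    return statistics.median(values)


# Hypothetical depth readings in meters, with one gross outlier.
readings = [10.0, 10.5, 9.5, 10.0, 80.0]
gaussian_fit = l2_estimate(readings)
laplacian_fit = l1_estimate(readings)
```

The single outlier pulls the L2 estimate far from the bulk of the readings, while the L1 estimate stays with the majority.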

76 Following the Gaussian model, we also learn the Laplacian spread parameters in the denominator in the same way, except that instead of estimating the expected value of $(d_i - d_j)^2$, we estimate the expected value of $|d_i - d_j|$. [sent-151, score-0.282]

77 6 The absolute depth features $x_{ir}$ are non-negative; thus, the estimated $\sigma_{1r}^2$ is also non-negative. [sent-152, score-0.675]

78 Even though maximum likelihood parameter estimation for $\theta_r$ is intractable in the Laplacian model, given a new test-set image, MAP inference for the depths d is tractable. [sent-153, score-0.488]

79 We can also extend these models to combine Gaussian and Laplacian terms in the exponent, for example by using a L2 norm term for absolute depth, and a L1 norm term for the interaction terms. [sent-156, score-0.211]

80 1 Data collection We used a 3-D laser scanner to collect images and their corresponding depthmaps. [sent-159, score-0.233]

81 We collected a total of 425 image+depthmap pairs, with an image resolution of 1704x2272 and a depthmap resolution of 86x107. [sent-161, score-0.263]

82 Due to noise in the motor system, the depthmaps were not perfectly aligned with the images, and had an alignment error of about 2 depth patches. [sent-163, score-0.617]

83 Also, the depthmaps had a maximum range of 81m (the maximum range of the laser scanner), and had minor additional errors due to reﬂections and missing laser scans. [sent-164, score-0.354]

84 Prior to running our learning algorithms, we transformed all the depths to a log scale so as to emphasize multiplicative rather than additive errors in training. [sent-165, score-0.562]

85 In our earlier experiments (not reported here), learning using linear depth values directly gave poor results. [sent-166, score-0.432]
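The effect of the log transform can be illustrated with a small error function. On a log scale, equal multiplicative errors give equal penalties regardless of distance, whereas on a linear scale the far-away point dominates. The base-10 logarithm matches the error scale reported for Table 1; the function name and the sample depths are hypothetical.

```python
import math


def mean_abs_log10_error(pred, truth):
    """Average |log10(pred) - log10(truth)| over paired depth values."""
    total = sum(abs(math.log10(p) - math.log10(t)) for p, t in zip(pred, truth))
    return total / len(truth)


# Both predictions are 20% too far: 12 m vs 10 m, and 96 m vs 80 m.
near_log_error = mean_abs_log10_error([12.0], [10.0])
far_log_error = mean_abs_log10_error([96.0], [80.0])
near_linear_error = abs(12.0 - 10.0)
far_linear_error = abs(96.0 - 80.0)
```

The two log-scale errors coincide (both equal log10 of 1.2), while the linear errors differ by a factor of eight.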

86 We see that using multiscale and column features signiﬁcantly improves the algorithm’s performance. [sent-172, score-0.282]

87 Empirically, we also observed that the Laplacian model does indeed give depthmaps with signiﬁcantly sharper boundaries (as in our discussion in Section 4. [sent-174, score-0.185]

88 Informally, our algorithm appears to predict the relative depths of objects quite well (i. [sent-183, score-0.586]

89 , their relative distances to the camera), but seems to make more errors in absolute depths. [sent-185, score-0.212]

90 For example, the training set images and depthmaps are slightly misaligned, and therefore the edges in the learned depthmap are not very sharp. [sent-187, score-0.419]

91 Further, the maximum value of the depths in the training set is 81m; therefore, far-away objects are all mapped to the one distance of 81m. [sent-188, score-0.527]

92 Our algorithm appears to incur the largest errors on images which contain very irregular trees, in which most of the 3-D structure in the image is dominated by the shapes of the leaves and branches. [sent-189, score-0.273]

93 6 Conclusions We have presented a discriminatively trained MRF model for depth estimation from single monocular images. [sent-191, score-0.699]

94 Our model uses monocular cues at multiple spatial scales, and also Figure 3: Results for a varied set of environments, showing original image (column 1), ground truth depthmap (column 2), predicted depthmap by Gaussian model (column 3), predicted depthmap by Laplacian model (column 4). [sent-192, score-0.886]

95 (Best viewed in color) Table 1: Effect of multiscale and column features on accuracy. [sent-193, score-0.282]

96 The average absolute errors (RMS errors gave similar results) are on a log scale (base 10). [sent-194, score-0.218]

97 We demonstrated that our algorithm gives good 3-D depth estimation performance on a variety of images. [sent-233, score-0.432]

98 Performance analysis of stereo, vergence, and focus as depth cues for active vision. [sent-249, score-0.49]

99 High speed obstacle avoidance using monocular vision and reinforcement learning. [sent-256, score-0.25]

100 New algorithms from reconstruction of a 3-d depth map from one or more images. [sent-274, score-0.512]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('depths', 0.488), ('depth', 0.432), ('patch', 0.258), ('monocular', 0.212), ('patches', 0.196), ('depthmaps', 0.185), ('texture', 0.175), ('depthmap', 0.144), ('features', 0.138), ('aussian', 0.123), ('image', 0.119), ('dj', 0.114), ('laplacian', 0.112), ('absolute', 0.105), ('multiscale', 0.1), ('mrf', 0.1), ('images', 0.09), ('di', 0.084), ('scanner', 0.078), ('neighboring', 0.073), ('color', 0.072), ('laws', 0.068), ('scales', 0.067), ('sky', 0.065), ('buildings', 0.065), ('laser', 0.065), ('neighbors', 0.062), ('forests', 0.062), ('haze', 0.062), ('saxena', 0.062), ('indoor', 0.061), ('cues', 0.058), ('trees', 0.056), ('masks', 0.053), ('ns', 0.052), ('reconstruction', 0.051), ('variations', 0.048), ('insuf', 0.047), ('vr', 0.045), ('exponent', 0.045), ('column', 0.044), ('environments', 0.042), ('dependencies', 0.042), ('defocus', 0.041), ('gini', 0.041), ('michels', 0.041), ('nagai', 0.041), ('yijs', 0.041), ('unstructured', 0.041), ('capture', 0.04), ('errors', 0.039), ('outdoor', 0.039), ('objects', 0.039), ('gradients', 0.038), ('vision', 0.038), ('spatial', 0.038), ('occlusion', 0.037), ('interaction', 0.036), ('grass', 0.036), ('urs', 0.036), ('scale', 0.035), ('term', 0.035), ('relative', 0.034), ('interactions', 0.034), ('camera', 0.034), ('distances', 0.034), ('global', 0.034), ('forest', 0.033), ('bushes', 0.033), ('kumar', 0.033), ('gaussian', 0.032), ('object', 0.031), ('campus', 0.03), ('mrfs', 0.03), ('laplacians', 0.03), ('discriminatively', 0.03), ('fn', 0.03), ('summary', 0.03), ('immediate', 0.029), ('map', 0.029), ('blue', 0.029), ('mounted', 0.029), ('xr', 0.029), ('lastly', 0.029), ('resolutions', 0.029), ('adjacent', 0.028), ('estimate', 0.028), ('scenes', 0.027), ('energy', 0.027), ('stereo', 0.027), ('ground', 0.027), ('feature', 0.026), ('intensity', 0.026), ('channel', 0.026), ('denominator', 0.026), ('contextual', 0.026), ('collecting', 0.026), ('trained', 0.025), ('appears', 0.025), ('sharp', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 110 nips-2005-Learning Depth from Single Monocular Images

Author: Ashutosh Saxena, Sung H. Chung, Andrew Y. Ng

2 0.35666099 23 nips-2005-An Application of Markov Random Fields to Range Sensing

Author: James Diebel, Sebastian Thrun

Abstract: This paper describes a highly successful application of MRFs to the problem of generating high-resolution range images. A new generation of range sensors combines the capture of low-resolution range images with the acquisition of registered high-resolution camera images. The MRF in this paper exploits the fact that discontinuities in range and coloring tend to co-align. This enables it to generate high-resolution, low-noise range images by integrating regular camera images into the range data. We show that by using such an MRF, we can substantially improve over existing range imaging technology.

3 0.1746687 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

Author: Tai-sing Lee, Brian R. Potetz

Abstract: This paper explores the statistical relationship between natural images and their underlying range (depth) images. We look at how this relationship changes over scale, and how this information can be used to enhance low resolution range data using a full resolution intensity image. Based on our findings, we propose an extension to an existing technique known as shape recipes [3], and the success of the two methods is compared using images and laser scans of real scenes. Our extension is shown to provide a two-fold improvement over the current method. Furthermore, we demonstrate that ideal linear shape-from-shading filters, when learned from natural scenes, may derive even more strength from shadow cues than from the traditional linear-Lambertian shading cues.

4 0.10655526 108 nips-2005-Layered Dynamic Textures

Author: Antoni B. Chan, Nuno Vasconcelos

Abstract: A dynamic texture is a video model that treats a video as a sample from a spatio-temporal stochastic process, speciﬁcally a linear dynamical system. One problem associated with the dynamic texture is that it cannot model video where there are multiple regions of distinct motion. In this work, we introduce the layered dynamic texture model, which addresses this problem. We also introduce a variant of the model, and present the EM algorithm for learning each of the models. Finally, we demonstrate the efﬁcacy of the proposed model for the tasks of segmentation and synthesis of video.

5 0.090681918 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning

Author: Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, Yann L. Cun

Abstract: We describe a vision-based obstacle avoidance system for off-road mobile robots. The system is trained from end to end to map raw input images to steering angles. It is trained in supervised mode to predict the steering angles provided by a human driver during training runs collected in a wide variety of terrains, weather conditions, lighting conditions, and obstacle types. The robot is a 50cm off-road truck, with two forward-pointing wireless color cameras. A remote computer processes the video and controls the robot via radio. The learning system is a large 6-layer convolutional network whose input is a single left/right pair of unprocessed low-resolution images. The robot exhibits an excellent ability to detect obstacles and navigate around them in real time at speeds of 2 m/s.

6 0.08939448 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection

7 0.081822827 169 nips-2005-Saliency Based on Information Maximization

8 0.0797318 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?

9 0.077847548 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

10 0.072772875 45 nips-2005-Conditional Visual Tracking in Kernel Space

11 0.071338654 109 nips-2005-Learning Cue-Invariant Visual Responses

12 0.06948676 151 nips-2005-Pattern Recognition from One Example by Chopping

13 0.068843767 127 nips-2005-Mixture Modeling by Affinity Propagation

14 0.068184875 104 nips-2005-Laplacian Score for Feature Selection

15 0.067613445 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

16 0.062281668 122 nips-2005-Logic and MRF Circuitry for Labeling Occluding and Thinline Visual Contours

17 0.057660375 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

18 0.055108607 98 nips-2005-Infinite latent feature models and the Indian buffet process

19 0.05464866 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

20 0.05403946 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.186), (1, 0.007), (2, -0.011), (3, 0.252), (4, -0.095), (5, 0.074), (6, -0.054), (7, 0.14), (8, 0.174), (9, -0.194), (10, 0.123), (11, 0.181), (12, 0.055), (13, -0.036), (14, 0.131), (15, -0.104), (16, 0.207), (17, 0.04), (18, -0.101), (19, -0.146), (20, 0.027), (21, -0.028), (22, 0.054), (23, -0.012), (24, -0.017), (25, -0.096), (26, -0.021), (27, 0.034), (28, -0.135), (29, 0.162), (30, 0.083), (31, -0.026), (32, 0.013), (33, -0.118), (34, -0.084), (35, -0.039), (36, -0.056), (37, -0.014), (38, 0.036), (39, 0.024), (40, -0.028), (41, 0.037), (42, 0.016), (43, -0.059), (44, 0.007), (45, -0.023), (46, -0.008), (47, 0.048), (48, 0.013), (49, -0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95450729 110 nips-2005-Learning Depth from Single Monocular Images

Author: Ashutosh Saxena, Sung H. Chung, Andrew Y. Ng

2 0.94016397 23 nips-2005-An Application of Markov Random Fields to Range Sensing

Author: James Diebel, Sebastian Thrun

3 0.82149023 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

Author: Tai-sing Lee, Brian R. Potetz

4 0.57211763 108 nips-2005-Layered Dynamic Textures

Author: Antoni B. Chan, Nuno Vasconcelos

5 0.55080432 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning

Author: Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, Yann L. Cun

6 0.36766243 169 nips-2005-Saliency Based on Information Maximization

7 0.35568303 122 nips-2005-Logic and MRF Circuitry for Labeling Occluding and Thinline Visual Contours

8 0.31659034 151 nips-2005-Pattern Recognition from One Example by Chopping

9 0.30077139 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

10 0.27702191 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

11 0.27210614 109 nips-2005-Learning Cue-Invariant Visual Responses

12 0.27013844 45 nips-2005-Conditional Visual Tracking in Kernel Space

13 0.25996035 158 nips-2005-Products of ``Edge-perts

14 0.25838274 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

15 0.25726038 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?

16 0.25326279 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection

17 0.25310865 131 nips-2005-Multiple Instance Boosting for Object Detection

18 0.25203234 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails

19 0.24937372 127 nips-2005-Mixture Modeling by Affinity Propagation

20 0.24656217 79 nips-2005-Fusion of Similarity Data in Clustering

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.034), (10, 0.425), (27, 0.042), (31, 0.064), (34, 0.074), (39, 0.022), (55, 0.021), (57, 0.01), (69, 0.043), (73, 0.045), (83, 0.014), (88, 0.083), (91, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98321897 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression

Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong

Abstract: In this paper, we aim at analyzing the characteristics of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation of the membrane potential density function with synaptic depression, and obtained the formulas for analytically computing the response of the instantaneous firing rate. Through a technical analysis, we arrive at several significant conclusions: the background inputs play an important role in information processing and act as a switch between temporal integration and coincidence detection; the role of synapses can be regarded as a spatio-temporal filter; it is important in neural information processing for the spatial distribution of synapses and the spatial and temporal relation of inputs; and the instantaneous input frequency can affect the response amplitude and phase delay.

2 0.95721519 75 nips-2005-Fixing two weaknesses of the Spectral Method

Author: Kevin Lang

Abstract: We discuss two intrinsic weaknesses of the spectral graph partitioning method, both of which have practical consequences. The first is that spectral embeddings tend to hide the best cuts from the commonly used hyperplane rounding method. Rather than cleaning up the resulting suboptimal cuts with local search, we recommend the adoption of flow-based rounding. The second weakness is that for many “power law” graphs, the spectral method produces cuts that are highly unbalanced, thus decreasing the usefulness of the method for visualization (see figure 4(b)) or as a basis for divide-and-conquer algorithms. These balance problems, which occur even though the spectral method’s quotient-style objective function does encourage balance, can be fixed with a stricter balance constraint that turns the spectral mathematical program into an SDP that can be solved for million-node graphs by a method of Burer and Monteiro.

1 Background

Graph partitioning is the NP-hard problem of finding a small graph cut subject to the constraint that neither side of the resulting partitioning of the nodes is “too small”. We will be dealing with several versions: the graph bisection problem, which requires perfect $\frac{1}{2} : \frac{1}{2}$ balance; the β-balanced cut problem (with β a fraction such as $\frac{1}{3}$), which requires at least β : (1 − β) balance; and the quotient cut problem, which requires the small side to be large enough to “pay for” the edges in the cut. The quotient cut metric is c / min(a, b), where c is the cutsize and a and b are the sizes of the two sides of the cut. All of the well-known variants of the quotient cut metric (e.g. normalized cut [15]) have similar behavior with respect to the issues discussed in this paper. The spectral method for graph partitioning was introduced in 1973 by Fiedler and Donath & Hoffman [6].
In the mid-1980’s Alon & Milman [1] proved that spectral cuts can be at worst quadratically bad; in the mid 1990’s Guattery & Miller [10] proved that this analysis is tight by exhibiting a family of n-node graphs whose spectral bisections cut $O(n^{2/3})$ edges versus the optimal $O(n^{1/3})$ edges. On the other hand, Spielman & Teng [16] have proved stronger performance guarantees for the special case of spacelike graphs. The spectral method can be derived by relaxing a quadratic integer program which encodes the graph bisection problem (see section 3.1). The solution to this relaxation is the “Fiedler vector”, or second smallest eigenvector of the graph’s discrete Laplacian matrix, whose elements $x_i$ can be interpreted as an embedding of the graph on the line. To obtain a specific cut, one must apply a “rounding method” to this embedding. The hyperplane rounding method chooses one of the n − 1 cuts which separate the nodes whose $x_i$ values lie above and below some split value $\hat{x}$.

[Figure 1: The spectral embedding hides the best solution from hyperplane rounding. Panels: (A) Graph with nearly balanced 8-cut; (B) Spectral Embedding; (C) Notional Flow-based Embedding.]

2 Using flow to find cuts that are hidden from hyperplane rounding

Theorists have long known that the spectral method cannot distinguish between deep cuts and long paths, and that this confusion can cause it to cut a graph in the wrong direction thereby producing the spectral method’s worst-case behavior [10]. In this section we will show by example that even when the spectral method is not fooled into cutting in the wrong direction, the resulting embedding can hide the best cuts from the hyperplane rounding method. This is a possible explanation for the frequently made empirical observation (see e.g. [12]) that hyperplane roundings of spectral embeddings are noisy and therefore benefit from cleanup with a local search method such as Fiduccia-Matheyses [8].
Consider the graph in ﬁgure 1(a), which has a near-bisection cutting 8 edges. For this graph the spectral method produces the embedding shown in ﬁgure 1(b), and recommends that we make a vertical cut (across the horizontal dimension which is based on the Fiedler vector). This is correct in a generalized sense, but it is obvious that no hyperplane (or vertical line in this picture) can possibly extract the optimal 8-edge cut. Some insight into why spectral embeddings tend to have this problem can be obtained from the spectral method’s electrical interpretation. In this view the graph is represented by a resistor network [7]. Current ﬂowing in this network causes voltage drops across the resistors, thus determining the nodes’ voltages and hence their positions. When current ﬂows through a long series of resistors, it induces a progressive voltage drop. This is what causes the excessive length of the embeddings of the horizontal girder-like structures which are blocking all vertical hyperplane cuts in ﬁgure 1(b). If the embedding method were somehow not based on current, but rather on ﬂow, which does not distinguish between a pipe and a series of pipes, then the long girders could retract into the two sides of the embedding, as suggested by ﬁgure 1(c), and the best cut would be revealed. Because theoretical ﬂow-like embedding methods such as [14] are currently not practical, we point out that in cases like ﬁgure 1(b), where the spectral method has not chosen an incorrect direction for the cut, one can use an S-T max ﬂow problem with the ﬂow running in the recommended direction (horizontally for this embedding) to extract the good cut even though it is hidden from all hyperplanes. We currently use two different ﬂow-based rounding methods. A method called MQI looks for quotient cuts, and is already described in [13]. Another method, that we shall call Midﬂow, looks for β-balanced cuts. The input to Midﬂow is a graph and an ordering of its nodes (obtained e.g. 
from a spectral embedding or from the projection of any embedding onto a line). We divide the graph’s nodes into 3 sets F, L, and U. The sets F and L respectively contain the first βn and last βn nodes in the ordering, and U contains the remaining U = n − 2βn nodes, which are “up for grabs”. We set up an S-T max flow problem with one node for every graph node plus 2 new nodes for the source and sink. For each graph edge there are two arcs, one in each direction, with unit capacity. Finally, the nodes in F are pinned to the source and the nodes in L are pinned to the sink by infinite capacity arcs. This max-flow problem can be solved by a good implementation of the push-relabel algorithm (such as Goldberg and Cherkassky’s hi_pr [4]) in time that empirically is nearly linear with a very good constant factor. Figure 6 shows that solving a Midflow problem with hi_pr can be 1000 times cheaper than finding a spectral embedding with ARPACK. When the goal is finding good β-balanced cuts, Midflow rounding is strictly more powerful than hyperplane rounding; from a given node ordering hyperplane rounding chooses the best of U + 1 candidate cuts, while Midflow rounding chooses the best of 2^U candidates, including all of those considered by hyperplane rounding.

Figure 2: A typical example (see section 2.1) where flow-based rounding beats hyperplane rounding, even when the hyperplane cuts are improved with Fiduccia-Mattheyses search. Note that for this spacelike graph, the best quotient cuts have reasonably good balance. [Plot: quotient cut score (cutsize / size of small side) against number of nodes on the ’left’ side of the cut (out of 324800). Labelled items: hyperplane roundings of the Fiedler vector, 50-50 balance, neg-pos split, best hyperplane rounding of the Fiedler vector, best improvement with local search, Midflow rounding of the Fiedler vector with beta = 1/4 and beta = 1/3. Annotated scores: 0.00268, 0.00232, 0.00138, 0.00145.]
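The Midflow construction above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it solves the max-flow problem with a plain Edmonds-Karp (BFS augmenting path) routine instead of push-relabel or hi_pr, the name `midflow` and the edge-list input are hypothetical, βn is rounded to the nearest integer, and the "infinite" capacity is replaced by a finite value larger than any possible cut.

```python
from collections import deque

def midflow(n, edges, order, beta):
    """Return (cutsize, source_side_nodes) for the Midflow S-T problem."""
    m = round(beta * n)                    # nodes pinned at each end (assumes 1 <= m <= n/2)
    S, T = n, n + 1                        # two extra nodes: source and sink
    INF = len(edges) + 1                   # exceeds any cut of unit arcs, so acts as infinite
    cap = [dict() for _ in range(n + 2)]   # cap[u][v] = residual capacity of arc u -> v

    def add_arc(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)

    for u, v in edges:                     # two unit-capacity arcs per graph edge
        add_arc(u, v, 1)
        add_arc(v, u, 1)
    for u in order[:m]:                    # F: first beta*n nodes, pinned to the source
        add_arc(S, u, INF)
    for u in order[n - m:]:                # L: last beta*n nodes, pinned to the sink
        add_arc(u, T, INF)

    flow = 0
    while True:
        parent = {S: None}                 # BFS over arcs with residual capacity
        queue = deque([S])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if T not in parent:
            break                          # no augmenting path: flow is maximum
        path, v = [], T
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push

    # Nodes still reachable from the source form one side of a minimum cut.
    return flow, {u for u in parent if u < n}
```

The returned side always contains F and never contains any node of L, so the cut respects the β : (1 − β) balance requirement by construction; the nodes of U fall on whichever side the minimum cut puts them.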
[Similarly, MQI rounding is strictly more powerful than hyperplane rounding for the task of finding good quotient cuts.]

2.1 A concrete example

The plot in figure 2 shows a number of cuts in a 324,800 node nearly planar graph derived from a 700x464 pixel downward-looking view of some clouds over some mountains (see footnote 1). The y-axis of the plot is quotient cut score; smaller values are better. We note in passing that the commonly used split point x̂ = 0 does not yield the best hyperplane cut. Our main point is that the two cuts generated by Midflow rounding of the Fiedler vector (with β = 1/3 and β = 1/4) are nearly twice as good as the best hyperplane cut. Even after the best hyperplane cut has been improved by taking the best result of 100 runs of a version of Fiduccia-Mattheyses local search, it is still much worse than the cuts obtained by flow-based rounding.

Footnote 1: The graph’s edges are unweighted but are chosen by a randomized rule which is more likely to include an edge between two neighboring pixels if they have a similar grey value. Good cuts in the graph tend to run along discontinuities in the image, as one would expect.

[Figure residue: plot of quotient cut score (smaller is better) with an SDP-LB series; the caption is truncated in the source: “Scatter plot showing cuts in a”]

3 0.94062233 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

Author: Benjamin Van Roy

Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective.

1 Preliminaries

Consider a discrete-time communicating Markov decision process (MDP) with a finite state space $S = \{1, \ldots, |S|\}$. At each state $x \in S$, there is a finite set $U_x$ of admissible actions. If the current state is $x$ and an action $u \in U_x$ is selected, a cost of $g_u(x)$ is incurred, and the system transitions to a state $y \in S$ with probability $p_{xy}(u)$. For any $x \in S$ and $u \in U_x$, $\sum_{y \in S} p_{xy}(u) = 1$. Costs are discounted at a rate of $\alpha \in (0, 1)$ per period. Each instance of such an MDP is defined by a quintuple $(S, U, g, p, \alpha)$. A (stationary deterministic) policy is a mapping $\mu$ that assigns an action $u \in U_x$ to each state $x \in S$. If actions are selected based on a policy $\mu$, the state follows a Markov process with transition matrix $P_\mu$, where each $(x, y)$th entry is equal to $p_{xy}(\mu(x))$. The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. Each policy $\mu$ is associated with a cost-to-go function $J_\mu \in \Re^{|S|}$, defined by
$$J_\mu = \sum_{t=0}^{\infty} \alpha^t P_\mu^t g_\mu = (I - \alpha P_\mu)^{-1} g_\mu,$$
where, with some abuse of notation, $g_\mu(x) = g_{\mu(x)}(x)$ for each $x \in S$. A policy $\mu$ is said to be greedy with respect to a function $J$ if
$$\mu(x) \in \arg\min_{u \in U_x} \Big( g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) J(y) \Big) \quad \text{for all } x \in S.$$
The optimal cost-to-go function $J^* \in \Re^{|S|}$ is defined by $J^*(x) = \min_\mu J_\mu(x)$, for all $x \in S$. A policy $\mu^*$ is said to be optimal if $J_{\mu^*} = J^*$.
It is well known that an optimal policy exists. Further, a policy $\mu^*$ is optimal if and only if it is greedy with respect to $J^*$. Hence, given the optimal cost-to-go function, optimal actions can be computed by minimizing the right-hand side of the above inclusion. Value iteration generates a sequence $J_\ell$ converging to $J^*$ according to $J_{\ell+1} = T J_\ell$, where $T$ is the dynamic programming operator, defined by
$$(TJ)(x) = \min_{u \in U_x} \Big( g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) J(y) \Big),$$
for all $x \in S$ and $J \in \Re^{|S|}$. This sequence converges to $J^*$ for any initialization of $J_0$.

2 Approximate Value Iteration

The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space $S$ into a manageable number $K$ of disjoint subsets $S_1, \ldots, S_K$ and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation: all states within a given partition are assumed to share a common optimal cost-to-go. To represent an approximation, we define a matrix $\Phi \in \Re^{|S| \times K}$ such that each $k$th column is an indicator function for the $k$th partition $S_k$. Hence, for any $r \in \Re^K$, $k$, and $x \in S_k$, $(\Phi r)(x) = r_k$. In this paper, we study variations of value iteration, each of which computes a vector $r$ so that $\Phi r$ approximates $J^*$. The use of a policy $\mu_r$ that is greedy with respect to $\Phi r$ is justified by the following result (see [10] for a proof):

Theorem 1 If $\mu$ is a greedy policy with respect to a function $\tilde{J} \in \Re^{|S|}$ then
$$\|J_\mu - J^*\|_\infty \le \frac{2\alpha}{1 - \alpha} \|J^* - \tilde{J}\|_\infty.$$

One common way of approximating a function $J \in \Re^{|S|}$ with a function of the form $\Phi r$ involves projection with respect to a weighted Euclidean norm $\|\cdot\|_{2,\pi}$. The weighted Euclidean norm is $\|J\|_{2,\pi} = \big( \sum_{x \in S} \pi(x) J^2(x) \big)^{1/2}$. Here, $\pi \in \Re_+^{|S|}$ is a vector of weights that assign relative emphasis among states.
The projection $\Pi_\pi J$ is the function $\Phi r$ that attains the minimum of $\|J - \Phi r\|_{2,\pi}$; if there are multiple functions $\Phi r$ that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm $\|\Phi r\|_{2,\pi}$. Note that in our context, where each $k$th column of $\Phi$ represents an indicator function for the $k$th partition, for any $\pi$, $J$, and $x \in S_k$,
$$(\Pi_\pi J)(x) = \frac{\sum_{y \in S_k} \pi(y) J(y)}{\sum_{y \in S_k} \pi(y)}.$$
Approximate value iteration begins with a function $\Phi r^{(0)}$ and generates a sequence according to $\Phi r^{(\ell+1)} = \Pi_\pi T \Phi r^{(\ell)}$. It is well known that the dynamic programming operator $T$ is a contraction mapping with respect to the maximum norm. Further, $\Pi_\pi$ is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general $\Phi$, but is true in our context in which columns of $\Phi$ are indicator functions for partitions.) It follows that the composition $\Pi_\pi T$ is a contraction mapping. By the contraction mapping theorem, $\Pi_\pi T$ has a unique fixed point $\Phi\tilde{r}$, which is the limit of the sequence $\Phi r^{(\ell)}$. Further, the following result holds:

Theorem 2 For any MDP, partition, and weights $\pi$ with support intersecting every partition, if $\Phi\tilde{r} = \Pi_\pi T \Phi\tilde{r}$ then
$$\|\Phi\tilde{r} - J^*\|_\infty \le \frac{2}{1 - \alpha} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty$$
and
$$(1 - \alpha) \|J_{\mu_{\tilde{r}}} - J^*\|_\infty \le \frac{4\alpha}{1 - \alpha} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty.$$

The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1. Note that $J_{\mu_{\tilde{r}}}(x) \ge J^*(x)$ for all $x$, so the left-hand side of the performance loss bound is the maximal increase in cost-to-go, normalized by $1 - \alpha$. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients $1, \alpha, \alpha^2, \ldots$, which sum to $1/(1 - \alpha)$.
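The iteration $\Phi r^{(\ell+1)} = \Pi_\pi T \Phi r^{(\ell)}$ for state aggregation can be sketched as below. This is an illustrative sketch, not code from the paper: the data layout (nested lists `g[x][u]` and `p[x][u][y]`, a list `part` mapping each state to its partition index) and all function names are assumptions made for the example.

```python
def bellman(J, g, p, alpha):
    """(T J)(x) = min_u ( g[x][u] + alpha * sum_y p[x][u][y] * J[y] )."""
    return [min(g[x][u] + alpha * sum(pxy * Jy for pxy, Jy in zip(p[x][u], J))
                for u in range(len(g[x])))
            for x in range(len(g))]

def project(J, part, pi, K):
    """Weighted average of J over each partition S_k, i.e. the vector r with Phi r = Pi_pi J."""
    num, den = [0.0] * K, [0.0] * K
    for x, k in enumerate(part):
        num[k] += pi[x] * J[x]
        den[k] += pi[x]
    return [num[k] / den[k] for k in range(K)]   # assumes pi has support in every S_k

def approximate_value_iteration(g, p, alpha, part, pi, K, iters=500):
    """Iterate r <- Pi_pi T Phi r from r = 0 for a fixed number of sweeps."""
    r = [0.0] * K
    for _ in range(iters):
        Phi_r = [r[k] for k in part]             # (Phi r)(x) = r_k for x in S_k
        r = project(bellman(Phi_r, g, p, alpha), part, pi, K)
    return r
```

Since $\Pi_\pi T$ is a maximum-norm contraction with modulus $\alpha$, the error after $\ell$ sweeps shrinks like $\alpha^\ell$, so a fixed sweep count is adequate for small examples; a tolerance-based stopping rule would be the more usual choice in practice.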
Our motivation of the normalizing constant raises the question of whether, for fixed MDP parameters $(S, U, g, p)$ and fixed $\Phi$, $\min_r \|J^* - \Phi r\|_\infty$ also grows with $1/(1 - \alpha)$. It turns out that $\min_r \|J^* - \Phi r\|_\infty = O(1)$. To see why, note that for any $\mu$,
$$J_\mu = (I - \alpha P_\mu)^{-1} g_\mu = \frac{1}{1 - \alpha} \lambda_\mu + h_\mu,$$
where $\lambda_\mu(x)$ is the expected average cost if the process starts in state $x$ and is controlled by policy $\mu$,
$$\lambda_\mu = \lim_{\tau \to \infty} \frac{1}{\tau} \sum_{t=0}^{\tau - 1} P_\mu^t g_\mu,$$
and $h_\mu$ is the discounted differential cost function $h_\mu = (I - \alpha P_\mu)^{-1} (g_\mu - \lambda_\mu)$. Both $\lambda_\mu$ and $h_\mu$ converge to finite vectors as $\alpha$ approaches 1 [3]. For an optimal policy $\mu^*$, $\lim_{\alpha \uparrow 1} \lambda_{\mu^*}(x)$ does not depend on $x$ (in our context of a communicating MDP). Since constant functions lie in the range of $\Phi$,
$$\lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty \le \lim_{\alpha \uparrow 1} \|h_{\mu^*}\|_\infty < \infty.$$
The performance loss bound still exhibits an undesirable dependence on $\alpha$ through the coefficient $4\alpha/(1 - \alpha)$. In most relevant contexts, $\alpha$ is close to 1; a representative value might be 0.99. Consequently, $4\alpha/(1 - \alpha)$ can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by $\mathbf{1}$ the vector with every component equal to 1.

Theorem 3 For any $\delta > 0$, $\alpha \in (0, 1)$, and $\Delta \ge 0$, there exist MDP parameters $(S, U, g, p)$ and a partition such that $\min_{r \in \Re^K} \|J^* - \Phi r\|_\infty = \Delta$ and, if $\Phi\tilde{r} = \Pi_\pi T \Phi\tilde{r}$ with $\pi = \mathbf{1}$,
$$(1 - \alpha) \|J_{\mu_{\tilde{r}}} - J^*\|_\infty \ge \frac{4\alpha}{1 - \alpha} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty - \delta.$$

This theorem is established through an example in [22]. The choice of uniform weights ($\pi = \mathbf{1}$) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp. Based on Theorems 2 and 3, one might expect that there exist MDP parameters $(S, U, g, p)$ and a partition such that, with $\pi = \mathbf{1}$,
$$(1 - \alpha) \|J_{\mu_{\tilde{r}}} - J^*\|_\infty = \Theta\Big( \frac{1}{1 - \alpha} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty \Big).$$
In other words, that the performance loss is both lower and upper bounded by $1/(1 - \alpha)$ times the smallest possible approximation error. It turns out that this is not true, at least if we restrict to a finite state space.
However, as the following theorem establishes, the coefficient multiplying $\min_{r \in \Re^K} \|J^* - \Phi r\|_\infty$ can grow arbitrarily large as $\alpha$ increases, keeping all else fixed.

Theorem 4 For any $L$ and $\Delta \ge 0$, there exist MDP parameters $(S, U, g, p)$ and a partition such that $\lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty = \Delta$ and, if $\Phi\tilde{r} = \Pi_\pi T \Phi\tilde{r}$ with $\pi = \mathbf{1}$,
$$\liminf_{\alpha \uparrow 1} (1 - \alpha) \big( J_{\mu_{\tilde{r}}}(x) - J^*(x) \big) \ge L \lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty,$$
for all $x \in S$.

This theorem is also established through an example [22]. For any $\mu$ and $x$,
$$\lim_{\alpha \uparrow 1} \big( (1 - \alpha) J_\mu(x) - \lambda_\mu(x) \big) = \lim_{\alpha \uparrow 1} (1 - \alpha) h_\mu(x) = 0.$$
Combined with Theorem 4, this yields the following corollary.

Corollary 1 For any $L$ and $\Delta \ge 0$, there exist MDP parameters $(S, U, g, p)$ and a partition such that $\lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty = \Delta$ and, if $\Phi\tilde{r} = \Pi_\pi T \Phi\tilde{r}$ with $\pi = \mathbf{1}$,
$$\liminf_{\alpha \uparrow 1} \big( \lambda_{\mu_{\tilde{r}}}(x) - \lambda_{\mu^*}(x) \big) \ge L \lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty,$$
for all $x \in S$.

3 Using the Invariant Distribution

In the previous section, we considered an approximation $\Phi\tilde{r}$ that solves $\Pi_\pi T \Phi\tilde{r} = \Phi\tilde{r}$ for some arbitrary pre-selected weights $\pi$. We now turn to consider use of an invariant state distribution $\pi_{\tilde{r}}$ of $P_{\mu_{\tilde{r}}}$ as the weight vector (see footnote 1). This leads to a circular definition: the weights are used in defining $\tilde{r}$ and now we are defining the weights in terms of $\tilde{r}$. What we are really after here is a vector $\tilde{r}$ that satisfies $\Pi_{\pi_{\tilde{r}}} T \Phi\tilde{r} = \Phi\tilde{r}$. The following theorem captures the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full length version of this paper [22].)

Theorem 5 For any MDP and partition, if $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi\tilde{r}$ and $\pi_{\tilde{r}}$ has support intersecting every partition,
$$(1 - \alpha) \pi_{\tilde{r}}^T (J_{\mu_{\tilde{r}}} - J^*) \le 2\alpha \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty.$$

When $\alpha$ is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is in the omission of a factor of $1 - \alpha$ from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions.
A relation can be based on the fact that for all $\mu$, $\lim_{\alpha \uparrow 1} \|(1 - \alpha) J_\mu - \lambda_\mu\|_\infty = 0$, as explained in Section 2. In particular, based on this, we have
$$\lim_{\alpha \uparrow 1} (1 - \alpha) \|J_\mu - J^*\|_\infty = |\lambda_\mu - \lambda^*| = \lambda_\mu - \lambda^* = \lim_{\alpha \uparrow 1} (1 - \alpha) \pi^T (J_\mu - J^*),$$
for all policies $\mu$ and probability distributions $\pi$. Hence, the left-hand-side expressions from the two performance bounds become directly comparable as $\alpha$ approaches 1. Another interesting comparison can be made by contrasting Corollary 1 against the following immediate consequence of Theorem 5.

Corollary 2 For all MDP parameters $(S, U, g, p)$ and partitions, if $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi\tilde{r}$ and $\liminf_{\alpha \uparrow 1} \sum_{x \in S_k} \pi_{\tilde{r}}(x) > 0$ for all $k$,
$$\limsup_{\alpha \uparrow 1} \|\lambda_{\mu_{\tilde{r}}} - \lambda_{\mu^*}\|_\infty \le 2 \lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty.$$

The comparison suggests that solving $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi\tilde{r}$ is strongly preferable to solving $\Phi\tilde{r} = \Pi_\pi T \Phi\tilde{r}$ with $\pi = \mathbf{1}$.

Footnote 1: By an invariant state distribution of a transition matrix $P$, we mean any probability distribution $\pi$ such that $\pi^T P = \pi^T$. In the event that $P_{\mu_{\tilde{r}}}$ has multiple invariant distributions, $\pi_{\tilde{r}}$ denotes an arbitrary choice.

4 Exploration

If a vector $\tilde{r}$ solves $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi\tilde{r}$ and the support of $\pi_{\tilde{r}}$ intersects every partition, Theorem 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we will address in this section. First, in some cases, the equation $\Pi_{\pi_{\tilde{r}}} T \Phi\tilde{r} = \Phi\tilde{r}$ does not have a solution. It is easy to produce examples of this; though no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of $\pi_{\tilde{r}}$ intersect every partition. To address these shortcomings, we introduce stochastic policies. A stochastic policy $\mu$ maps state-action pairs to probabilities. For each $x \in S$ and $u \in U_x$, $\mu(x, u)$ is the probability of taking action $u$ when in state $x$.
Hence, $\mu(x, u) \ge 0$ for all $x \in S$ and $u \in U_x$, and $\sum_{u \in U_x} \mu(x, u) = 1$ for all $x \in S$. Given a scalar $\epsilon > 0$ and a function $J$, the $\epsilon$-greedy Boltzmann exploration policy with respect to $J$ is defined by
$$\mu(x, u) = \frac{e^{-(T_u J)(x)(|U_x| - 1)/(\epsilon e)}}{\sum_{u' \in U_x} e^{-(T_{u'} J)(x)(|U_x| - 1)/(\epsilon e)}}.$$
For any $\epsilon > 0$ and $r$, let $\mu_r^\epsilon$ denote the $\epsilon$-greedy Boltzmann exploration policy with respect to $\Phi r$. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration:
$$(T^\epsilon J)(x) = \frac{\sum_{u \in U_x} e^{-(T_u J)(x)(|U_x| - 1)/(\epsilon e)} (T_u J)(x)}{\sum_{u \in U_x} e^{-(T_u J)(x)(|U_x| - 1)/(\epsilon e)}}.$$
As $\epsilon$ approaches 0, $\epsilon$-greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. More precisely, for all $r$, $x$, and $J$, $\lim_{\epsilon \downarrow 0} \mu_r^\epsilon(x, \mu_r(x)) = 1$ and $\lim_{\epsilon \downarrow 0} T^\epsilon J = TJ$. These are immediate consequences of the following result (see [4] for a proof).

Lemma 1 For any $n$, $v \in \Re^n$,
$$\min_i v_i + \epsilon \ge \frac{\sum_i e^{-v_i (n - 1)/(\epsilon e)} v_i}{\sum_i e^{-v_i (n - 1)/(\epsilon e)}} \ge \min_i v_i.$$

Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each $\epsilon$-greedy Boltzmann exploration policy $\mu_r^\epsilon$ and the support of this distribution is $S$. Let $\pi_r^\epsilon$ denote this distribution. We consider a vector $\tilde{r}$ that solves $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi\tilde{r}$. For any $\epsilon > 0$, there exists a solution to this equation (this is an immediate extension of Theorem 5.1 from [4]). We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].)

Theorem 6 For any MDP, partition, and $\epsilon > 0$, if $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi\tilde{r}$ then
$$(1 - \alpha) (\pi_{\tilde{r}}^\epsilon)^T (J_{\mu_{\tilde{r}}^\epsilon} - J^*) \le 2\alpha \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty + \epsilon.$$

5 Computation: TD(0)

Though computation is not a focus of this paper, we offer a brief discussion here.
First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve $\Phi\tilde{r} = \Pi_\pi T \Phi\tilde{r}$. It requires an ability to sample a sequence of states $x^{(0)}, x^{(1)}, x^{(2)}, \ldots$, each independent and identically distributed according to $\pi$. Also required is a way to efficiently compute $(T \Phi r)(x) = \min_{u \in U_x} \big( g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) (\Phi r)(y) \big)$, for any given $x$ and $r$. This is typically possible when the action set $U_x$ and the support of $p_{x\cdot}(u)$ (i.e., the set of states that can follow $x$ if action $u$ is selected) are not too large. The algorithm generates a sequence of vectors $r^{(\ell)}$ according to
$$r^{(\ell+1)} = r^{(\ell)} + \gamma_\ell \, \phi(x^{(\ell)}) \Big( (T \Phi r^{(\ell)})(x^{(\ell)}) - (\Phi r^{(\ell)})(x^{(\ell)}) \Big),$$
where $\gamma_\ell$ is a step size and $\phi(x)$ denotes the column vector made up of components from the $x$th row of $\Phi$. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, $r^{(\ell)}$ converges to a vector $\tilde{r}$ that solves $\Phi\tilde{r} = \Pi_\pi T \Phi\tilde{r}$.

The equation $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}} T \Phi\tilde{r}$ may have no solution. Further, the requirement that states are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi\tilde{r}$. The algorithm requires simulation of a trajectory $x_0, x_1, x_2, \ldots$ of the MDP, with each action $u_t \in U_{x_t}$ generated by the $\epsilon$-greedy Boltzmann exploration policy with respect to $\Phi r^{(t)}$. The sequence of vectors $r^{(t)}$ is generated according to
$$r^{(t+1)} = r^{(t)} + \gamma_t \, \phi(x_t) \Big( (T^\epsilon \Phi r^{(t)})(x_t) - (\Phi r^{(t)})(x_t) \Big).$$
Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi\tilde{r}$. Whether such an algorithm converges and whether there are other algorithms that can effectively solve $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} T^\epsilon \Phi\tilde{r}$ for broad classes of relevant problems remain open issues.
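The trajectory-based variant above can be sketched as follows, as an illustration only. Several things here are assumptions rather than details from the paper: the data layout (`g[x][u]`, `p[x][u][y]`, `part[x]`), the function names, the start state 0, the per-partition step size $\gamma_t = 1/(\text{visit count})$, and the reading of $(T_u J)(x)$ as $g_u(x) + \alpha \sum_y p_{xy}(u) J(y)$, which the text uses without stating. Because $\phi(x_t)$ is the indicator of the partition containing $x_t$, each update changes a single component of $r$.

```python
import math
import random

def q_values(r, x, g, p, alpha, part):
    """Assumed form of (T_u Phi r)(x): g[x][u] + alpha * sum_y p[x][u][y] * r[part[y]]."""
    return [g[x][u] + alpha * sum(pxy * r[part[y]] for y, pxy in enumerate(p[x][u]))
            for u in range(len(g[x]))]

def boltzmann(q, eps):
    """eps-greedy Boltzmann probabilities with weights exp(-q_u (n - 1) / (eps e))."""
    n = len(q)
    if n == 1:
        return [1.0]
    qmin = min(q)                                   # shift for numerical stability only
    w = [math.exp(-(qu - qmin) * (n - 1) / (eps * math.e)) for qu in q]
    total = sum(w)
    return [wu / total for wu in w]

def td0(g, p, alpha, part, K, eps, steps, seed=0):
    """Simulate one trajectory and apply the TD(0) update to r at every step."""
    rng = random.Random(seed)
    r = [0.0] * K
    visits = [0] * K
    x = 0                                           # arbitrary start state (assumption)
    for _ in range(steps):
        q = q_values(r, x, g, p, alpha, part)
        mu = boltzmann(q, eps)
        target = sum(m * qu for m, qu in zip(mu, q))   # (T^eps Phi r)(x)
        k = part[x]
        visits[k] += 1
        gamma = 1.0 / visits[k]                     # hypothetical step-size schedule
        r[k] += gamma * (target - r[k])             # phi(x) selects component k
        u = rng.choices(range(len(q)), weights=mu)[0]
        x = rng.choices(range(len(part)), weights=p[x][u])[0]
    return r
```

The sketch makes no convergence claim; as the text notes, whether such an iteration converges is an open issue, so any output should be treated as a single run under the assumed step sizes.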
6 Extensions and Open Issues

Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound. This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix Φ), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14]. The temporal-difference learning algorithm presented in Section 5 is a version of TD(0). This is a special case of TD(λ), which is parameterized by λ ∈ [0, 1]. It is not known whether the results of this paper can be extended to the general case of λ ∈ [0, 1]. Prior research has suggested that larger values of λ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of 1/(1 − α) as λ is changed from 1 to 0. The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error. Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection.
Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of $1/(1 - \alpha)$ can similarly be eliminated from these bounds is an open issue. Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes
$$\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} (T^\epsilon \Phi\tilde{r} - \tilde{\lambda} \mathbf{1}).$$
The variables include an estimate $\tilde{\lambda} \in \Re$ of the minimal average cost $\lambda^* \in \Re$ and an approximation $\Phi\tilde{r}$ of the optimal differential cost function $h^*$. The discount factor $\alpha$ is set to 1 in computing an $\epsilon$-greedy Boltzmann exploration policy as well as $T^\epsilon$. There is an average-cost version of temporal-difference learning for which any limit of convergence $(\tilde{\lambda}, \tilde{r})$ satisfies this equation [19, 20, 21]. Generalization of Theorem 2 does not lead to a useful result because the right-hand side of the bound becomes infinite as $\alpha$ approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective:

Theorem 7 For any communicating MDP with an average-cost objective, partition, and $\epsilon > 0$, if $\Phi\tilde{r} = \Pi_{\pi_{\tilde{r}}^\epsilon} (T^\epsilon \Phi\tilde{r} - \tilde{\lambda} \mathbf{1})$ then
$$\lambda_{\mu_{\tilde{r}}^\epsilon} - \lambda^* \le 2 \min_{r \in \Re^K} \|h^* - \Phi r\|_\infty + \epsilon.$$

Here, $\lambda_{\mu_{\tilde{r}}^\epsilon} \in \Re$ denotes the average cost under policy $\mu_{\tilde{r}}^\epsilon$, which is well-defined because the process is irreducible under an $\epsilon$-greedy Boltzmann exploration policy. This theorem can be proved by taking limits on the left and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is $\lambda_{\mu_{\tilde{r}}^\epsilon} - \lambda^*$. The limit of $\min_{r \in \Re^K} \|J^* - \Phi r\|_\infty$ on the right-hand side is $\min_{r \in \Re^K} \|h^* - \Phi r\|_\infty$. (This follows from the analysis of [3].)

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant ECS-9985229 and by the Office of Naval Research under Grant MURI N00014-001-0637.
The author’s understanding of the topic benefited from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full length version of this paper has been submitted to Mathematics of Operations Research and has benefited from a number of useful comments and suggestions made by reviewers.

References

[1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7:270–279, 1994.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726, 1962.
[4] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000.
[5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
[7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995.
[8] G. J. Gordon. Stable function approximation in dynamic programming. In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA, 1995.
[9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.
[10] S. P. Singh and R. C. Yee. An upper-bound on the loss from approximate optimal-value functions. Machine Learning, 1994.
[11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, Amherst, MA, 1984.
[12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
[13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, 1995.
[14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, Cambridge, MA, 1996. MIT Press.
[15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
[16] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
[18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press.
[19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control, 1997.
[20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999.
[21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2-3):179–191, 2002.
[22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Under review with Mathematics of Operations Research, available at www.stanford.edu/ bvr/psfiles/aggregation.pdf, 2005.
[23] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989.
[24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

4 0.93611771 156 nips-2005-Prediction and Change Detection

Author: Mark Steyvers, Scott Brown

Abstract: We measure the ability of human observers to predict the next datum in a sequence that is generated by a simple statistical process undergoing change at random points in time. Accurate performance in this task requires the identification of changepoints. We assess individual differences between observers both empirically, and using two kinds of models: a Bayesian approach for change detection and a family of cognitively plausible fast and frugal models. Some individuals detect too many changes and hence perform sub-optimally due to excess variability. Other individuals do not detect enough changes, and perform sub-optimally because they fail to notice short-term temporal trends.

1 Introduction

Decision-making often requires a rapid response to change. For example, stock analysts need to quickly detect changes in the market in order to adjust investment strategies. Coaches need to track changes in a player’s performance in order to adjust strategy. When tracking changes, there are costs involved when either more or fewer changes are observed than actually occurred. For example, when using an overly conservative change detection criterion, a stock analyst might miss important short-term trends and interpret them as random fluctuations instead. On the other hand, a change may also be detected too readily. For example, in basketball, a player who makes a series of consecutive baskets is often identified as a “hot hand” player whose underlying ability is perceived to have suddenly increased [1,2]. This might lead to sub-optimal passing strategies, based on random fluctuations. We are interested in explaining individual differences in a sequential prediction task. Observers are shown stimuli generated from a simple statistical process with the task of predicting the next datum in the sequence. The latent parameters of the statistical process change discretely at random points in time.
Performance in this task depends on the accurate detection of those changepoints, as well as inference about future outcomes based on the outcomes that followed the most recent inferred changepoint. There is much prior research in statistics on the problem of identifying changepoints [3,4,5]. In this paper, we adopt a Bayesian approach to the changepoint identification problem and develop a simple inference procedure to predict the next datum in a sequence. The Bayesian model serves as an ideal observer model and is useful to characterize the ways in which individuals deviate from optimality. The plan of the paper is as follows. We first introduce the sequential prediction task and discuss a Bayesian analysis of this prediction problem. We then discuss the results from a few individuals in this prediction task and show how the Bayesian approach can capture individual differences with a single “twitchiness” parameter that describes how readily changes are perceived in random sequences. We will show that some individuals are too twitchy: their performance is too variable because they base their predictions on too little of the recent data. Other individuals are not twitchy enough, and they fail to capture fast changes in the data. We also show how behavior can be explained with a set of fast and frugal models [6]. These are cognitively realistic models that operate under plausible computational constraints.

2 A prediction task with multiple change points

In the prediction task, stimuli are presented sequentially and the task is to predict the next stimulus in the sequence. After t trials, the observer has been presented with stimuli y_1, y_2, …, y_t and the task is to make a prediction about y_{t+1}. After the prediction is made, the actual outcome y_{t+1} is revealed and the next trial proceeds to the prediction of y_{t+2}. This procedure starts with y_1 and is repeated for T trials.
The observations $y_t$ are D-dimensional vectors with elements sampled from binomial distributions. The parameters of those distributions change discretely at random points in time such that the mean increases or decreases after a change point. This generates a sequence of observation vectors, $y_1, y_2, \ldots, y_T$, where each $y_t = \{y_{t,1}, \ldots, y_{t,D}\}$. Each of the $y_{t,d}$ is sampled from a binomial distribution $\mathrm{Bin}(\theta_{t,d}, K)$, so $0 \le y_{t,d} \le K$. The parameter vector $\theta_t = \{\theta_{t,1}, \ldots, \theta_{t,D}\}$ changes depending on the locations of the changepoints. At each time step, $x_t$ is a binary indicator of whether a changepoint occurs at time t+1. The parameter α determines the probability of a change occurring in the sequence. The generative model is specified by the following algorithm:

1. For d=1..D, sample $\theta_{1,d}$ from a Uniform(0,1) distribution.
2. For t=2..T,
   (a) Sample $x_{t-1}$ from a Bernoulli(α) distribution.
   (b) If $x_{t-1}=0$, then $\theta_t=\theta_{t-1}$; else for d=1..D, sample $\theta_{t,d}$ from a Uniform(0,1) distribution.
   (c) For d=1..D, sample $y_{t,d}$ from a $\mathrm{Bin}(\theta_{t,d}, K)$ distribution.

Table 1 shows some data generated from the changepoint model with T=20, α=.1, and D=1. In the prediction task, y will be observed, but x and θ are not.

Table 1: Example data

 t   x   θ    y
 1   0   .68  9
 2   0   .68  7
 3   0   .68  8
 4   1   .68  7
 5   0   .48  4
 6   0   .48  4
 7   1   .48  4
 8   0   .74  9
 9   0   .74  8
10   0   .74  3
11   0   .74  6
12   0   .74  7
13   1   .74  8
14   0   .19  2
15   1   .19  1
16   0   .87  8
17   0   .87  9
18   0   .87  9
19   0   .87  8
20   0   .87  8

3 A Bayesian prediction model

In both our Bayesian and fast-and-frugal analyses, the prediction task is decomposed into two inference procedures. First, the changepoint locations are identified. This is followed by predictive inference for the next outcome based on the most recent changepoint locations. Several Bayesian approaches have been developed for changepoint problems involving single or multiple changepoints [3,5].
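As a hedged illustration, the generative algorithm of Section 2 can be sketched in Python with NumPy. The function name `sample_changepoint_sequence`, the 0-based indexing, and the choice to draw the first observation from the first parameter vector (the listed algorithm only draws observations for t=2..T) are assumptions of this sketch, not details given in the text.

```python
import numpy as np

def sample_changepoint_sequence(T, alpha, D, K, rng):
    """Sample (x, theta, y) from the changepoint model (0-based time index)."""
    x = np.zeros(T, dtype=int)        # x[t] = 1 means theta changes at step t + 1
    theta = np.zeros((T, D))          # latent binomial parameters
    y = np.zeros((T, D), dtype=int)   # observed counts, each in 0..K

    # Step 1: initial parameters are uniform on (0, 1).
    theta[0] = rng.uniform(0.0, 1.0, size=D)
    # Assumption: the first observation is drawn from theta[0].
    y[0] = rng.binomial(K, theta[0])

    # Step 2: walk forward, resampling theta whenever a change is drawn.
    for t in range(1, T):
        x[t - 1] = rng.binomial(1, alpha)
        if x[t - 1] == 1:
            theta[t] = rng.uniform(0.0, 1.0, size=D)
        else:
            theta[t] = theta[t - 1]
        y[t] = rng.binomial(K, theta[t])

    # x[T - 1] is never sampled by the algorithm and stays 0 here.
    return x, theta, y

rng = np.random.default_rng(0)
x, theta, y = sample_changepoint_sequence(T=20, alpha=0.1, D=1, K=10, rng=rng)
```

With T=20, α=0.1, and D=1 the output has the same layout as Table 1, although the sampled values depend on the random seed and on K, which is set to 10 here only for illustration.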
We apply a Markov Chain Monte Carlo (MCMC) analysis to approximate the joint posterior distribution over changepoint assignments x while integrating out θ. Gibbs sampling will be used to sample from this posterior marginal distribution. The samples can then be used to predict the next outcome in the sequence.

3.1 Inference for changepoint assignments

To apply Gibbs sampling, we evaluate the conditional probability of assigning a changepoint at time i, given all other changepoint assignments and the current α value. By integrating out θ, the conditional probability is

$$P(x_i \mid x_{-i}, y, \alpha) = \int_{\theta} P(x_i, \theta, \alpha \mid x_{-i}, y) \qquad (1)$$

where $x_{-i}$ represents all switch point assignments except $x_i$. This can be simplified by considering the location of the most recent changepoint preceding and following time i and the outcomes occurring between these locations. Let $n_i^L$ be the number of time steps from the last changepoint up to and including the current time step i, such that $x_{i-n_i^L} = 1$ and $x_{i-n_i^L+j} = 0$ for $0 < j < n_i^L$. Similarly, let $n_i^R$ be the number of time steps that follow time step i up to the next changepoint, such that $x_{i+n_i^R} = 1$ and $x_{i+n_i^R-j} = 0$ for $0 < j < n_i^R$. Let $y_i^L = \sum_{i-n_i^L < k \le i} y_k$ and $y_i^R = \sum_{i < k \le i+n_i^R} y_k$. The update equation for the changepoint assignment can then be simplified to

$$P(x_i = m \mid x_{-i}) \propto \begin{cases} (1-\alpha) \prod_{j=1}^{D} \dfrac{\Gamma(1 + y^{L}_{i,j} + y^{R}_{i,j})\, \Gamma(1 + K n_i^{L} + K n_i^{R} - y^{L}_{i,j} - y^{R}_{i,j})}{\Gamma(2 + K n_i^{L} + K n_i^{R})} & m = 0 \\[2ex] \alpha \prod_{j=1}^{D} \dfrac{\Gamma(1 + y^{L}_{i,j})\, \Gamma(1 + K n_i^{L} - y^{L}_{i,j})}{\Gamma(2 + K n_i^{L})} \cdot \dfrac{\Gamma(1 + y^{R}_{i,j})\, \Gamma(1 + K n_i^{R} - y^{R}_{i,j})}{\Gamma(2 + K n_i^{R})} & m = 1 \end{cases} \qquad (2)$$

We initialize the Gibbs sampler by sampling each $x_t$ from a Bernoulli(α) distribution. All changepoint assignments are then updated sequentially by the Gibbs sampling equation above. The sampler is run for M iterations after which one set of changepoint assignments is saved.
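The conditional in Equation (2) can be evaluated stably in log space. The following is a minimal sketch, assuming the segment sums and segment lengths to the left and right of time i have already been computed. The function names are hypothetical, and the handling of the sequence boundaries (which the text does not spell out) is left to the caller.

```python
import math

def log_segment_term(y_sum, n_steps, K):
    """log of Gamma(1 + y) * Gamma(1 + K*n - y) / Gamma(2 + K*n) for one dimension."""
    return (math.lgamma(1 + y_sum)
            + math.lgamma(1 + K * n_steps - y_sum)
            - math.lgamma(2 + K * n_steps))

def prob_changepoint(y_left, y_right, n_left, n_right, K, alpha):
    """Normalized P(x_i = 1 | x_-i) from the two cases of Equation (2).

    y_left, y_right: per-dimension outcome sums in the left and right segments.
    n_left, n_right: number of time steps in the left and right segments.
    """
    log_no_change = math.log(1.0 - alpha)   # m = 0: one merged segment
    log_change = math.log(alpha)            # m = 1: two separate segments
    for yl, yr in zip(y_left, y_right):
        log_no_change += log_segment_term(yl + yr, n_left + n_right, K)
        log_change += (log_segment_term(yl, n_left, K)
                       + log_segment_term(yr, n_right, K))
    # Subtract the larger log weight before exponentiating to avoid underflow.
    top = max(log_no_change, log_change)
    w0 = math.exp(log_no_change - top)
    w1 = math.exp(log_change - top)
    return w1 / (w0 + w1)

# Segments with very different means favor a changepoint; similar means do not.
p_different = prob_changepoint([0], [50], 5, 5, K=10, alpha=0.1)
p_similar = prob_changepoint([25], [25], 5, 5, K=10, alpha=0.1)
```

A Gibbs sweep would draw each assignment from a Bernoulli distribution with this probability, recomputing the segment sums as assignments change.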
The Gibbs sampler is then restarted multiple times until S samples have been collected. Although we could have included an update equation for α, in this analysis we treat α as a known constant. This will be useful when characterizing the differences between human observers in terms of differences in α.

3.2 Predictive inference

The next latent parameter value $\theta_{t+1}$ and outcome $y_{t+1}$ can be predicted on the basis of observed outcomes that occurred after the last inferred changepoint:

$$\theta_{t+1,j} = \sum_{i=t^*+1}^{t} y_{i,j} / K, \qquad y_{t+1,j} = \mathrm{round}(\theta_{t+1,j} K) \qquad (3)$$

where $t^*$ is the location of the most recent change point. By considering multiple Gibbs samples, we get a distribution over outcomes $y_{t+1}$. We base the model predictions on the mean of this distribution.

3.3 Illustration of model performance

Figure 1 illustrates the performance of the model on a one dimensional sequence (D=1) generated from the changepoint model with T=160, α=0.05, and K=10. The Gibbs sampler was run for M=30 iterations and S=200 samples were collected. The top panel shows the actual changepoints (triangles) and the distribution of changepoint assignments averaged over samples. The bottom panel shows the observed data y (thin lines) as well as the θ values in the generative model (rescaled between 0 and 10). At locations with large changes between observations, the marginal changepoint probability is quite high. At other locations, the true change in the mean is very small, and the model is less likely to put in a changepoint. The lower right panel shows the distribution over predicted $\theta_{t+1}$ values.

Figure 1. Results of model simulation. (Panels: $x_t$, $y_t$, and the distribution over $\theta_{t+1}$.)

4 Prediction experiment

We tested the performance of 9 human observers in the prediction task. The observers included the authors, a visitor, and one student who were aware of the statistical nature of the task as well as naïve students.
The observers were seated in front of an LCD touch screen displaying a two-dimensional grid of 11 x 11 buttons. The changepoint model was used to generate a sequence of T=1500 stimuli for two binomial variables $y_1$ and $y_2$ (D=2, K=10). The change probability α was set to 0.1. The two variables $y_1$ and $y_2$ specified the two-dimensional button location. The same sequence was used for all observers. On each trial, the observer touched a button on the grid displayed on the touch screen. Following each button press, the button corresponding to the next $\{y_1, y_2\}$ outcome in the sequence was highlighted. Observers were instructed to press the button that best predicted the next location of the highlighted button. The 1500 trials were divided into three blocks of 500 trials. Breaks were allowed between blocks. The whole experiment lasted between 15 and 30 minutes.

Figure 2 shows the first 50 trials from the third block of the experiment. The top and bottom panels show the actual outcomes for the $y_1$ and $y_2$ button grid coordinates as well as the predictions for two observers (SB and MY). The figure shows that at trial 15, the $y_1$ and $y_2$ coordinates show a large shift, followed by an immediate shift in observer MY’s predictions (on trial 16). Observer SB waits until trial 17 to make a shift.

Figure 2. Trial by trial predictions from two observers. (Legend: outcomes, SB predictions, MY predictions; horizontal axis: Trial.)

4.1 Task error

We assessed prediction performance by comparing the prediction with the actual outcome in the sequence. Task error was measured by normalized city-block distance

$$\text{task error} = \frac{1}{T-1} \sum_{t=2}^{T} \left( |y_{t,1} - y^{O}_{t,1}| + |y_{t,2} - y^{O}_{t,2}| \right) \qquad (4)$$

where $y^O$ represents the observer’s prediction. Note that the very first trial is excluded from this calculation.
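The task error in Equation (4) amounts to a mean city-block distance over trials 2..T. A small sketch follows, assuming outcomes and predictions are stored as equal-length lists of coordinate pairs in which entry t of the prediction list is the guess made for outcome t. The function name and the toy data are hypothetical.

```python
def city_block_error(outcomes, predictions):
    """Mean city-block distance between outcomes and predictions, skipping trial 1."""
    T = len(outcomes)
    total = 0
    for t in range(1, T):  # index 0 is the first trial, which is excluded
        total += sum(abs(a - b) for a, b in zip(outcomes[t], predictions[t]))
    return total / (T - 1)

# Toy data: three trials on the button grid.
outcomes = [(0, 0), (5, 5), (3, 4)]
predictions = [(9, 9), (4, 5), (3, 6)]
error = city_block_error(outcomes, predictions)
```

The same function covers the model error of Section 5 if the model's predictions are passed in place of the outcomes, since both measures share the same normalized city-block form.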
Even though more suitable probabilistic measures for prediction error could have been adopted, we wanted to allow comparison of observers’ performance with both probabilistic and non-probabilistic models. Task error ranged from 2.8 (for participant MY) to 3.3 (for ML). We also assessed the performance of five models; their task errors ranged from 2.78 to 3.20. The Bayesian models (Section 3) had the lowest task errors, just below 2.8. This fits with our definition of the Bayesian models as “ideal observer” models: their task error is lower than any other model’s and any human observer’s task error. The fast and frugal models (Section 5) had task errors ranging from 2.85 to 3.20.

5 Modeling Results

We will refer to the models with the following letter codes: B=Bayesian model, LB=limited Bayesian model, FF1..3=fast and frugal models 1..3. We assessed model fit by comparing the model’s prediction against the human observers’ predictions, again using a normalized city-block distance

$$\text{model error} = \frac{1}{T-1} \sum_{t=2}^{T} \left( |y^{M}_{t,1} - y^{O}_{t,1}| + |y^{M}_{t,2} - y^{O}_{t,2}| \right) \qquad (5)$$

where $y^M$ represents the model’s prediction. The model error for each individual observer is shown in Figure 3. It is important to note that because each model is associated with a set of free parameters, the parameters optimized for task error and model error are different. For Figure 3, the parameters were optimized to minimize Equation (5) for each individual observer, showing the extent to which these models can capture the performance of individual observers, not necessarily providing the best task performance.

Figure 3. Model error for each individual observer (see footnote 1). (Vertical axis: Model Error; legend: B, LB, FF1, FF2, FF3; observers: MY, MS, MM, EJ, PH, NP, DN, SB, ML.)

5.1 Bayesian prediction models

At each trial t, the model was provided with the sequence of all previous outcomes. The Gibbs sampling and inference procedures from Eq. (2) and (3) were applied with M=30 iterations and S=200 samples.
The change probability α was a free parameter. In the full Bayesian model, the whole sequence of observations up to the current trial is available for prediction, leading to a memory requirement of up to T=1500 trials, a psychologically unreasonable assumption. We therefore also simulated a limited Bayesian model (LB) where the observed sequence was truncated to the last 10 outcomes. The LB model showed almost no decrement in task performance compared to the full Bayesian model. Figure 3 also shows that it fit human data quite well.

5.2 Individual Differences

The right-hand panel of Figure 4 plots each observer’s task error as a function of the mean city-block distance between their successive button presses. This shows a clear U-shaped function. Observers with very variable predictions (e.g., ML and DN) had large average changes between successive button pushes, and also had large task error: These observers were too “twitchy”. Observers with very small average button changes (e.g., SB and NP) were not twitchy enough, and also had large task error. Observers in the middle had the lowest task error (e.g., MS and MY). The left-hand panel of Figure 4 shows the same data, but with the x-axis based on the Bayesian model fits. Instead of using mean button change distance to index twitchiness (as in the right-hand panel), the left-hand panel uses the estimated α parameters from the Bayesian model. A similar U-shaped pattern is observed: individuals with too large or too small α estimates have large task errors.

Figure 4. Task error vs. “twitchiness”. Left-hand panel indexes twitchiness using estimated α parameters from Bayesian model fits (horizontal axis: α, logarithmic scale). Right-hand panel uses mean distance between successive predictions (horizontal axis: Mean Button Change). Vertical axis in both panels: Task Error.

Footnote 1: Error bars indicate bootstrapped 95% confidence intervals.
5.3 Fast-and-Frugal (FF) prediction models

These models perform the prediction task using simple heuristics that are cognitively plausible. The FF models keep a short memory of previous stimulus values and make predictions using the same two-step process as the Bayesian model. First, a decision is made as to whether the latent parameter θ has changed. Second, remembered stimulus values that occurred after the most recently detected changepoint are used to generate the next prediction.

A simple heuristic is used to detect changepoints: If the distance between the most recent observation and prediction is greater than some threshold amount, a change is inferred. We defined the distance between a prediction (p) and an observation (y) as the difference between the log-likelihoods of y assuming θ=p and θ=y. Thus, if $f_B(\cdot \mid \theta, K)$ is the binomial density with parameters θ and K, the distance between observation y and prediction p is defined as $d(y,p) = \log f_B(y \mid y, K) - \log f_B(y \mid p, K)$. A changepoint on time step t+1 is inferred whenever $d(y_t, p_t) > C$. The parameter C governs the twitchiness of the model predictions. If C is large, only very dramatic changepoints will be detected, and the model will be too conservative. If C is small, the model will be too twitchy, and will detect changepoints on the basis of small random fluctuations.

Predictions are based on the most recent M observations, which are kept in memory, unless a changepoint has been detected, in which case only those observations occurring after the changepoint are used for prediction. The prediction for time step t+1 is simply the mean of these observations, say p. Human observers were reluctant to make predictions very close to the boundaries. This was modeled by allowing the FF model to change its prediction for the next time step, $y_{t+1}$, towards the mean prediction (0.5). This change reflects a two-way bet.
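The changepoint heuristic described above can be sketched as follows. The text does not say on which scale θ=y and θ=p are evaluated, so this sketch assumes that the observation y is a count in 0..K, that "θ=y" means θ = y/K, and that the prediction p is a proportion in [0, 1]. The function names and the threshold value C=2.0 are hypothetical.

```python
import math

def xlogy(x, y):
    """x * log(y), using the convention 0 * log(0) = 0."""
    if x == 0:
        return 0.0
    if y == 0:
        return -math.inf
    return x * math.log(y)

def binom_loglik(y, theta, K):
    """Log of the binomial density f_B(y | theta, K)."""
    return math.log(math.comb(K, y)) + xlogy(y, theta) + xlogy(K - y, 1.0 - theta)

def ff_distance(y, p, K):
    """d(y, p): log-likelihood at theta = y/K minus log-likelihood at theta = p."""
    return binom_loglik(y, y / K, K) - binom_loglik(y, p, K)

def ff_detects_change(y, p, K, C):
    """Infer a changepoint when the distance exceeds the threshold C."""
    return ff_distance(y, p, K) > C

# An observation far from the prediction triggers a change; a nearby one does not.
far = ff_detects_change(9, 0.5, 10, C=2.0)
near = ff_detects_change(6, 0.5, 10, C=2.0)
```

Because θ = y/K maximizes the binomial likelihood of y, the distance is never negative under these assumptions, and the binomial coefficient cancels in the difference.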
If the probability of a change occurring is α, the best guess will be 0.5 if that change occurs, or the mean p if the change does not occur. Thus, the prediction made is actually $y_{t+1} = \tfrac{1}{2}\alpha + (1-\alpha)p$. Note that we do not allow perfect knowledge of the probability of a changepoint, α. Instead, an estimated value of α is used based on the number of changepoints detected in the data series up to time t.

The FF model nests two simpler FF models that are psychologically interesting. If the twitchiness threshold parameter C becomes arbitrarily large, the model never detects a change and instead becomes a continuous running average model. Predictions from this model are simply a boxcar smooth of the data. Alternatively, if we assume no memory, the model must base each prediction on only the previous stimulus (i.e., M=1). Above, in Figure 3, we labeled the complete FF model as FF1, the boxcar model as FF2 and the memoryless model as FF3.

Figure 3 showed that the complete FF model (FF1) fit the data from all observers significantly better than either the boxcar model (FF2) or the memoryless model (FF3). Exceptions were observers PH, DN and ML, for whom all three FF models fit equally well. This result suggests that our observers were (mostly) doing more than just keeping a running average of the data, or using only the most recent observation. The FF1 model fit the data about as well as the Bayesian models for all observers except MY and MS. Note that, in general, the FF1 and Bayesian model fits are very good: the average city block distance between the human data and the model prediction is around 0.75 (out of 10) buttons on both the x- and y-axes.

6 Conclusion

We used an online prediction task to study changepoint detection. Human observers had to predict the next observation in stochastic sequences containing random changepoints.
We showed that some observers are too “twitchy”: They perform poorly on the prediction task because they see changes where only random fluctuation exists. Other observers are not twitchy enough, and they perform poorly because they fail to see small changes. We developed a Bayesian changepoint detection model that performed the task optimally, and also provided a good fit to human data when sub-optimal parameter settings were used. Finally, we developed a fast-and-frugal model that showed how participants may be able to perform well at the task using minimal information and simple decision heuristics.

Acknowledgments

We thank Eric-Jan Wagenmakers and Mike Yi for useful discussions related to this work. This work was supported in part by a grant from the US Air Force Office of Scientific Research (AFOSR grant number FA9550-04-1-0317).

References

[1] Gilovich, T., Vallone, R. and Tversky, A. (1985). The hot hand in basketball: on the misperception of random sequences. Cognitive Psychology, 17, 295-314.
[2] Albright, S.C. (1993a). A statistical analysis of hitting streaks in baseball. Journal of the American Statistical Association, 88, 1175-1183.
[3] Stephens, D.A. (1994). Bayesian retrospective multiple changepoint identification. Applied Statistics, 43(1), 159-178.
[4] Carlin, B.P., Gelfand, A.E., & Smith, A.F.M. (1992). Hierarchical Bayesian analysis of changepoint problems. Applied Statistics, 41(2), 389-405.
[5] Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4), 711-732.
[6] Gigerenzer, G., & Goldstein, D.G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103, 650-669.

same-paper 5 0.894651 110 nips-2005-Learning Depth from Single Monocular Images

Author: Ashutosh Saxena, Sung H. Chung, Andrew Y. Ng

6 0.61313409 177 nips-2005-Size Regularized Cut for Data Clustering

7 0.59553903 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

8 0.58370519 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches

9 0.5832482 53 nips-2005-Cyclic Equilibria in Markov Games

10 0.57993454 46 nips-2005-Consensus Propagation

11 0.57374811 153 nips-2005-Policy-Gradient Methods for Planning

12 0.56107539 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

13 0.55855125 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

14 0.55095041 23 nips-2005-An Application of Markov Random Fields to Range Sensing

15 0.55034328 144 nips-2005-Off-policy Learning with Options and Recognizers

16 0.5468747 34 nips-2005-Bayesian Surprise Attracts Human Attention

17 0.53927088 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

18 0.53795177 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception

19 0.53292453 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

20 0.53199172 48 nips-2005-Context as Filtering