nips nips2012 nips2012-2 knowledge-graph by maker-knowledge-mining

2 nips-2012-3D Social Saliency from Head-mounted Cameras


Source: pdf

Author: Hyun S. Park, Eakta Jain, Yaser Sheikh

Abstract: A gaze concurrence is a point in 3D where the gaze directions of two or more people intersect. It is a strong indicator of social saliency because the attention of the participating group is focused on that point. In scenes occupied by large groups of people, multiple concurrences may occur and transition over time. In this paper, we present a method to construct a 3D social saliency ďŹ eld and locate multiple gaze concurrences that occur in a social scene from videos taken by head-mounted cameras. We model the gaze as a cone-shaped distribution emanating from the center of the eyes, capturing the variation of eye-in-head motion. We calibrate the parameters of this distribution by exploiting the ďŹ xed relationship between the primary gaze ray and the head-mounted camera pose. The resulting gaze model enables us to build a social saliency ďŹ eld in 3D. We estimate the number and 3D locations of the gaze concurrences via provably convergent modeseeking in the social saliency ďŹ eld. Our algorithm is applied to reconstruct multiple gaze concurrences in several real world scenes and evaluated quantitatively against motion-captured ground truth. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract A gaze concurrence is a point in 3D where the gaze directions of two or more people intersect. [sent-6, score-1.884]

2 In this paper, we present a method to construct a 3D social saliency ďŹ eld and locate multiple gaze concurrences that occur in a social scene from videos taken by head-mounted cameras. [sent-9, score-1.505]

3 We model the gaze as a cone-shaped distribution emanating from the center of the eyes, capturing the variation of eye-in-head motion. [sent-10, score-0.897]

4 We calibrate the parameters of this distribution by exploiting the ďŹ xed relationship between the primary gaze ray and the head-mounted camera pose. [sent-11, score-1.287]

5 The resulting gaze model enables us to build a social saliency ďŹ eld in 3D. [sent-12, score-1.059]

6 We estimate the number and 3D locations of the gaze concurrences via provably convergent modeseeking in the social saliency ďŹ eld. [sent-13, score-1.311]

7 Our algorithm is applied to reconstruct multiple gaze concurrences in several real world scenes and evaluated quantitatively against motion-captured ground truth. [sent-14, score-1.225]

8 For instance, humans spontaneously orient their gaze to the target of their attention. [sent-29, score-0.872]

9 , at an obnoxious customer at a restaurant, their gaze rays1 converge to a point that we refer to as a gaze concurrence. [sent-32, score-1.755]

10 It is an effective approximation because although an individual’s gaze indicates what he or she is subjectively interested in, a gaze concurrence encodes the consensus of multiple individuals. [sent-34, score-1.826]

11 In this paper, we present a method to reconstruct a 3D social saliency ďŹ eld and localize 3D gaze concurrences from videos taken by head-mounted cameras 1 A gaze ray is a three dimensional ray emitted from the center of eyes and oriented to the point of regard as shown in Figure 1(b). [sent-36, score-2.896]

12 (b) The primary gaze ray is a ďŹ xed 3D ray with respect to the head coordinate system and the gaze ray can be described by an angle with respect to the primary gaze ray. [sent-38, score-3.507]

13 (c) The variation of the eye orientation is parameterized by a Gaussian distribution of the points on the plane, Π, which is normal to the primary gaze ray, l at unit distance from p. [sent-39, score-1.096]

14 (d) The gaze ray model results in a cone-shaped distribution of the point of regard. [sent-40, score-1.09]

15 Our method automatically ďŹ nds the number and location of gaze concurrences that may occur as people form social cliques in an environment. [sent-42, score-1.307]

16 Estimating 3D gaze concurrences requires accurate estimates of the gaze of people who are widely distributed over the social space. [sent-44, score-2.168]

17 As a result, 3D pose estimation of head-mounted cameras provides accurate and spatially unbiased estimates of the primary gaze ray2 . [sent-53, score-1.08]

18 This is enabled by a new model of gaze rays that represents the variation due to eye-in-head motion via a cone-shaped distribution. [sent-57, score-0.97]

19 We present a novel method to calibrate the parameters of this model by leveraging the fact that the primary gaze ray is ďŹ xed with respect to the head-mounted camera in 3D. [sent-58, score-1.287]

20 Given the collection of gaze ray distributions in 3D space, we automatically estimate the number and 3D locations of multiple gaze concurrences via mode-seeking in the social saliency ďŹ eld. [sent-59, score-2.39]

21 Among these signals, gaze direction is one of the most prominent visible signals because it usually indicates what the individual is interested in. [sent-63, score-0.908]

22 In this context, gaze direction estimation has been widely studied in robotics, human-computer interaction, and computer vision [6, 11–22]. [sent-64, score-0.907]

23 Although all these methods can estimate highly accurate gaze direction, either they can be used in a laboratory setting or the device occludes the viewer’s ďŹ eld of view. [sent-69, score-0.894]

24 2 The primary gaze ray is a ďŹ xed eye orientation with respect to the head. [sent-70, score-1.291]

25 2 While the eyes are the primary source of gaze direction, Emery [16] notes that the head orientation is a strong indication of the direction of attention. [sent-72, score-1.2]

26 [24] presented a gaze calibration procedure based on the eye geometry using 4 head-mounted cameras. [sent-84, score-0.974]

27 Each person wears a camera on the head and performs a predeďŹ ned motion for gaze ray calibration based on our gaze ray model (Section 3. [sent-102, score-2.486]

28 From the reconstructed camera poses in conjunction with the gaze ray model, we estimate multiple gaze concurrences in 3D via mode-seeking (Section 3. [sent-106, score-2.385]

29 1 Gaze Ray Model We represent the direction of the viewer’s gaze as a 3D ray that is emitted from the center of the eyes and is directed towards the point of regard, as shown in Figure 1(b). [sent-113, score-1.205]

30 The center of the eyes is ďŹ xed with respect to the head position and therefore, the orientation of the gaze ray in the world coordinate system is a composite of the head orientation and the eye orientation (eye-in-head motion). [sent-114, score-1.648]

31 A headmounted camera does not contain sufďŹ cient information to estimate the gaze ray because it can capture only the head position and orientation but not the eye orientation. [sent-115, score-1.465]

32 , when the point of regard is stationary or slowly moving with respect to the head pose, the eye orientation varies by a small degree [34–36] from the primary gaze ray. [sent-118, score-1.232]

33 We represent the variation of the gaze ray with respect to the primary gaze ray by a Gaussian distribution on a plane normal to the primary gaze ray. [sent-119, score-3.216]

34 The point of regard (and consequently, the gaze ray) is more likely to be near the primary gaze ray. [sent-120, score-1.882]

35 Let us deďŹ ne the primary gaze ray l by the center of the eyes p ∈ R3 , and the unit direction vector, v ∈ R3 in the world coordinate system, đ? [sent-136, score-1.291]

36 Any point on the primary gaze ray can be written as p + đ? [sent-138, score-1.173]

37 Let Π be a plane normal to the primary gaze ray l at unit distance from p, as shown in Figure 1(c). [sent-141, score-1.182]

38 Thus, the distribution of the points on the plane maps to the distribution of the gaze ray by parameterizing the 3D ray as ld (p, vd ) = p+đ? [sent-153, score-1.346]

39 The resulting distribution of 3D points of regard is a cone-shaped distribution whose central axis is the primary gaze ray, i. [sent-156, score-1.011]

40 , a point distribution on any normal plane to the primary gaze ray is a scaled Gaussian centered at the intersection between l and the plane as shown in Figure 1(d). [sent-158, score-1.223]

41 2 Gaze Ray Calibration Algorithm When a person wears a head-mounted camera, it may not be aligned with the direction of the primary gaze ray. [sent-160, score-1.017]

42 The orientation and position offsets between the head-mounted camera and the primary gaze ray must be calibrated to estimate where the person is looking. [sent-162, score-1.389]

43 The relative transform between the primary gaze ray and the camera pose is constant across time because the camera is, for the most part, stationary with respect to the head, đ? [sent-163, score-1.439]

44 Once the relative transform and camera pose have been estimated, the primary gaze ray can be recovered. [sent-165, score-1.314]

45 We learn the primary gaze ray parameters, p and v, with respect to the camera pose and the standard deviation â„Ž of eye-in-head motion. [sent-166, score-1.314]

46 ‘› is the number of the points of regard, we can infer the primary gaze ray parameters with respect to the camera pose. [sent-184, score-1.299]

47 ‘› will form a line which is the primary gaze ray. [sent-188, score-0.97]

48 ‘› will be contained in a cone whose central axis is the direction of the primary gaze ray, v, and whose apex is the center of eyes, p. [sent-192, score-1.103]

49 We ďŹ rst estimate the primary gaze line and then, ďŹ nd the center of the eye on the line to completely describe the primary gaze ray. [sent-193, score-2.032]

50 To estimate the primary gaze line robustly, we embed line estimation by two points in the RANSAC framework [32]3 . [sent-194, score-1.008]

51 ‘– is the projection of x onto the primary gaze ray, lđ? [sent-209, score-0.955]

52 (b) Our gaze ray representation results in the cone-shaped distribution in 3D. [sent-211, score-1.079]

53 (c) Two gaze concurrences are formed by seven gaze rays. [sent-212, score-2.018]

54 3 Gaze Concurrence Estimation via Mode-seeking 3D gaze concurrences are formed at the intersections of multiple gaze rays, not at the intersection of multiple primary gazes (see Figure 1(b)). [sent-293, score-2.123]

55 If we knew the 3D gaze rays, and which of rays shared a gaze concurrence, the point of intersection could be directly estimated via least squares estimation, for example. [sent-294, score-1.822]

56 With a head-mounted camera, only the primary gaze ray is computable; the eye-inhead motion is an unknown quantity. [sent-296, score-1.213]

57 This precludes estimating the 3D gaze concurrence by ďŹ nding a point of intersection, directly. [sent-297, score-0.954]

58 In this section, we present a method to estimate the number and the 3D locations of gaze concurrences given primary gaze rays. [sent-298, score-2.101]

59 Our observations from head-mounted cameras are primary gaze rays. [sent-299, score-1.042]

60 1 produces a distribution of points of regard for each primary gaze ray. [sent-301, score-1.011]

61 We derive the closed form of the mean-shift vector directly from the observed primary gaze rays. [sent-306, score-0.955]

62 ž which evaluate the distance vector between the point, x, and the primary gaze rays lđ? [sent-314, score-1.002]

63 ‘ is the number of gaze rays and â„Žđ? [sent-340, score-0.919]

64 ‘– is a bandwidth set to be the standard deviation of eye-inhead motion obtained from the gaze ray calibration (Section 3. [sent-341, score-1.151]

65 ‘– , which is the projection of x onto the primary gaze ray as shown in Figure 3(a). [sent-368, score-1.162]

66 This distance vector directly captures the distance between l and ld in the gaze ray model (Section 3. [sent-376, score-1.093]

67 Figure 3(c) shows a social saliency ďŹ eld (density ďŹ eld) generated by seven gaze rays. [sent-378, score-1.059]

68 The regions of high density are the gaze concurrences. [sent-379, score-0.894]

69 In the calibration step, we ask people to form pairs, and move back and forth and side to side at least three times to allow the gaze ray model to be accurately estimated. [sent-466, score-1.158]

70 For the initial points of the mean-shift algorithm, we sample several points on the primary gaze rays. [sent-467, score-0.979]

71 , more than one gaze rays must contribute to estimate a gaze concurrence. [sent-471, score-1.791]

72 1 Validation with Motion Capture Data We compare the 3D gaze concurrences estimated by our result with ground truth obtained from a motion capture system (capture volume: 8. [sent-473, score-1.259]

73 Therefore, the 3D gaze concurrences estimated by our algorithm should coincide with the 3D position of the static markers. [sent-480, score-1.17]

74 The top row in Figure 4(a) shows the trajectories of the gaze concurrences (solid lines) overlaid by the static marker positions (dotted lines). [sent-481, score-1.175]

75 The bottom row in Figure 4(a) shows the gaze concurrences (orange and red points) with the ground truth positions (green and blue points) and the conďŹ dence regions (pink region) where a high value of the saliency ďŹ eld is achieved (region which has higher than 80% of the local maximum value). [sent-485, score-1.279]

76 2 Real World Scenes We apply our method to reconstruct 3D gaze concurrences in three real world scenes: a meeting, a musical, and a party. [sent-488, score-1.184]

77 Figures 4(b), 5(a), and 5(b) show the reconstructed gaze concurrences and the projections of 3D gaze concurrences onto the head-mounted camera plane (top row). [sent-489, score-2.458]

78 3D renderings of the gaze concurrences (red dots) with the associated conďŹ dence region (salient region) are drawn in the middle row and the cone-shaped gaze ray models are also shown. [sent-490, score-2.235]

79 The trajectories of the gaze concurrences are shown in the bottom row. [sent-491, score-1.165]

80 The people in each group started to discuss among themselves at the beginning (2 gaze concurrences). [sent-494, score-0.946]

81 After a few minutes, all the people faced the presenter in the middle (50th frame: 1 gaze concurrence), and then they went back to their group to discuss again (445th frame: 2 gaze concurrences) as shown in Figure 4(b). [sent-495, score-1.818]

82 Party scene: there were 11 people forming 4 groups: 3 sat on couches, 3 talked to each other at the table, 3 played table tennis, and 2 played pool (178th frame: 4 gaze concurrences) as shown in Figure 5(b). [sent-502, score-0.93]

83 Then, all moved to watch the table tennis game (710th frame: one gaze concurrence). [sent-503, score-0.884]

84 Our method correctly evaluates the gaze concurrences at the location where people look. [sent-504, score-1.204]

85 5 Discussion In this paper, we present a novel representation for social scene understanding in terms of 3D gaze concurrences. [sent-510, score-1.024]

86 We reconstruct the head-mounted camera poses in 3D using structure from motion and estimate the relationship between the camera pose and the gaze ray. [sent-512, score-1.238]

87 Our mode-seeking algorithm ďŹ nds the multiple time-varying gaze concurrences in 3D. [sent-513, score-1.146]

88 When people’s gaze rays are almost parallel, as in the musical scene (Figure 5(a)), the estimated gaze concurrences become poorly conditioned. [sent-515, score-2.15]

89 The conďŹ dence region is stretched along the direction of the primary gaze rays. [sent-516, score-0.979]

90 For such a scene, head-mounted cameras from different points of views can help to localize the gaze concurrences precisely. [sent-518, score-1.245]

91 A future application of this work will be to use gaze concurrence to allow artiďŹ cial agents, such as robots, to become collaborative team members that recognize and respond to social cues, rather than passive tools that require prompting. [sent-520, score-1.035]

92 Bottom: there are two gaze concurrences with six people. [sent-523, score-1.146]

93 (b) We reconstruct the gaze concurrences for the meeting scene. [sent-524, score-1.186]

94 Top row: images with the reprojection of the gaze concurrences, middle row: rendering of the 3D gaze concurrences with cone-shaped gaze models, bottom row: the trajectories of the gaze concurrences. [sent-526, score-3.781]

95 oblique view front view oblique view oblique view top view oblique view 178th frame: 4 gaze concurrences top view oblique view 710th frame: 1 gaze concurrence (a) Musical scene (b) Party scene Figure 5: (a) We reconstruct the gaze concurrences from musical audiences. [sent-527, score-3.811]

96 (b) We reconstruct the gaze concurrences for the party scene. [sent-529, score-1.187]

97 Top row: images with the reprojection of the gaze concurrences, bottom row: rendering of the 3D gaze concurrences with cone-shaped gaze models. [sent-531, score-2.89]

98 We are interested in studying the spatiotemporal characteristics of the birth and death of gaze concurrences and how they relate to the groups in the scene. [sent-533, score-1.146]

99 General theory of remote gaze estimation using the pupil center and corneal reection. [sent-607, score-0.92]

100 Estimating 3D point-of-regard and visualizing gaze trajectories under natural head movements. [sent-628, score-0.984]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gaze', 0.872), ('concurrences', 0.274), ('ray', 0.207), ('camera', 0.125), ('head', 0.093), ('social', 0.092), ('cameras', 0.087), ('primary', 0.083), ('apex', 0.077), ('saliency', 0.073), ('concurrence', 0.071), ('eye', 0.067), ('eyes', 0.066), ('orientation', 0.062), ('people', 0.058), ('motion', 0.051), ('scene', 0.05), ('oblique', 0.047), ('rays', 0.047), ('regard', 0.044), ('frame', 0.029), ('pose', 0.027), ('person', 0.026), ('scenes', 0.026), ('center', 0.025), ('musical', 0.025), ('direction', 0.024), ('view', 0.024), ('reconstruct', 0.024), ('etra', 0.024), ('cone', 0.022), ('eld', 0.022), ('density', 0.022), ('reconstructed', 0.021), ('calibration', 0.021), ('plane', 0.02), ('videos', 0.02), ('trajectories', 0.019), ('party', 0.017), ('meeting', 0.016), ('modes', 0.016), ('bazzani', 0.016), ('cristani', 0.016), ('sang', 0.016), ('group', 0.016), ('orange', 0.015), ('line', 0.015), ('ground', 0.015), ('world', 0.014), ('geometry', 0.014), ('ld', 0.014), ('vd', 0.014), ('face', 0.014), ('position', 0.014), ('poses', 0.014), ('truth', 0.013), ('capture', 0.013), ('occupied', 0.012), ('teams', 0.012), ('convergences', 0.012), ('corneal', 0.012), ('gazes', 0.012), ('gee', 0.012), ('grapp', 0.012), ('headmounted', 0.012), ('hennessey', 0.012), ('hodgins', 0.012), ('ladies', 0.012), ('menegaz', 0.012), ('munn', 0.012), ('pirri', 0.012), ('rae', 0.012), ('stiefelhagen', 0.012), ('tennis', 0.012), ('tosato', 0.012), ('wears', 0.012), ('yaser', 0.012), ('points', 0.012), ('faces', 0.012), ('signals', 0.012), ('pink', 0.012), ('cliques', 0.011), ('point', 0.011), ('interactions', 0.011), ('vt', 0.011), ('estimation', 0.011), ('consensus', 0.011), ('man', 0.011), ('system', 0.011), ('fathi', 0.011), ('ransac', 0.011), ('rescue', 0.011), ('takemura', 0.011), ('vantage', 0.011), ('estimated', 0.01), ('intersection', 0.01), ('locate', 0.01), ('row', 0.01), ('visual', 0.01), ('understanding', 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 2 nips-2012-3D Social Saliency from Head-mounted Cameras

Author: Hyun S. Park, Eakta Jain, Yaser Sheikh

Abstract: A gaze concurrence is a point in 3D where the gaze directions of two or more people intersect. It is a strong indicator of social saliency because the attention of the participating group is focused on that point. In scenes occupied by large groups of people, multiple concurrences may occur and transition over time. In this paper, we present a method to construct a 3D social saliency ďŹ eld and locate multiple gaze concurrences that occur in a social scene from videos taken by head-mounted cameras. We model the gaze as a cone-shaped distribution emanating from the center of the eyes, capturing the variation of eye-in-head motion. We calibrate the parameters of this distribution by exploiting the ďŹ xed relationship between the primary gaze ray and the head-mounted camera pose. The resulting gaze model enables us to build a social saliency ďŹ eld in 3D. We estimate the number and 3D locations of the gaze concurrences via provably convergent modeseeking in the social saliency ďŹ eld. Our algorithm is applied to reconstruct multiple gaze concurrences in several real world scenes and evaluated quantitatively against motion-captured ground truth. 1

2 0.070730381 256 nips-2012-On the connections between saliency and tracking

Author: Vijay Mahadevan, Nuno Vasconcelos

Abstract: A model connecting visual tracking and saliency has recently been proposed. This model is based on the saliency hypothesis for tracking which postulates that tracking is achieved by the top-down tuning, based on target features, of discriminant center-surround saliency mechanisms over time. In this work, we identify three main predictions that must hold if the hypothesis were true: 1) tracking reliability should be larger for salient than for non-salient targets, 2) tracking reliability should have a dependence on the defining variables of saliency, namely feature contrast and distractor heterogeneity, and must replicate the dependence of saliency on these variables, and 3) saliency and tracking can be implemented with common low level neural mechanisms. We confirm that the first two predictions hold by reporting results from a set of human behavior studies on the connection between saliency and tracking. We also show that the third prediction holds by constructing a common neurophysiologically plausible architecture that can computationally solve both saliency and tracking. This architecture is fully compliant with the standard physiological models of V1 and MT, and with what is known about attentional control in area LIP, while explaining the results of the human behavior experiments.

3 0.060420651 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

Author: Sejong Yoon, Vladimir Pavlovic

Abstract: Probabilistic approaches to computer vision typically assume a centralized setting, with the algorithm granted access to all observed data points. However, many problems in wide-area surveillance can benefit from distributed modeling, either because of physical or computational constraints. Most distributed models to date use algebraic approaches (such as distributed SVD) and as a result cannot explicitly deal with missing data. In this work we present an approach to estimation and learning of generative probabilistic models in a distributed context where certain sensor data can be missing. In particular, we show how traditional centralized models, such as probabilistic PCA and missing-data PPCA, can be learned when the data is distributed across a network of sensors. We demonstrate the utility of this approach on the problem of distributed affine structure from motion. Our experiments suggest that the accuracy of the learned probabilistic structure and motion models rivals that of traditional centralized factorization methods while being able to handle challenging situations such as missing or noisy observations. 1

4 0.034963246 40 nips-2012-Analyzing 3D Objects in Cluttered Images

Author: Mohsen Hejrati, Deva Ramanan

Abstract: We present an approach to detecting and analyzing the 3D configuration of objects in real-world images with heavy occlusion and clutter. We focus on the application of finding and analyzing cars. We do so with a two-stage model; the first stage reasons about 2D shape and appearance variation due to within-class variation (station wagons look different than sedans) and changes in viewpoint. Rather than using a view-based model, we describe a compositional representation that models a large number of effective views and shapes using a small number of local view-based templates. We use this model to propose candidate detections and 2D estimates of shape. These estimates are then refined by our second stage, using an explicit 3D model of shape and viewpoint. We use a morphable model to capture 3D within-class variation, and use a weak-perspective camera model to capture viewpoint. We learn all model parameters from 2D annotations. We demonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D shape reconstruction on challenging images from the PASCAL VOC 2011 dataset. 1

5 0.034771029 195 nips-2012-Learning visual motion in recurrent neural networks

Author: Marius Pachitariu, Maneesh Sahani

Abstract: We present a dynamic nonlinear generative model for visual motion based on a latent representation of binary-gated Gaussian variables. Trained on sequences of images, the model learns to represent different movement directions in different variables. We use an online approximate inference scheme that can be mapped to the dynamics of networks of neurons. Probed with drifting grating stimuli and moving bars of light, neurons in the model show patterns of responses analogous to those of direction-selective simple cells in primary visual cortex. Most model neurons also show speed tuning and respond equally well to a range of motion directions and speeds aligned to the constraint line of their respective preferred speed. We show how these computations are enabled by a specific pattern of recurrent connections learned by the model. 1

6 0.034770239 94 nips-2012-Delay Compensation with Dynamical Synapses

7 0.034345608 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

8 0.030048577 201 nips-2012-Localizing 3D cuboids in single-view images

9 0.029994477 185 nips-2012-Learning about Canonical Views from Internet Image Collections

10 0.028755745 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

11 0.027596902 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

12 0.02403127 194 nips-2012-Learning to Discover Social Circles in Ego Networks

13 0.023344774 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks

14 0.020038718 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

15 0.019341519 286 nips-2012-Random Utility Theory for Social Choice

16 0.019091612 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

17 0.018887108 210 nips-2012-Memorability of Image Regions

18 0.018771652 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

19 0.018357584 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

20 0.018088689 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.048), (1, 0.006), (2, -0.041), (3, 0.012), (4, 0.01), (5, 0.007), (6, 0.0), (7, -0.021), (8, -0.002), (9, 0.002), (10, -0.033), (11, -0.004), (12, 0.013), (13, -0.044), (14, 0.008), (15, 0.018), (16, -0.002), (17, -0.012), (18, -0.029), (19, -0.037), (20, -0.002), (21, 0.012), (22, -0.018), (23, 0.025), (24, -0.01), (25, 0.022), (26, 0.005), (27, 0.047), (28, 0.075), (29, -0.014), (30, 0.013), (31, 0.033), (32, 0.023), (33, -0.026), (34, 0.018), (35, 0.019), (36, 0.061), (37, -0.096), (38, 0.006), (39, 0.039), (40, -0.029), (41, 0.04), (42, -0.015), (43, -0.05), (44, 0.009), (45, 0.034), (46, 0.021), (47, -0.006), (48, -0.034), (49, -0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92135388 2 nips-2012-3D Social Saliency from Head-mounted Cameras

Author: Hyun S. Park, Eakta Jain, Yaser Sheikh

Abstract: A gaze concurrence is a point in 3D where the gaze directions of two or more people intersect. It is a strong indicator of social saliency because the attention of the participating group is focused on that point. In scenes occupied by large groups of people, multiple concurrences may occur and transition over time. In this paper, we present a method to construct a 3D social saliency ďŹ eld and locate multiple gaze concurrences that occur in a social scene from videos taken by head-mounted cameras. We model the gaze as a cone-shaped distribution emanating from the center of the eyes, capturing the variation of eye-in-head motion. We calibrate the parameters of this distribution by exploiting the ďŹ xed relationship between the primary gaze ray and the head-mounted camera pose. The resulting gaze model enables us to build a social saliency ďŹ eld in 3D. We estimate the number and 3D locations of the gaze concurrences via provably convergent modeseeking in the social saliency ďŹ eld. Our algorithm is applied to reconstruct multiple gaze concurrences in several real world scenes and evaluated quantitatively against motion-captured ground truth. 1

2 0.51691931 94 nips-2012-Delay Compensation with Dynamical Synapses

Author: Chi Fung, K. Wong, Si Wu

Abstract: Time delay is pervasive in neural information processing. To achieve real-time tracking, it is critical to compensate the transmission and processing delays in a neural system. In the present study we show that dynamical synapses with shortterm depression can enhance the mobility of a continuous attractor network to the extent that the system tracks time-varying stimuli in a timely manner. The state of the network can either track the instantaneous position of a moving stimulus perfectly (with zero-lag) or lead it with an effectively constant time, in agreement with experiments on the head-direction systems in rodents. The parameter regions for delayed, perfect and anticipative tracking correspond to network states that are static, ready-to-move and spontaneously moving, respectively, demonstrating the strong correlation between tracking performance and the intrinsic dynamics of the network. We also find that when the speed of the stimulus coincides with the natural speed of the network state, the delay becomes effectively independent of the stimulus amplitude.

3 0.49617279 256 nips-2012-On the connections between saliency and tracking

Author: Vijay Mahadevan, Nuno Vasconcelos

Abstract: A model connecting visual tracking and saliency has recently been proposed. This model is based on the saliency hypothesis for tracking which postulates that tracking is achieved by the top-down tuning, based on target features, of discriminant center-surround saliency mechanisms over time. In this work, we identify three main predictions that must hold if the hypothesis were true: 1) tracking reliability should be larger for salient than for non-salient targets, 2) tracking reliability should have a dependence on the defining variables of saliency, namely feature contrast and distractor heterogeneity, and must replicate the dependence of saliency on these variables, and 3) saliency and tracking can be implemented with common low level neural mechanisms. We confirm that the first two predictions hold by reporting results from a set of human behavior studies on the connection between saliency and tracking. We also show that the third prediction holds by constructing a common neurophysiologically plausible architecture that can computationally solve both saliency and tracking. This architecture is fully compliant with the standard physiological models of V1 and MT, and with what is known about attentional control in area LIP, while explaining the results of the human behavior experiments.

4 0.48885873 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

Author: Xue-xin Wei, Alan Stocker

Abstract: A common challenge for Bayesian models of perception is the fact that the two fundamental Bayesian components, the prior distribution and the likelihood function, are formally unconstrained. Here we argue that a neural system that emulates Bayesian inference is naturally constrained by the way it represents sensory information in populations of neurons. More specifically, we show that an efficient coding principle creates a direct link between prior and likelihood based on the underlying stimulus distribution. The resulting Bayesian estimates can show biases away from the peaks of the prior distribution, a behavior seemingly at odds with the traditional view of Bayesian estimation, yet one that has been reported in human perception. We demonstrate that our framework correctly accounts for the repulsive biases previously reported for the perception of visual orientation, and show that the predicted tuning characteristics of the model neurons match the reported orientation tuning properties of neurons in primary visual cortex. Our results suggest that efficient coding is a promising hypothesis in constraining Bayesian models of perceptual inference. 1 Motivation Human perception is not perfect. Biases have been observed in a large number of perceptual tasks and modalities, of which the most salient ones constitute many well-known perceptual illusions. It has been suggested, however, that these biases do not reflect a failure of perception but rather an observer’s attempt to optimally combine the inherently noisy and ambiguous sensory information with appropriate prior knowledge about the world [13, 4, 14]. This hypothesis, which we will refer to as the Bayesian hypothesis, has indeed proven quite successful in providing a normative explanation of perception at a qualitative and, more recently, quantitative level (see e.g. [15]). A major challenge in forming models based on the Bayesian hypothesis is the correct selection of two main components: the prior distribution (belief) and the likelihood function. This has encouraged some to criticize the Bayesian hypothesis altogether, claiming that arbitrary choices for these components always allow for unjustified post-hoc explanations of the data [1]. We do not share this criticism, referring to a number of successful attempts to constrain prior beliefs and likelihood functions based on principled grounds. For example, prior beliefs have been defined as the relative distribution of the sensory variable in the environment in cases where these statistics are relatively easy to measure (e.g. local visual orientations [16]), or where it can be assumed that subjects have learned them over the course of the experiment (e.g. time perception [17]). Other studies have constrained the likelihood function according to known noise characteristics of neurons that are crucially involved in the specific perceptual process (e.g motion tuned neurons in visual cor∗ http://www.sas.upenn.edu/ astocker/lab 1 world neural representation efficient encoding percept Bayesian decoding Figure 1: Encoding-decoding framework. A stimulus representing a sensory variable θ elicits a firing rate response R = {r1 , r2 , ..., rN } in a population of N neurons. The perceptual task is to generate a ˆ good estimate θ(R) of the presented value of the sensory variable based on this population response. Our framework assumes that encoding is efficient, and decoding is Bayesian based on the likelihood p(R|θ), the prior p(θ), and a squared-error loss function. tex [18]). However, we agree that finding appropriate constraints is generally difficult and that prior beliefs and likelihood functions have been often selected on the basis of mathematical convenience. Here, we propose that the efficient coding hypothesis [19] offers a joint constraint on the prior and likelihood function in neural implementations of Bayesian inference. Efficient coding provides a normative description of how neurons encode sensory information, and suggests a direct link between measured perceptual discriminability, neural tuning characteristics, and environmental statistics [11]. We show how this link can be extended to a full Bayesian account of perception that includes perceptual biases. We validate our model framework against behavioral as well as neural data characterizing the perception of visual orientation. We demonstrate that we can account not only for the reported perceptual biases away from the cardinal orientations, but also for the specific response characteristics of orientation-tuned neurons in primary visual cortex. Our work is a novel proposal of how two important normative hypotheses in perception science, namely efficient (en)coding and Bayesian decoding, might be linked. 2 Encoding-decoding framework We consider perception as an inference process that takes place along the simplified neural encodingdecoding cascade illustrated in Fig. 11 . 2.1 Efficient encoding Efficient encoding proposes that the tuning characteristics of a neural population are adapted to the prior distribution p(θ) of the sensory variable such that the population optimally represents the sensory variable [19]. Different definitions of “optimally” are possible, and may lead to different results. Here, we assume an efficient representation that maximizes the mutual information between the sensory variable and the population response. With this definition and an upper limit on the total firing activity, the square-root of the Fisher Information must be proportional to the prior distribution [12, 21]. In order to constrain the tuning curves of individual neurons in the population we also impose a homogeneity constraint, requiring that there exists a one-to-one mapping F (θ) that transforms the ˜ physical space with units θ to a homogeneous space with units θ = F (θ) in which the stimulus distribution becomes uniform. This defines the mapping as θ F (θ) = p(χ)dχ , (1) −∞ which is the cumulative of the prior distribution p(θ). We then assume a neural population with identical tuning curves that evenly tiles the stimulus range in this homogeneous space. The population provides an efficient representation of the sensory variable θ according to the above constraints [11]. ˜ The tuning curves in the physical space are obtained by applying the inverse mapping F −1 (θ). Fig. 2 1 In the context of this paper, we consider ‘inferring’, ‘decoding’, and ‘estimating’ as synonymous. 2 stimulus distribution d samples # a Fisher information discriminability and average firing rates and b firing rate [ Hz] efficient encoding F likelihood function F -1 likelihood c symmetric asymmetric homogeneous space physical space Figure 2: Efficient encoding constrains the likelihood function. a) Prior distribution p(θ) derived from stimulus statistics. b) Efficient coding defines the shape of the tuning curves in the physical space by transforming a set of homogeneous neurons using a mapping F −1 that is the inverse of the cumulative of the prior p(θ) (see Eq. (1)). c) As a result, the likelihood shape is constrained by the prior distribution showing heavier tails on the side of lower prior density. d) Fisher information, discrimination threshold, and average firing rates are all uniform in the homogeneous space. illustrates the applied efficient encoding scheme, the mapping, and the concept of the homogeneous space for the example of a symmetric, exponentially decaying prior distribution p(θ). The key idea here is that by assuming efficient encoding, the prior (i.e. the stimulus distribution in the world) directly constrains the likelihood function. In particular, the shape of the likelihood is determined by the cumulative distribution of the prior. As a result, the likelihood is generally asymmetric, as shown in Fig. 2, exhibiting heavier tails on the side of the prior with lower density. 2.2 Bayesian decoding Let us consider a population of N sensory neurons that efficiently represents a stimulus variable θ as described above. A stimulus θ0 elicits a specific population response that is characterized by the vector R = [r1 , r2 , ..., rN ] where ri is the spike-count of the ith neuron over a given time-window τ . Under the assumption that the variability in the individual firing rates is governed by a Poisson process, we can write the likelihood function over θ as N p(R|θ) = (τ fi (θ))ri −τ fi (θ) e , ri ! i=1 (2) ˆ with fi (θ) describing the tuning curve of neuron i. We then define a Bayesian decoder θLSE as the estimator that minimizes the expected squared-error between the estimate and the true stimulus value, thus θp(R|θ)p(θ)dθ ˆ θLSE (R) = , (3) p(R|θ)p(θ)dθ where we use Bayes’ rule to appropriately combine the sensory evidence with the stimulus prior p(θ). 3 Bayesian estimates can be biased away from prior peaks Bayesian models of perception typically predict perceptual biases toward the peaks of the prior density, a characteristic often considered a hallmark of Bayesian inference. This originates from the 3 a b prior attraction prior prior attraction likelihood repulsion! likelihood c prior prior repulsive bias likelihood likelihood mean posterior mean posterior mean Figure 3: Bayesian estimates biased away from the prior. a) If the likelihood function is symmetric, then the estimate (posterior mean) is, on average, shifted away from the actual value of the sensory variable θ0 towards the prior peak. b) Efficient encoding typically leads to an asymmetric likelihood function whose normalized mean is away from the peak of the prior (relative to θ0 ). The estimate is determined by a combination of prior attraction and shifted likelihood mean, and can exhibit an overall repulsive bias. c) If p(θ0 ) < 0 and the likelihood is relatively narrow, then (1/p(θ)2 ) > 0 (blue line) and the estimate is biased away from the prior peak (see Eq. (6)). common approach of choosing a parametric description of the likelihood function that is computationally convenient (e.g. Gaussian). As a consequence, likelihood functions are typically assumed to be symmetric (but see [23, 24]), leaving the bias of the Bayesian estimator to be mainly determined by the shape of the prior density, i.e. leading to biases toward the peak of the prior (Fig. 3a). In our model framework, the shape of the likelihood function is constrained by the stimulus prior via efficient neural encoding, and is generally not symmetric for non-flat priors. It has a heavier tail on the side with lower prior density (Fig. 3b). The intuition is that due to the efficient allocation of neural resources, the side with smaller prior density will be encoded less accurately, leading to a broader likelihood function on that side. The likelihood asymmetry pulls the Bayes’ least-squares estimate away from the peak of the prior while at the same time the prior pulls it toward its peak. Thus, the resulting estimation bias is the combination of these two counter-acting forces - and both are determined by the prior! 3.1 General derivation of the estimation bias In the following, we will formally derive the mean estimation bias b(θ) of the proposed encodingdecoding framework. Specifically, we will study the conditions for which the bias is repulsive i.e. away from the peak of the prior density. ˆ We first re-write the estimator θLSE (3) by replacing θ with the inverse of its mapping to the homo−1 ˜ geneous space, i.e., θ = F (θ). The motivation for this is that the likelihood in the homogeneous space is symmetric (Fig. 2). Given a value θ0 and the elicited population response R, we can write the estimator as ˜ ˜ ˜ ˜ θp(R|θ)p(θ)dθ F −1 (θ)p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) ˆ θLSE (R) = = . ˜ ˜ ˜ p(R|θ)p(θ)dθ p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) Calculating the derivative of the inverse function and noting that F is the cumulative of the prior density, we get 1 1 1 ˜ ˜ ˜ ˜ ˜ ˜ dθ = dθ. dF −1 (θ) = (F −1 (θ)) dθ = dθ = −1 (θ)) ˜ F (θ) p(θ) p(F ˆ Hence, we can simplify θLSE (R) as ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)p(R|F −1 (θ))dθ . ˜ ˜ p(R|F −1 (θ))dθ With ˜ K(R, θ) = ˜ p(R|F −1 (θ)) ˜ ˜ p(R|F −1 (θ))dθ 4 we can further simplify the notation and get ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)K(R, θ)dθ . (4) ˆ ˜ In order to get the expected value of the estimate, θLSE (θ), we marginalize (4) over the population response space S, ˆ ˜ ˜ ˜ ˜ θLSE (θ) = p(R)F −1 (θ)K(R, θ)dθdR S = F −1 ˜ (θ)( ˜ ˜ p(R)K(R, θ)dR)dθ = ˜ ˜ ˜ F −1 (θ)L(θ)dθ, S where we define ˜ L(θ) = ˜ p(R)K(R, θ)dR. S ˜ ˜ ˜ It follows that L(θ)dθ = 1. Due to the symmetry in this space, it can be shown that L(θ) is ˜0 . Intuitively, L(θ) can be thought as the normalized ˜ symmetric around the true stimulus value θ average likelihood in the homogeneous space. We can then compute the expected bias at θ0 as b(θ0 ) = ˜ ˜ ˜ ˜ F −1 (θ)L(θ)dθ − F −1 (θ0 ) (5) ˜ This is expression is general where F −1 (θ) is defined as the inverse of the cumulative of an arbitrary ˜ prior density p(θ) (see Eq. (1)) and the dispersion of L(θ) is determined by the internal noise level. ˜ ˜ Assuming the prior density to be smooth, we expand F −1 in a neighborhood (θ0 − h, θ0 + h) that is larger than the support of the likelihood function. Using Taylor’s theorem with mean-value forms of the remainder, we get 1 ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ F −1 (θ) = F −1 (θ0 ) + F −1 (θ0 ) (θ − θ0 ) + F −1 (θx ) (θ − θ0 )2 , 2 ˜ ˜ ˜ with θx lying between θ0 and θ. By applying this expression to (5), we find ˜ θ0 +h b(θ0 ) = = 1 2 ˜ θ0 −h 1 −1 ˜ ˜ ˜ ˜ ˜ 1 F (θx )θ (θ − θ0 )2 L(θ)dθ = ˜ 2 2 ˜ θ0 +h −( ˜ θ0 −h p(θx )θ ˜ ˜ 2 ˜ ˜ 1 )(θ − θ0 ) L(θ)dθ = p(θx )3 4 ˜ θ0 +h 1 ˜ − θ0 )2 L(θ)dθ ˜ ˜ ˜ ( ) ˜(θ ˜ p(F −1 (θx )) θ ( 1 ˜ ˜ ˜ ˜ ) (θ − θ0 )2 L(θ)dθ. p(θx )2 θ ˜ θ0 −h ˜ θ0 +h ˜ θ0 −h In general, there is no simple rule to judge the sign of b(θ0 ). However, if the prior is monotonic ˜ ˜ on the interval F −1 ((θ0 − h, θ0 + h)), then the sign of ( p(θ1 )2 ) is always the same as the sign of x 1 1 ( p(θ0 )2 ) . Also, if the likelihood is sufficiently narrow we can approximate ( p(θ1 )2 ) by ( p(θ0 )2 ) , x and therefore approximate the bias as b(θ0 ) ≈ C( 1 ) , p(θ0 )2 (6) where C is a positive constant. The result is quite surprising because it states that as long as the prior is monotonic over the support of the likelihood function, the expected estimation bias is always away from the peaks of the prior! 3.2 Internal (neural) versus external (stimulus) noise The above derivation of estimation bias is based on the assumption that all uncertainty about the sensory variable is caused by neural response variability. This level of internal noise depends on the response magnitude, and thus can be modulated e.g. by changing stimulus contrast. This contrastcontrolled noise modulation is commonly exploited in perceptual studies (e.g. [18]). Internal noise will always lead to repulsive biases in our framework if the prior is monotonic. If internal noise is low, the likelihood is narrow and thus the bias is small. Increasing internal noise leads to increasingly 5 larger biases up to the point where the likelihood becomes wide enough such that monotonicity of the prior over the support of the likelihood is potentially violated. Stimulus noise is another way to modulate the noise level in perception (e.g. random-dot motion stimuli). Such external noise, however, has a different effect on the shape of the likelihood function as compared to internal noise. It modifies the likelihood function (2) by convolving it with the noise kernel. External noise is frequently chosen as additive and symmetric (e.g. zero-mean Gaussian). It is straightforward to prove that such symmetric external noise does not lead to a change in the mean of the likelihood, and thus does not alter the repulsive effect induced by its asymmetry. However, by increasing the overall width of the likelihood, the attractive influence of the prior increases, resulting in an estimate that is closer to the prior peak than without external noise2 . 4 Perception of visual orientation We tested our framework by modelling the perception of visual orientation. Our choice was based on the fact that i) we have pretty good estimates of the prior distribution of local orientations in natural images, ii) tuning characteristics of orientation selective neurons in visual cortex are wellstudied (monkey/cat), and iii) biases in perceived stimulus orientation have been well characterized. We start by creating an efficient neural population based on measured prior distributions of local visual orientation, and then compare the resulting tuning characteristics of the population and the predicted perceptual biases with reported data in the literature. 4.1 Efficient neural model population for visual orientation Previous studies measured the statistics of the local orientation in large sets of natural images and consistently found that the orientation distribution is multimodal, peaking at the two cardinal orientations as shown in Fig. 4a [16, 20]. We assumed that the visual system’s prior belief over orientation p(θ) follows this distribution and approximate it formally as p(θ) ∝ 2 − | sin(θ)| (black line in Fig. 4b) . (7) Based on this prior distribution we defined an efficient neural representation for orientation. We assumed a population of model neurons (N = 30) with tuning curves that follow a von-Mises distribution in the homogeneous space on top of a constant spontaneous firing rate (5 Hz). We then ˜ applied the inverse transformation F −1 (θ) to all these tuning curves to get the corresponding tuning curves in the physical space (Fig. 4b - red curves), where F (θ) is the cumulative of the prior (7). The concentration parameter for the von-Mises tuning curves was set to κ ≈ 1.6 in the homogeneous space in order to match the measured average tuning width (∼ 32 deg) of neurons in area V1 of the macaque [9]. 4.2 Predicted tuning characteristics of neurons in primary visual cortex The orientation tuning characteristics of our model population well match neurophysiological data of neurons in primary visual cortex (V1). Efficient encoding predicts that the distribution of neurons’ preferred orientation follows the prior, with more neurons tuned to cardinal than oblique orientations by a factor of approximately 1.5. A similar ratio has been found for neurons in area V1 of monkey/cat [9, 10]. Also, the tuning widths of the model neurons vary between 25-42 deg depending on their preferred tuning (see Fig. 4c), matching the measured tuning width ratio of 0.6 between neurons tuned to the cardinal versus oblique orientations [9]. An important prediction of our model is that most of the tuning curves should be asymmetric. Such asymmetries have indeed been reported for the orientation tuning of neurons in area V1 [6, 7, 8]. We computed the asymmetry index for our model population as defined in previous studies [6, 7], and plotted it as a function of the preferred tuning of each neuron (Fig. 4d). The overall asymmetry index in our model population is 1.24 ± 0.11, which approximately matches the measured values for neurons in area V1 of the cat (1.26 ± 0.06) [6]. It also predicts that neurons tuned to the cardinal and oblique orientations should show less symmetry than those tuned to orientations in between. Finally, 2 Note, that these predictions are likely to change if the external noise is not symmetric. 6 a b 25 firing rate(Hz) 0 orientation(deg) asymmetry vs. tuning width 1.0 2.0 90 2.0 e asymmetry 1.0 0 asymmetry index 50 30 width (deg) 10 90 preferred tuning(deg) -90 0 d 0 0 90 asymmetry index 0 orientation(deg) tuning width -90 0 0 probability 0 -90 c efficient representation 0.01 0.01 image statistics -90 0 90 preferred tuning(deg) 25 30 35 40 tuning width (deg) Figure 4: Tuning characteristics of model neurons. a) Distribution of local orientations in natural images, replotted from [16]. b) Prior used in the model (black) and predicted tuning curves according to efficient coding (red). c) Tuning width as a function of preferred orientation. d) Tuning curves of cardinal and oblique neurons are more symmetric than those tuned to orientations in between. e) Both narrowly and broadly tuned neurons neurons show less asymmetry than neurons with tuning widths in between. neurons with tuning widths at the lower and upper end of the range are predicted to exhibit less asymmetry than those neurons whose widths lie in between these extremes (illustrated in Fig. 4e). These last two predictions have not been tested yet. 4.3 Predicted perceptual biases Our model framework also provides specific predictions for the expected perceptual biases. Humans show systematic biases in perceived orientation of visual stimuli such as e.g. arrays of Gabor patches (Fig. 5a,d). Two types of biases can be distinguished: First, perceived orientations show an absolute bias away from the cardinal orientations, thus away from the peaks of the orientation prior [2, 3]. We refer to these biases as absolute because they are typically measured by adjusting a noise-free reference until it matched the orientation of the test stimulus. Interestingly, these repulsive absolute biases are the larger the smaller the external stimulus noise is (see Fig. 5b). Second, the relative bias between the perceived overall orientations of a high-noise and a low-noise stimulus is toward the cardinal orientations as shown in Fig. 5c, and thus toward the peak of the prior distribution [3, 16]. The predicted perceptual biases of our model are shown Fig. 5e,f. We computed the likelihood function according to (2) and used the prior in (7). External noise was modeled by convolving the stimulus likelihood function with a Gaussian (different widths for different noise levels). The predictions well match both, the reported absolute bias away as well as the relative biases toward the cardinal orientations. Note, that our model framework correctly accounts for the fact that less external noise leads to larger absolute biases (see also discussion in section 3.2). 5 Discussion We have presented a modeling framework for perception that combines efficient (en)coding and Bayesian decoding. Efficient coding imposes constraints on the tuning characteristics of a population of neurons according to the stimulus distribution (prior). It thus establishes a direct link between prior and likelihood, and provides clear constraints on the latter for a Bayesian observer model of perception. We have shown that the resulting likelihoods are in general asymmetric, with 7 absolute bias (data) b c relative bias (data) -4 0 bias(deg) 4 a low-noise stimulus -90 e 90 absolute bias (model) low external noise high external noise 3 high-noise stimulus -90 f 0 90 relative bias (model) 0 bias(deg) d 0 attraction -3 repulsion -90 0 orientation (deg) 90 -90 0 orientation (deg) 90 Figure 5: Biases in perceived orientation: Human data vs. Model prediction. a,d) Low- and highnoise orientation stimuli of the type used in [3, 16]. b) Humans show absolute biases in perceived orientation that are away from the cardinal orientations. Data replotted from [2] (pink squares) and [3] (green (black) triangles: bias for low (high) external noise). c) Relative bias between stimuli with different external noise level (high minus low). Data replotted from [3] (blue triangles) and [16] (red circles). e,f) Model predictions for absolute and relative bias. heavier tails away from the prior peaks. We demonstrated that such asymmetric likelihoods can lead to the counter-intuitive prediction that a Bayesian estimator is biased away from the peaks of the prior distribution. Interestingly, such repulsive biases have been reported for human perception of visual orientation, yet a principled and consistent explanation of their existence has been missing so far. Here, we suggest that these counter-intuitive biases directly follow from the asymmetries in the likelihood function induced by efficient neural encoding of the stimulus. The good match between our model predictions and the measured perceptual biases and orientation tuning characteristics of neurons in primary visual cortex provides further support of our framework. Previous work has suggested that there might be a link between stimulus statistics, neuronal tuning characteristics, and perceptual behavior based on efficient coding principles, yet none of these studies has recognized the importance of the resulting likelihood asymmetries [16, 11]. We have demonstrated here that such asymmetries can be crucial in explaining perceptual data, even though the resulting estimates appear “anti-Bayesian” at first sight (see also models of sensory adaptation [23]). Note, that we do not provide a neural implementation of the Bayesian inference step. However, we and others have proposed various neural decoding schemes that can approximate Bayes’ leastsquares estimation using efficient coding [26, 25, 22]. It is also worth pointing out that our estimator is set to minimize total squared-error, and that other choices of the loss function (e.g. MAP estimator) could lead to different predictions. Our framework is general and should be directly applicable to other modalities. In particular, it might provide a new explanation for perceptual biases that are hard to reconcile with traditional Bayesian approaches [5]. Acknowledgments We thank M. Jogan and A. Tank for helpful comments on the manuscript. This work was partially supported by grant ONR N000141110744. 8 References [1] M. Jones, and B. C. Love. Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34, 169–231,2011. [2] D. P. Andrews. Perception of contours in the central fovea. Nature, 205:1218- 1220, 1965. [3] A. Tomassini, M. J.Morgam. and J. A. Solomon. Orientation uncertainty reduces perceived obliquity. Vision Res, 50, 541–547, 2010. [4] W. S. Geisler, D. Kersten. Illusions, perception and Bayes. Nature Neuroscience, 5(6):508- 510, 2002. [5] M. O. Ernst Perceptual learning: inverting the size-weight illusion. Current Biology, 19:R23- R25, 2009. [6] G. H. Henry, B. Dreher, P. O. Bishop. Orientation specificity of cells in cat striate cortex. J Neurophysiol, 37(6):1394-409,1974. [7] D. Rose, C. Blakemore An analysis of orientation selectivity in the cat’s visual cortex. Exp Brain Res., Apr 30;20(1):1-17, 1974. [8] N. V. Swindale. Orientation tuning curves: empirical description and estimation of parameters. Biol Cybern., 78(1):45-56, 1998. [9] R. L. De Valois, E. W. Yund, N. Hepler. The orientation and direction selectivity of cells in macaque visual cortex. Vision Res.,22, 531544,1982. [10] B. Li, M. R. Peterson, R. D. Freeman. The oblique effect: a neural basis in the visual cortex. J. Neurophysiol., 90, 204217, 2003. [11] D. Ganguli and E.P. Simoncelli. Implicit encoding of prior probabilities in optimal neural populations. In Adv. Neural Information Processing Systems NIPS 23, vol. 23:658–666, 2011. [12] M. D. McDonnell, N. G. Stocks. Maximally Informative Stimuli and Tuning Curves for Sigmoidal RateCoding Neurons and Populations. Phys Rev Lett., 101(5):058103, 2008. [13] H Helmholtz. Treatise on Physiological Optics (transl.). Thoemmes Press, Bristol, U.K., 2000. Original publication 1867. [14] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percept. Nature Neuroscience, 5(6):598–604, June 2002. [15] D.C. Knill and W. Richards, editors. Perception as Bayesian Inference. Cambridge University Press, 1996. [16] A R Girshick, M S Landy, and E P Simoncelli. Cardinal rules: visual orientation perception reflects knowledge of environmental statistics. Nat Neurosci, 14(7):926–932, Jul 2011. [17] M. Jazayeri and M.N. Shadlen. Temporal context calibrates interval timing. Nature Neuroscience, 13(8):914–916, 2010. [18] A.A. Stocker and E.P. Simoncelli. Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience, pages 578–585, April 2006. [19] H.B. Barlow. Possible principles underlying the transformation of sensory messages. In W.A. Rosenblith, editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA, 1961. [20] D.M. Coppola, H.R. Purves, A.N. McCoy, and D. Purves The distribution of oriented contours in the real world. Proc Natl Acad Sci U S A., 95(7): 4002–4006, 1998. [21] N. Brunel and J.-P. Nadal. Mutual information, Fisher information and population coding. Neural Computation, 10, 7, 1731–1757, 1998. [22] X-X. Wei and A.A. Stocker. Bayesian inference with efficient neural population codes. In Lecture Notes in Computer Science, Artificial Neural Networks and Machine Learning - ICANN 2012, Lausanne, Switzerland, volume 7552, pages 523–530, 2012. [23] A.A. Stocker and E.P. Simoncelli. Sensory adaptation within a Bayesian framework for perception. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages o 1291–1298. MIT Press, Cambridge, MA, 2006. Oral presentation. [24] D.C. Knill. Robust cue integration: A Bayesian model and evidence from cue-conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):1–24, 2007. [25] Deep Ganguli. Efficient coding and Bayesian inference with neural populations. PhD thesis, Center for Neural Science, New York University, New York, NY, September 2012. [26] B. Fischer. Bayesian estimates from heterogeneous population codes. In Proc. IEEE Intl. Joint Conf. on Neural Networks. IEEE, 2010. 9

5 0.47312438 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

Author: Yanping Huang, Timothy Hanks, Mike Shadlen, Abram L. Friesen, Rajesh P. Rao

Abstract: How does the brain combine prior knowledge with sensory evidence when making decisions under uncertainty? Two competing descriptive models have been proposed based on experimental data. The first posits an additive offset to a decision variable, implying a static effect of the prior. However, this model is inconsistent with recent data from a motion discrimination task involving temporal integration of uncertain sensory evidence. To explain this data, a second model has been proposed which assumes a time-varying influence of the prior. Here we present a normative model of decision making that incorporates prior knowledge in a principled way. We show that the additive offset model and the time-varying prior model emerge naturally when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). Decision making in the model reduces to (1) computing beliefs given observations and prior information in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. We show that the model can explain both data previously explained using the additive offset model as well as more recent data on the time-varying influence of prior knowledge on decision making. 1

6 0.41650936 185 nips-2012-Learning about Canonical Views from Internet Image Collections

7 0.41129324 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

8 0.40161452 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

9 0.3797304 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

10 0.37953919 201 nips-2012-Localizing 3D cuboids in single-view images

11 0.37345213 40 nips-2012-Analyzing 3D Objects in Cluttered Images

12 0.36046791 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

13 0.35898277 195 nips-2012-Learning visual motion in recurrent neural networks

14 0.35467181 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

15 0.35057509 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

16 0.34417859 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

17 0.33842164 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L p$ Loss

18 0.33691472 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

19 0.33549824 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

20 0.32395265 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (17, 0.013), (21, 0.052), (37, 0.012), (38, 0.049), (42, 0.018), (44, 0.011), (46, 0.014), (54, 0.033), (55, 0.014), (60, 0.023), (61, 0.331), (74, 0.076), (76, 0.112), (80, 0.05), (92, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75700492 2 nips-2012-3D Social Saliency from Head-mounted Cameras

Author: Hyun S. Park, Eakta Jain, Yaser Sheikh

Abstract: A gaze concurrence is a point in 3D where the gaze directions of two or more people intersect. It is a strong indicator of social saliency because the attention of the participating group is focused on that point. In scenes occupied by large groups of people, multiple concurrences may occur and transition over time. In this paper, we present a method to construct a 3D social saliency ďŹ eld and locate multiple gaze concurrences that occur in a social scene from videos taken by head-mounted cameras. We model the gaze as a cone-shaped distribution emanating from the center of the eyes, capturing the variation of eye-in-head motion. We calibrate the parameters of this distribution by exploiting the ďŹ xed relationship between the primary gaze ray and the head-mounted camera pose. The resulting gaze model enables us to build a social saliency ďŹ eld in 3D. We estimate the number and 3D locations of the gaze concurrences via provably convergent modeseeking in the social saliency ďŹ eld. Our algorithm is applied to reconstruct multiple gaze concurrences in several real world scenes and evaluated quantitatively against motion-captured ground truth. 1

2 0.63637024 22 nips-2012-A latent factor model for highly multi-relational data

Author: Rodolphe Jenatton, Nicolas L. Roux, Antoine Bordes, Guillaume R. Obozinski

Abstract: Many data such as social networks, movie preferences or knowledge bases are multi-relational, in that they describe multiple relations between entities. While there is a large body of work focused on modeling these data, modeling these multiple types of relations jointly remains challenging. Further, existing approaches tend to breakdown when the number of these types grows. In this paper, we propose a method for modeling large multi-relational datasets, with possibly thousands of relations. Our model is based on a bilinear structure, which captures various orders of interaction of the data, and also shares sparse latent factors across different relations. We illustrate the performance of our approach on standard tensor-factorization datasets where we attain, or outperform, state-of-the-art results. Finally, a NLP application demonstrates our scalability and the ability of our model to learn efficient and semantically meaningful verb representations. 1

3 0.6272679 190 nips-2012-Learning optimal spike-based representations

Author: Ralph Bourdoukan, David Barrett, Sophie Deneve, Christian K. Machens

Abstract: How can neural networks learn to represent information optimally? We answer this question by deriving spiking dynamics and learning dynamics directly from a measure of network performance. We find that a network of integrate-and-fire neurons undergoing Hebbian plasticity can learn an optimal spike-based representation for a linear decoder. The learning rule acts to minimise the membrane potential magnitude, which can be interpreted as a representation error after learning. In this way, learning reduces the representation error and drives the network into a robust, balanced regime. The network becomes balanced because small representation errors correspond to small membrane potentials, which in turn results from a balance of excitation and inhibition. The representation is robust because neurons become self-correcting, only spiking if the representation error exceeds a threshold. Altogether, these results suggest that several observed features of cortical dynamics, such as excitatory-inhibitory balance, integrate-and-fire dynamics and Hebbian plasticity, are signatures of a robust, optimal spike-based code. A central question in neuroscience is to understand how populations of neurons represent information and how they learn to do so. Usually, learning and information representation are treated as two different functions. From the outset, this separation seems like a good idea, as it reduces the problem into two smaller, more manageable chunks. Our approach, however, is to study these together. This allows us to treat learning and information representation as two sides of a single mechanism, operating at two different timescales. Experimental work has given us several clues about the regime in which real networks operate in the brain. Some of the most prominent observations are: (a) high trial-to-trial variability—a neuron responds differently to repeated, identical inputs [1, 2]; (b) asynchronous firing at the network level—spike trains of different neurons are at most very weakly correlated [3, 4, 5]; (c) tight balance of excitation and inhibition—every excitatory input is met by an inhibitory input of equal or greater size [6, 7, 8] and (4) spike-timing-dependent plasticity (STDP)—the strength of synapses change as a function of presynaptic and postsynaptic spike times [9]. Previously, it has been shown that observations (a)–(c) can be understood as signatures of an optimal, spike-based code [10, 11]. The essential idea is to derive spiking dynamics from the assumption that neurons only fire if their spike improves information representation. Information in a network may ∗ Authors contributed equally 1 originate from several possible sources: external sensory input, external neural network input, or alternatively, it may originate within the network itself as a memory, or as a computation. Whatever the source, this initial assumption leads directly to the conclusion that a network of integrate-and-fire neurons can optimally represent a signal while exhibiting properties (a)–(c). A major problem with this framework is that network connectivity must be completely specified a priori, and requires the tuning of N 2 parameters, where N is the number of neurons in the network. Although this is feasible mathematically, it is unclear how a real network could tune itself into this optimal regime. In this work, we solve this problem using a simple synaptic learning rule. The key insight is that the plasticity rule can be derived from the same basic principle as the spiking rule in the earlier work—namely, that any change should improve information representation. Surprisingly, this can be achieved with a local, Hebbian learning rule, where synaptic plasticity is proportional to the product of presynaptic firing rates with post-synaptic membrane potentials. Spiking and synaptic plasticity then work hand in hand towards the same goal: the spiking of a neuron decreases the representation error on a fast time scale, thereby giving rise to the actual population representation; synaptic plasticity decreases the representation error on a slower time scale, thereby improving or maintaining the population representation. For a large set of initial connectivities and spiking dynamics, neural networks are driven into a balanced regime, where excitation and inhibition cancel each other and where spike trains are asynchronous and irregular. Furthermore, the learning rule that we derive reproduces the main features of STDP (property (d) above). In this way, a network can learn to represent information optimally, with synaptic, neural and network dynamics consistent with those observed experimentally. 1 Derivation of the learning rule for a single neuron We begin by deriving a learning rule for a single neuron with an autapse (a self-connection) (Fig. 1A). Our approach is to derive synaptic dynamics for the autapse and spiking dynamics for the neuron such that the neuron learns to optimally represent a time-varying input signal. We will derive a learning rule for networks of neurons later, after we have developed the fundamental concepts for the single neuron case. Our first step is to derive optimal spiking dynamics for the neuron, so that we have a target for our learning rule. We do this by making two simple assumptions [11]. First, we assume that the neuron can provide an estimate or read-out x(t) of a time-dependent signal x(t) by filtering its spike train ˆ o(t) as follows: ˙ x(t) = −ˆ(t) + Γo(t), ˆ x (1) where Γ is a fixed read-out weight, which we will refer to as the neuron’s “output kernel” and the spike train can be written as o(t) = i δ(t − ti ), where {ti } are the spike times. Next, we assume that the neuron only produces a spike if that spike improves the read-out, where we measure the read-out performance through a simple squared-error loss function: 2 L(t) = x(t) − x(t) . ˆ (2) With these two assumptions, we can now derive optimal spiking dynamics. First, we observe that if the neuron produces an additional spike at time t, the read-out increases by Γ, and the loss function becomes L(t|spike) = (x(t) − (x(t) + Γ))2 . This allows us to restate our spiking rule as follows: ˆ the neuron should only produce a spike if L(t|no spike) > L(t|spike), or (x(t) − x(t))2 > (x(t) − ˆ (x(t) + Γ))2 . Now, squaring both sides of this inequality, defining V (t) ≡ Γ(x(t) − x(t)) and ˆ ˆ defining T ≡ Γ2 /2 we find that the neuron should only spike if: V (t) > T. (3) We interpret V (t) to be the membrane potential of the neuron, and we interpret T as the spike threshold. This interpretation allows us to understand the membrane potential functionally: the voltage is proportional to a prediction error—the difference between the read-out x(t) and the actual ˆ signal x(t). A spike is an error reduction mechanism—the neuron only spikes if the error exceeds the spike threshold. This is a greedy minimisation, in that the neuron fires a spike whenever that action decreases L(t) without considering the future impact of that spike. Importantly, the neuron does not require direct access to the loss function L(t). 2 To determine the membrane potential dynamics, we take the derivative of the voltage, which gives ˙ ˙ us V = Γ(x − x). (Here, and in the following, we will drop the time index for notational brevity.) ˙ ˆ ˙ Now, using Eqn. (1) we obtain V = Γx − Γ(−x + Γo) = −Γ(x − x) + Γ(x + x) − Γ2 o, so that: ˙ ˆ ˆ ˙ ˙ V = −V + Γc − Γ2 o, (4) where c = x + x is the neural input. This corresponds exactly to the dynamics of a leaky integrate˙ and-fire neuron with an inhibitory autapse1 of strength Γ2 , and a feedforward connection strength Γ. The dynamics and connectivity guarantee that a neuron spikes at just the right times to optimise the loss function (Fig. 1B). In addition, it is especially robust to noise of different forms, because of its error-correcting nature. If x is constant in time, the voltage will rise up to the threshold T at which point a spike is fired, adding a delta function to the spike train o at time t, thereby producing a read-out x that is closer to x and causing an instantaneous drop in the voltage through the autapse, ˆ by an amount Γ2 = 2T , effectively resetting the voltage to V = −T . We now have a target for learning—we know the connection strength that a neuron must have at the end of learning if it is to represent information optimally, for a linear read-out. We can use this target to derive synaptic dynamics that can learn an optimal representation from experience. Specifically, we consider an integrate-and-fire neuron with some arbitrary autapse strength ω. The dynamics of this neuron are given by ˙ V = −V + Γc − ωo. (5) This neuron will not produce the correct spike train for representing x through a linear read-out (Eqn. (1)) unless ω = Γ2 . Our goal is to derive a dynamical equation for the synapse ω so that the spike train becomes optimal. We do this by quantifying the loss that we are incurring by using the suboptimal strength, and then deriving a learning rule that minimises this loss with respect to ω. The loss function underlying the spiking dynamics determined by Eqn. (5) can be found by reversing the previous membrane potential analysis. First, we integrate the differential equation for V , assuming that ω changes on time scales much slower than the membrane potential. We obtain the following (formal) solution: V = Γx − ω¯, o (6) ˙ where o is determined by o = −¯ + o. The solution to this latter equation is o = h ∗ o, a convolution ¯ ¯ o ¯ of the spike train with the exponential kernel h(τ ) = θ(τ ) exp(−τ ). As such, it is analogous to the instantaneous firing rate of the neuron. Now, using Eqn. (6), and rewriting the read-out as x = Γ¯, we obtain the loss incurred by the ˆ o sub-optimal neuron, L = (x − x)2 = ˆ 1 V 2 + 2(ω − Γ2 )¯ + (ω − Γ2 )2 o2 . o ¯ Γ2 (7) We observe that the last two terms of Eqn. (7) will vanish whenever ω = Γ2 , i.e., when the optimal reset has been found. We can therefore simplify the problem by defining an alternative loss function, 1 2 V , (8) 2 which has the same minimum as the original loss (V = 0 or x = x, compare Eqn. (2)), but yields a ˆ simpler learning algorithm. We can now calculate how changes to ω affect LV : LV = ∂LV ∂V ∂o ¯ =V = −V o − V ω ¯ . (9) ∂ω ∂ω ∂ω We can ignore the last term in this equation (as we will show below). Finally, using simple gradient descent, we obtain a simple Hebbian-like synaptic plasticity rule: τω = − ˙ ∂LV = V o, ¯ ∂ω (10) where τ is the learning time constant. 1 This contribution of the autapse can also be interpreted as the reset of an integrate-and-fire neuron. Later, when we generalise to networks of neurons, we shall employ this interpretation. 3 This synaptic learning rule is capable of learning the synaptic weight ω that minimises the difference between x and x (Fig. 1B). During learning, the synaptic weight changes in proportion to the postˆ synaptic voltage V and the pre-synaptic firing rate o (Fig. 1C). As such, this is a Hebbian learning ¯ rule. Of course, in this single neuron case, the pre-synaptic neuron and post-synaptic neuron are the same neuron. The synaptic weight gradually approaches its optimal value Γ2 . However, it never completely stabilises, because learning never stops as long as neurons are spiking. Instead, the synapse oscillates closely about the optimal value (Fig. 1D). This is also a “greedy” learning rule, similar to the spiking rule, in that it seeks to minimise the error at each instant in time, without regard for the future impact of those changes. To demonstrate that the second term in Eqn. (5) can be neglected we note that the equations for V , o, and ω define a system ¯ of coupled differential equations that can be solved analytically by integrating between spikes. This results in a simple recurrence relation for changes in ω from the ith to the (i + 1)th spike, ωi+1 = ωi + ωi (ωi − 2T ) . τ (T − Γc − ωi ) (11) This iterative equation has a single stable fixed point at ω = 2T = Γ2 , proving that the neuron’s autaptic weight or reset will approach the optimal solution. 2 Learning in a homogeneous network We now generalise our learning rule derivation to a network of N identical, homogeneously connected neurons. This generalisation is reasonably straightforward because many characteristics of the single neuron case are shared by a network of identical neurons. We will return to the more general case of heterogeneously connected neurons in the next section. We begin by deriving optimal spiking dynamics, as in the single neuron case. This provides a target for learning, which we can then use to derive synaptic dynamics. As before, we want our network to produce spikes that optimally represent a variable x for a linear read-out. We assume that the read-out x is provided by summing and filtering the spike trains of all the neurons in the network: ˆ ˙ x = −ˆ + Γo, ˆ x (12) 2 where the row vector Γ = (Γ, . . . , Γ) contains the read-out weights of the neurons and the column vector o = (o1 , . . . , oN ) their spike trains. Here, we have used identical read-out weights for each neuron, because this indirectly leads to homogeneous connectivity, as we will demonstrate. Next, we assume that a neuron only spikes if that spike reduces a loss-function. This spiking rule is similar to the single neuron spiking rule except that this time there is some ambiguity about which neuron should spike to represent a signal. Indeed, there are many different spike patterns that provide exactly the same estimate x. For example, one neuron could fire regularly at a high rate (exactly like ˆ our previous single neuron example) while all others are silent. To avoid this firing rate ambiguity, we use a modified loss function, that selects amongst all equivalent solutions, those with the smallest neural firing rates. We do this by adding a ‘metabolic cost’ term to our loss function, so that high firing rates are penalised: ¯ L = (x − x)2 + µ o 2 , ˆ (13) where µ is a small positive constant that controls the cost-accuracy trade-off, akin to a regularisation parameter. Each neuron in the optimal network will seek to reduce this loss function by firing a spike. Specifically, the ith neuron will spike whenever L(no spike in i) > L(spike in i). This leads to the following spiking rule for the ith neuron: Vi > Ti (14) where Vi ≡ Γ(x − x) − µoi and Ti ≡ Γ2 /2 + µ/2. We can naturally interpret Vi as the membrane ˆ potential of the ith neuron and Ti as the spiking threshold of that neuron. As before, we can now derive membrane potential dynamics: ˙ V = −V + ΓT c − (ΓT Γ + µI)o, 2 (15) The read-out weights must scale as Γ ∼ 1/N so that firing rates are not unrealistically small in large networks. We can see this by calculating the average firing rate N oi /N ≈ x/(ΓN ) ∼ O(N/N ) ∼ O(1). i=1 ¯ 4 where I is the identity matrix and ΓT Γ + µI is the network connectivity. We can interpret the selfconnection terms {Γ2 +µ} as voltage resets that decrease the voltage of any neuron that spikes. This optimal network is equivalent to a network of identical integrate-and-fire neurons with homogeneous inhibitory connectivity. The network has some interesting dynamical properties. The voltages of all the neurons are largely synchronous, all increasing to the spiking threshold at about the same time3 (Fig. 1F). Nonetheless, neural spiking is asynchronous. The first neuron to spike will reset itself by Γ2 + µ, and it will inhibit all the other neurons in the network by Γ2 . This mechanism prevents neurons from spik- x 3 The first neuron to spike will be random if there is some membrane potential noise. V (A) (B) x x ˆ x 10 1 0.1 0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400 1 D 0.5 V V 0 ˆ x V ˆ x (C) 1 0 1 2 0 0.625 25 25.625 (D) start of learning 1 V 50 200.625 400 400.625 1 2.4 O 1.78 ω 1.77 25 neuron$ 0 1 2 !me$ 3 4 25 1 5 V 400.625 !me$ (F) 25 1 2.35 1.05 1.049 400 25.625 !me$ (E) neuron$ 100.625 200 end of learning 1.4 1.35 ω 100 !me$ 1 V 1 O 50.625 0 1 2 !me$ 3 4 5 V !me$ !me$ Figure 1: Learning in a single neuron and a homogeneous network. (A) A single neuron represents an input signal x by producing an output x. (B) During learning, the single neuron output x (solid red ˆ ˆ line, top panel) converges towards the input x (blue). Similarly, for a homogeneous network the output x (dashed red line, top panel) converges towards x. Connectivity also converges towards optimal ˆ connectivity in both the single neuron case (solid black line, middle panel) and the homogeneous net2 2 work case (dashed black line, middle panel), as quantified by D = maxi,j ( Ωij − Ωopt / Ωopt ) ij ij at each point in time. Consequently, the membrane potential reset (bottom panel) converges towards the optimal reset (green line, bottom panel). Spikes are indicated by blue vertical marks, and are produced when the membrane potential reaches threshold (bottom panel). Here, we have rescaled time, as indicated, for clarity. (C) Our learning rule dictates that the autapse ω in our single neuron (bottom panel) changes in proportion to the membrane potential (top panel) and the firing rate (middle panel). (D) At the end of learning, the reset ω fluctuates weakly about the optimal value. (E) For a homogeneous network, neurons spike regularly at the start of learning, as shown in this raster plot. Membrane potentials of different neurons are weakly correlated. (F) At the end of learning, spiking is very irregular and membrane potentials become more synchronous. 5 ing synchronously. The population as a whole acts similarly to the single neuron in our previous example. Each neuron fires regularly, even if a different neuron fires in every integration cycle. The design of this optimal network requires the tuning of N (N − 1) synaptic parameters. How can an arbitrary network of integrate-and-fire neurons learn this optimum? As before, we address this question by using the optimal network as a target for learning. We start with an arbitrarily connected network of integrate-and-fire neurons: ˙ V = −V + ΓT c − Ωo, (16) where Ω is a matrix of connectivity weights, which includes the resets of the individual neurons. Assuming that learning occurs on a slow time scale, we can rewrite this equation as V = ΓT x − Ω¯ . o (17) Now, repeating the arguments from the single neuron derivation, we modify the loss function to obtain an online learning rule. Specifically, we set LV = V 2 /2, and calculate the gradient: ∂LV = ∂Ωij Vk k ∂Vk =− ∂Ωij Vk δki oj − ¯ k Vk Ωkl kl ∂ ol ¯ . ∂Ωij (18) We can simplify this equation considerably by observing that the contribution of the second summation is largely averaged out under a wide variety of realistic conditions4 . Therefore, it can be neglected, and we obtain the following local learning rule: ∂LV ˙ = V i oj . ¯ τ Ωij = − ∂Ωij (19) This is a Hebbian plasticity rule, whereby connectivity changes in proportion to the presynaptic firing rate oj and post-synaptic membrane potential Vi . We assume that the neural thresholds are set ¯ to a constant T and that the neural resets are set to their optimal values −T . In the previous section we demonstrated that these resets can be obtained by a Hebbian plasticity rule (Eqn. (10)). This learning rule minimises the difference between the read-out and the signal, by approaching the optimal recurrent connection strengths for the network (Fig. 1B). As in the single neuron case, learning does not stop, so the connection strengths fluctuate close to their optimal value. During learning, network activity becomes progressively more asynchronous as it progresses towards optimal connectivity (Fig. 1E, F). 3 Learning in the general case Now that we have developed the fundamental concepts underlying our learning rule, we can derive a learning rule for the more general case of a network of N arbitrarily connected leaky integrateand-fire neurons. Our goal is to understand how such networks can learn to optimally represent a ˙ J-dimensional signal x = (x1 , . . . , xJ ), using the read-out equation x = −x + Γo. We consider a network with the following membrane potential dynamics: ˙ V = −V + ΓT c − Ωo, (20) where c is a J-dimensional input. We assume that this input is related to the signal according to ˙ c = x + x. This assumption can be relaxed by treating the input as the control for an arbitrary linear dynamical system, in which case the signal represented by the network is the output of such a computation [11]. However, this further generalisation is beyond the scope of this work. As before, we need to identify the optimal recurrent connectivity so that we have a target for learning. Most generally, the optimal recurrent connectivity is Ωopt ≡ ΓT Γ + µI. The output kernels of the individual neurons, Γi , are given by the rows of Γ, and their spiking thresholds by Ti ≡ Γi 2 /2 + 4 From the definition of the membrane potential we can see that Vk ∼ O(1/N ) because Γ ∼ 1/N . Therefore, the size of the first term in Eqn. (18) is k Vk δki oj = Vi oj ∼ O(1/N ). Therefore, the second term can ¯ ¯ be ignored if kl Vk Ωkl ∂ ol /∂Ωij ¯ O(1/N ). This happens if Ωkl O(1/N 2 ) as at the start of learning. It also happens towards the end of learning if the terms {Ωkl ∂ ol /∂Ωij } are weakly correlated with zero mean, ¯ or if the membrane potentials {Vi } are weakly correlated with zero mean. 6 µ/2. With these connections and thresholds, we find that a network of integrate-and-fire neurons ˆ ¯ will produce spike trains in such a way that the loss function L = x − x 2 + µ o 2 is minimised, ˆ where the read-out is given by x = Γ¯ . We can show this by prescribing a greedy5 spike rule: o a spike is fired by neuron i whenever L(no spike in i) > L(spike in i) [11]. The resulting spike generation rule is Vi > Ti , (21) ˆ where Vi ≡ ΓT (x − x) − µ¯i is interpreted as the membrane potential. o i 5 Despite being greedy, this spiking rule can generate firing rates that are practically identical to the optimal solutions: we checked this numerically in a large ensemble of networks with randomly chosen kernels. (A) x1 … x … 1 1 (B) xJJ x 10 L 10 T T 10 4 6 8 1 Viii V D ˆˆ ˆˆ x11 xJJ x x F 0.5 0 0.4 … … 0.2 0 0 2000 4000 !me   (C) x V V 1 x 10 x 3 ˆ x 8 0 x 10 1 2 3 !me   4 5 4 0 1 4 0 1 8 V (F) Ρ(Δt)   E-­‐I  input   0.4 ˆ x 0 3 0 1 x 10 1.3 0.95 x 10 ˆ x 4 V (E) 1 x 0 end of learning 50 neuron neuron 50 !me   2 0 ˆ x 0 0.5 ISI  Δt     1 2 !me   4 5 4 1.5 1.32 3 2 0.1 Ρ(Δt)   x E-­‐I  input   (D) start of learning 0 2 !me   0 0 0.5 ISI  Δt   1 Figure 2: Learning in a heterogeneous network. (A) A network of neurons represents an input ˆ signal x by producing an output x. (B) During learning, the loss L decreases (top panel). The difference between the connection strengths and the optimal strengths also decreases (middle panel), as 2 2 quantified by the mean difference (solid line), given by D = Ω − Ωopt / Ωopt and the maxi2 2 mum difference (dashed line), given by maxi,j ( Ωij − Ωopt / Ωopt ). The mean population firing ij ij rate (solid line, bottom panel) also converges towards the optimal firing rate (dashed line, bottom panel). (C, E) Before learning, a raster plot of population spiking shows that neurons produce bursts ˆ of spikes (upper panel). The network output x (red line, middle panel) fails to represent x (blue line, middle panel). The excitatory input (red, bottom left panel) and inhibitory input (green, bottom left panel) to a randomly selected neuron is not tightly balanced. Furthermore, a histogram of interspike intervals shows that spiking activity is not Poisson, as indicated by the red line that represents a best-fit exponential distribution. (D, F) At the end of learning, spiking activity is irregular and ˆ Poisson-like, excitatory and inhibitory input is tightly balanced and x matches x. 7 How can we learn this optimal connection matrix? As before, we can derive a learning rule by minimising the cost function LV = V 2 /2. This leads to a Hebbian learning rule with the same form as before: ˙ τ Ωij = Vi oj . ¯ (22) Again, we assume that the neural resets are given by −Ti . Furthermore, in order for this learning rule to work, we must assume that the network input explores all possible directions in the J-dimensional input space (since the kernels Γi can point in any of these directions). The learning performance does not critically depend on how the input variable space is sampled as long as the exploration is extensive. In our simulations, we randomly sample the input c from a Gaussian white noise distribution at every time step for the entire duration of the learning. We find that this learning rule decreases the loss function L, thereby approaching optimal network connectivity and producing optimal firing rates for our linear decoder (Fig. 2B). In this example, we have chosen connectivity that is initially much too weak at the start of learning. Consequently, the initial network behaviour is similar to a collection of unconnected single neurons that ignore each other. Spike trains are not Poisson-like, firing rates are excessively large, excitatory and inhibitory ˆ input is unbalanced and the decoded variable x is highly unreliable (Fig. 2C, E). As a result of learning, the network becomes tightly balanced and the spike trains become asynchronous, irregular and Poisson-like with much lower rates (Fig. 2D, F). However, despite this apparent variability, the population representation is extremely precise, only limited by the the metabolic cost and the discrete nature of a spike. This learnt representation is far more precise than a rate code with independent Poisson spike trains [11]. In particular, shuffling the spike trains in response to identical inputs drastically degrades this precision. 4 Conclusions and Discussion In population coding, large trial-to-trial spike train variability is usually interpreted as noise [2]. We show here that a deterministic network of leaky integrate-and-fire neurons with a simple Hebbian plasticity rule can self-organise into a regime where information is represented far more precisely than in noisy rate codes, while appearing to have noisy Poisson-like spiking dynamics. Our learning rule (Eqn. (22)) has the basic properties of STDP. Specifically, a presynaptic spike occurring immediately before a post-synaptic spike will potentiate a synapse, because membrane potentials are positive immediately before a postsynaptic spike. Furthermore, a presynaptic spike occurring immediately after a post-synaptic spike will depress a synapse, because membrane potentials are always negative immediately after a postsynaptic spike. This is similar in spirit to the STDP rule proposed in [12], but different to classical STDP, which depends on post-synaptic spike times [9]. This learning rule can also be understood as a mechanism for generating a tight balance between excitatory and inhibitory input. We can see this by observing that membrane potentials after learning can be interpreted as representation errors (projected onto the read-out kernels). Therefore, learning acts to minimise the magnitude of membrane potentials. Excitatory and inhibitory input must be balanced if membrane potentials are small, so we can equate balance with optimal information representation. Previous work has shown that the balanced regime produces (quasi-)chaotic network dynamics, thereby accounting for much observed cortical spike train variability [13, 14, 4]. Moreover, the STDP rule has been known to produce a balanced regime [16, 17]. Additionally, recent theoretical studies have suggested that the balanced regime plays an integral role in network computation [15, 13]. In this work, we have connected these mechanisms and functions, to conclude that learning this balance is equivalent to the development of an optimal spike-based population code, and that this learning can be achieved using a simple Hebbian learning rule. Acknowledgements We are grateful for generous funding from the Emmy-Noether grant of the Deutsche Forschungsgemeinschaft (CKM) and the Chaire d’excellence of the Agence National de la Recherche (CKM, DB), as well as a James Mcdonnell Foundation Award (SD) and EU grants BACS FP6-IST-027140, BIND MECT-CT-20095-024831, and ERC FP7-PREDSPIKE (SD). 8 References [1] Tolhurst D, Movshon J, and Dean A (1982) The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res 23: 775–785. [2] Shadlen MN, Newsome WT (1998) The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci 18(10): 3870–3896. [3] Zohary E, Newsome WT (1994) Correlated neuronal discharge rate and its implication for psychophysical performance. Nature 370: 140–143. [4] Renart A, de la Rocha J, Bartho P, Hollender L, Parga N, Reyes A, & Harris, KD (2010) The asynchronous state in cortical circuits. Science 327, 587–590. [5] Ecker AS, Berens P, Keliris GA, Bethge M, Logothetis NK, Tolias AS (2010) Decorrelated neuronal firing in cortical microcircuits. Science 327: 584–587. [6] Okun M, Lampl I (2008) Instantaneous correlation of excitation and inhibition during ongoing and sensory-evoked activities. Nat Neurosci 11, 535–537. [7] Shu Y, Hasenstaub A, McCormick DA (2003) Turning on and off recurrent balanced cortical activity. Nature 423, 288–293. [8] Gentet LJ, Avermann M, Matyas F, Staiger JF, Petersen CCH (2010) Membrane potential dynamics of GABAergic neurons in the barrel cortex of behaving mice. Neuron 65: 422–435. [9] Caporale N, Dan Y (2008) Spike-timing-dependent plasticity: a Hebbian learning rule. Annu Rev Neurosci 31: 25–46. [10] Boerlin M, Deneve S (2011) Spike-based population coding and working memory. PLoS Comput Biol 7, e1001080. [11] Boerlin M, Machens CK, Deneve S (2012) Predictive coding of dynamic variables in balanced spiking networks. under review. [12] Clopath C, B¨ sing L, Vasilaki E, Gerstner W (2010) Connectivity reflects coding: a model of u voltage-based STDP with homeostasis. Nat Neurosci 13(3): 344–352. [13] van Vreeswijk C, Sompolinsky H (1998) Chaotic balanced state in a model of cortical circuits. Neural Comput 10(6): 1321–1371. [14] Brunel N (2000) Dynamics of sparsely connected networks of excitatory and inhibitory neurons. J Comput Neurosci 8, 183–208. [15] Vogels TP, Rajan K, Abbott LF (2005) Neural network dynamics. Annu Rev Neurosci 28: 357–376. [16] Vogels TP, Sprekeler H, Zenke F, Clopath C, Gerstner W. (2011) Inhibitory plasticity balances excitation and inhibition in sensory pathways and memory networks. Science 334(6062):1569– 73. [17] Song S, Miller KD, Abbott LF (2000) Competitive Hebbian learning through spike-timingdependent synaptic plasticity. Nat Neurosci 3(9): 919–926. 9

4 0.51883799 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

Author: Guillermo Canas, Lorenzo Rosasco

Abstract: We study the problem of estimating, in the sense of optimal transport metrics, a measure which is assumed supported on a manifold embedded in a Hilbert space. By establishing a precise connection between optimal transport metrics, optimal quantization, and learning theory, we derive new probabilistic bounds for the performance of a classic algorithm in unsupervised learning (k-means), when used to produce a probability measure derived from the data. In the course of the analysis, we arrive at new lower bounds, as well as probabilistic upper bounds on the convergence rate of empirical to population measures, which, unlike existing bounds, are applicable to a wide class of measures. 1 Introduction and Motivation In this paper we study the problem of learning from random samples a probability distribution supported on a manifold, when the learning error is measured using transportation metrics. The problem of learning a probability distribution is classic in statistics, and is typically analyzed for distributions in X = Rd that have a density with respect to the Lebesgue measure, with total variation, and L2 among the common distances used to measure closeness of two densities (see for instance [10, 32] and references therein.) The setting in which the data distribution is supported on a low dimensional manifold embedded in a high dimensional space has only been considered more recently. In particular, kernel density estimators on manifolds have been described in [36], and their pointwise consistency, as well as convergence rates, have been studied in [25, 23, 18]. A discussion on several topics related to statistics on a Riemannian manifold can be found in [26]. Interestingly, the problem of approximating measures with respect to transportation distances has deep connections with the fields of optimal quantization [14, 16], optimal transport [35] and, as we point out in this work, with unsupervised learning (see Sec. 4.) In fact, as described in the sequel, some of the most widely-used algorithms for unsupervised learning, such as k-means (but also others such as PCA and k-flats), can be shown to be performing exactly the task of estimating the data-generating measure in the sense of the 2-Wasserstein distance. This close relation between learning theory, and optimal transport and quantization seems novel and of interest in its own right. Indeed, in this work, techniques from the above three fields are used to derive the new probabilistic bounds described below. Our technical contribution can be summarized as follows: (a) we prove uniform lower bounds for the distance between a measure and estimates based on discrete sets (such as the empirical measure or measures derived from algorithms such as kmeans); (b) we provide new probabilistic bounds for the rate of convergence of empirical to population measures which, unlike existing probabilistic bounds, hold for a very large class of measures; 1 (c) we provide probabilistic bounds for the rate of convergence of measures derived from k-means to the data measure. The structure of the paper is described at the end of Section 2, where we discuss the exact formulation of the problem as well as related previous works. 2 Setup and Previous work Consider the problem of learning a probability measure ρ supported on a space M, from an i.i.d. sample Xn = (x1 , . . . , xn ) ∼ ρn of size n. We assume M to be a compact, smooth d-dimensional manifold of bounded curvature, with C 1 metric and volume measure λM , embedded in the unit ball of a separable Hilbert space X with inner product ·, · , induced norm · , and distance d (for d instance M = B2 (1) the unit ball in X = Rd .) Following [35, p. 94], let Pp (M) denote the Wasserstein space of order 1 ≤ p < ∞: Pp (M) := x p dρ(x) < ∞ ρ ∈ P (M) : M of probability measures P (M) supported on M, with finite p-th moment. The p-Wasserstein distance 1/p Wp (ρ, µ) = inf [E X − Y p ] : Law(X) = ρ, Law(Y ) = µ (1) X,Y where the random variables X and Y are distributed according to ρ and µ respectively, is the optimal expected cost of transporting points generated from ρ to those generated from µ, and is guaranteed to be finite in Pp (M) [35, p. 95]. The space Pp (M) with the Wp metric is itself a complete separable metric space [35]. We consider here the problem of learning probability measures ρ ∈ P2 (M), where the performance is measured by the distance W2 . There are many possible choices of distances between probability measures [13]. Among them, Wp metrizes weak convergence (see [35] theorem 6.9), that is, in Pp (M), a sequence (µi )i∈N of measures converges weakly to µ iff Wp (µi , µ) → 0 and their p-th order moments converge to that of µ. There are other distances, such as the L´ vy-Prokhorov, or the weak-* distance, that also metrize e weak convergence. However, as pointed out by Villani in his excellent monograph [35, p. 98], 1. “Wasserstein distances are rather strong, [...]a definite advantage over the weak-* distance”. 2. “It is not so difficult to combine information on convergence in Wasserstein distance with some smoothness bound, in order to get convergence in stronger distances.” Wasserstein distances have been used to study the mixing and convergence of Markov chains [22], as well as concentration of measure phenomena [20]. To this list we would add the important fact that existing and widely-used algorithms for unsupervised learning can be easily extended (see Sec. 4) to compute a measure ρ that minimizes the distance W2 (ˆn , ρ ) to the empirical measure ρ n ρn := ˆ 1 δx , n i=1 i a fact that will allow us to prove, in Sec. 5, bounds on the convergence of a measure induced by k-means to the population measure ρ. The most useful versions of Wasserstein distance are p = 1, 2, with p = 1 being the weaker of the two (by H¨ lder’s inequality, p ≤ q ⇒ Wp ≤ Wq .) In particular, “results in W2 distance are usually o stronger, and more difficult to establish than results in W1 distance” [35, p. 95]. A discussion of p = ∞ would take us out of topic, since its behavior is markedly different. 2.1 Closeness of Empirical and Population Measures By the strong law of large numbers, the empirical measure converges almost surely to the population measure: ρn → ρ in the sense of the weak topology [34]. Since weak convergence and convergence ˆ in Wp plus convergence of p-th moments are equivalent in Pp (M), this means that, in the Wp sense, the empirical measure ρn converges to ρ, as n → ∞. A fundamental question is therefore how fast ˆ the rate of convergence of ρn → ρ is. ˆ 2 2.1.1 Convergence in expectation The rate of convergence of ρn → ρ in expectation has been widely studied in the past, resultˆ ing in upper bounds of order EW2 (ρ, ρn ) = O(n−1/(d+2) ) [19, 8], and lower bounds of order ˆ EW2 (ρ, ρn ) = Ω(n−1/d ) [29] (both assuming that the absolutely continuous part of ρ is ρA = 0, ˆ with possibly better rates otherwise). More recently, an upper bound of order EWp (ρ, ρn ) = O(n−1/d ) has been proposed [2] by proving ˆ a bound for the Optimal Bipartite Matching (OBM) problem [1], and relating this problem to the expected distance EWp (ρ, ρn ). In particular, given two independent samples Xn , Yn , the OBM ˆ problem is that of finding a permutation σ that minimizes the matching cost n−1 xi −yσ(i) p [24, p ˆ ˆ ˆ 30]. It is not hard to show that the optimal matching cost is Wp (ˆXn , ρYn ) , where ρXn , ρYn are ρ the empirical measures associated to Xn , Yn . By Jensen’s inequality, the triangle inequality, and (a + b)p ≤ 2p−1 (ap + bp ), it holds EWp (ρ, ρn )p ≤ EWp (ˆXn , ρYn )p ≤ 2p−1 EWp (ρ, ρn )p , ˆ ρ ˆ ˆ and therefore a bound of order O(n−p/d ) for the OBM problem [2] implies a bound EWp (ρ, ρn ) = ˆ O(n−1/d ). The matching lower bound is only known for a special case: ρA constant over a bounded set of non-null measure [2] (e.g. ρA uniform.) Similar results, with matching lower bounds are found for W1 in [11]. 2.1.2 Convergence in probability Results for convergence in probability, one of the main results of this work, appear to be considerably harder to obtain. One fruitful avenue of analysis has been the use of so-called transportation, or Talagrand inequalities Tp , which can be used to prove concentration inequalities on Wp [20]. In particular, we say that ρ satisfies a Tp (C) inequality with C > 0 iff Wp (ρ, µ)2 ≤ CH(µ|ρ), ∀µ ∈ Pp (M), where H(·|·) is the relative entropy [20]. As shown in [6, 5], it is possible to obtain probabilistic upper bounds on Wp (ρ, ρn ), with p = 1, 2, if ρ is known to satisfy a Tp inequality ˆ of the same order, thereby reducing the problem of bounding Wp (ρ, ρn ) to that of obtaining a Tp ˆ inequality. Note that, by Jensen’s inequality, and as expected from the behavior of Wp , the inequality T2 is stronger than T1 [20]. While it has been shown that ρ satisfies a T1 inequality iff it has a finite square-exponential moment 2 (E[eα x ] finite for some α > 0) [4, 7], no such general conditions have been found for T2 . As an example, consider that, if M is compact with diameter D then, by theorem 6.15 of [35], and the celebrated Csisz´ r-Kullback-Pinsker inequality [27], for all ρ, µ ∈ Pp (M), it is a Wp (ρ, µ)2p ≤ (2D)2p ρ − µ where · does not. TV 2 TV ≤ 22p−1 D2p H(µ|ρ), is the total variation norm. Clearly, this implies a Tp=1 inequality, but for p ≥ 2 it The T2 inequality has been shown by Talagrand to be satisfied by the Gaussian distribution [31], and then slightly more generally by strictly log-concave measures (see [20, p. 123], and [3].) However, as noted in [6], “contrary to the T1 case, there is no hope to obtain T2 inequalities from just integrability or decay estimates.” Structure of this paper. In this work we obtain bounds in probability (learning rates) for the problem of learning a probability measure in the sense of W2 . We begin by establishing (lower) bounds for the convergence of empirical to population measures, which serve to set up the problem and introduce the connection between quantization and measure learning (sec. 3.) We then describe how existing unsupervised learning algorithms that compute a set (k-means, k-flats, PCA,. . . ) can be easily extended to produce a measure (sec. 4.) Due to its simplicity and widespread use, we focus here on k-means. Since the two measure estimates that we consider are the empirical measure, and the measure induced by k-means, we next set out to prove upper bounds on their convergence to the data-generating measure (sec. 5.) We arrive at these bounds by means of intermediate measures, which are related to the problem of optimal quantization. The bounds apply in a very broad setting (unlike existing bounds based on transportation inequalities, they are not restricted to log-concave measures [20, 3].) 3 3 Learning probability measures, optimal transport and quantization We address the problem of learning a probability measure ρ when the only observations we have at our disposal are n i.i.d. samples Xn = (x1 , . . . , xn ). We begin by establishing some notation and useful intermediate results. Given a closed set S ⊆ X , let {Vq : q ∈ S} be a Borel Voronoi partition of X composed of sets Vq closest to each q ∈ S, that is, such that each Vq ⊆ {x ∈ X : x − q = minr∈S x − r } is measurable (see for instance [15].) Consider the projection function πS : X → S mapping each x ∈ Vq to q. By virtue of {Vq }q∈S being a Borel Voronoi partition, the map πS is measurable [15], and it is d (x, πS (x)) = minq∈S x − q for all x ∈ X . For any ρ ∈ Pp (M), let πS ρ be the pushforward, or image measure of ρ under the mapping πS , −1 which is defined to be (πS ρ)(A) := ρ(πS (A)) for all Borel measurable sets A. From its definition, it is clear that πS ρ is supported on S. We now establish a connection between the expected distance to a set S, and the distance between ρ and the set’s induced pushforward measure. Notice that, for discrete sets S, the expected Lp distance to S is exactly the expected quantization error Ep,ρ (S) := Ex∼ρ d(x, S)p = Ex∼ρ x − πS (x) p incurred when encoding points x drawn from ρ by their closest point πS (x) in S [14]. This close connection between optimal quantization and Wasserstein distance has been pointed out in the past in the statistics [28], optimal quantization [14, p. 33], and approximation theory [16] literatures. The following two lemmas are key tools in the reminder of the paper. The first highlights the close link between quantization and optimal transport. Lemma 3.1. For closed S ⊆ X , ρ ∈ Pp (M), 1 ≤ p < ∞, it holds Ex∼ρ d(x, S)p = Wp (ρ, πS ρ)p . Note that the key element in the above lemma is that the two measures in the expression Wp (ρ, πS ρ) must match. When there is a mismatch, the distance can only increase. That is, Wp (ρ, πS µ) ≥ Wp (ρ, πS ρ) for all µ ∈ Pp (M). In fact, the following lemma shows that, among all the measures with support in S, πS ρ is closest to ρ. Lemma 3.2. For closed S ⊆ X , and all µ ∈ Pp (M) with supp(µ) ⊆ S, 1 ≤ p < ∞, it holds Wp (ρ, µ) ≥ Wp (ρ, πS ρ). When combined, lemmas 3.1 and 3.2 indicate that the behavior of the measure learning problem is limited by the performance of the optimal quantization problem. For instance, Wp (ρ, ρn ) can only ˆ be, in the best-case, as low as the optimal quantization cost with codebook of size n. The following section makes this claim precise. 3.1 Lower bounds Consider the situation depicted in fig. 1, in which a sample X4 = {x1 , x2 , x3 , x4 } is drawn from a distribution ρ which we assume here to be absolutely continuous on its support. As shown, the projection map πX4 sends points x to their closest point in X4 . The resulting Voronoi decomposition of supp(ρ) is drawn in shades of blue. By lemma 5.2 of [9], the pairwise intersections of Voronoi regions have null ambient measure, and since ρ is absolutely continuous, the pushforward measure 4 can be written in this case as πX4 ρ = j=1 ρ(Vxj )δxj , where Vxj is the Voronoi region of xj . Note that, even for finite sets S, this particular decomposition is not always possible if the {Vq }q∈S form a Borel Voronoi tiling, instead of a Borel Voronoi partition. If, for instance, ρ has an atom falling on two Voronoi regions in a tiling, then both regions would count the atom as theirs, and double-counting would imply q ρ(Vq ) > 1. The technicalities required to correctly define a Borel Voronoi partition are such that, in general, it is simpler to write πS ρ, even though (if S is discrete) this measure can clearly be written as a sum of deltas with appropriate masses. By lemma 3.1, the distance Wp (ρ, πX4 ρ)p is the (expected) quantization cost of ρ when using X4 as codebook. Clearly, this cost can never be lower than the optimal quantization cost of size 4. This reasoning leads to the following lower bound between empirical and population measures. 4 Theorem 3.3. For ρ ∈ Pp (M) with absolutely continuous part ρA = 0, and 1 ≤ p < ∞, it holds Wp (ρ, ρn ) = Ω(n−1/d ) uniformly over ρn , where the constants depend on d and ρA only. ˆ ˆ Proof: Let Vn,p (ρ) := inf S⊂M,|S|=n Ex∼ρ d(x, S)p be the optimal quantization cost of ρ of order p with n centers. Since ρA = 0, and since ρ has a finite (p + δ)-th order moment, for some δ > 0 (since it is supported on the unit ball), then it is Vn,p (ρ) = Θ(n−p/d ), with constants depending on d and ρA (see [14, p. 78] and [16].) Since supp(ˆn ) = Xn , it follows that ρ Wp (ρ, ρn )p ˆ ≥ lemma 3.2 Wp (ρ, πXn ρ)p = lemma 3.1 Ex∼ρ d(x, Xn )p ≥ Vn,p (ρ) = Θ(n−p/d ) Note that the bound of theorem 3.3 holds for ρn derived from any sample Xn , and is therefore ˆ stronger than the existing lower bounds on the convergence rates of EWp (ρ, ρn ) → 0. In particular, ˆ it trivially induces the known lower bound Ω(n−1/d ) on the rate of convergence in expectation. 4 Unsupervised learning algorithms for learning a probability measure As described in [21], several of the most widely used unsupervised learning algorithms can be ˆ interpreted to take as input a sample Xn and output a set Sk , where k is typically a free parameter of the algorithm, such as the number of means in k-means1 , the dimension of affine spaces in PCA, n ˆ etc. Performance is measured by the empirical quantity n−1 i=1 d(xi , Sk )2 , which is minimized among all sets in some class (e.g. sets of size k, affine spaces of dimension k,. . . ) This formulation is general enough to encompass k-means and PCA, but also k-flats, non-negative matrix factorization, and sparse coding (see [21] and references therein.) Using the discussion of Sec. 3, we can establish a clear connection between unsupervised learning and the problem of learning probability measures with respect to W2 . Consider as a running example the k-means problem, though the argument is general. Given an input Xn , the k-means problem is ˆ ˆ to find a set |Sk | = k minimizing its average distance from points in Xn . By associating to Sk the pushforward measure πSk ρn , we find that ˆ ˆ 1 n n ˆ ˆ d(xi , Sk )2 = Ex∼ρn d(x, Sk )2 ˆ i=1 = lemma 3.1 W2 (ˆn , πSk ρn )2 . ρ ˆ ˆ (2) Since k-means minimizes equation 2, it also finds the measure that is closest to ρn , among those ˆ with support of size k. This connection between k-means and W2 measure approximation was, to the best of the authors’ knowledge, first suggested by Pollard [28] though, as mentioned earlier, the argument carries over to many other unsupervised learning algorithms. Unsupervised measure learning algorithms. We briefly clarify the steps involved in using an existing unsupervised learning algorithm for probability measure learning. Let Uk be a parametrized algorithm (e.g. k-means) that takes a sample Xn and outputs a set Uk (Xn ). The measure learning algorithm Ak : Mn → Pp (M) corresponding to Uk is defined as follows: ˆ 1. Ak takes a sample Xn and outputs the measure πSk ρn , supported on Sk = Uk (Xn ); ˆ ˆ 2. since ρn is discrete, then so must πSk ρn be, and thus Ak (Xn ) = ˆ ˆ ˆ 1 n n ˆ i=1 δπSk (xi ) ; 3. in practice, we can simply store an n-vector πSk (x1 ), . . . , πSk (xn ) , from which Ak (Xn ) ˆ ˆ can be reconstructed by placing atoms of mass 1/n at each point. In the case that Uk is the k-means algorithm, only k points and k masses need to be stored. Note that any algorithm A that attempts to output a measure A (Xn ) close to ρn can be cast in the ˆ above framework. Indeed, if S is the support of A (Xn ) then, by lemma 3.2, πS ρn is the measure ˆ closest to ρn with support in S . This effectively reduces the problem of learning a measure to that of ˆ 1 In a slight abuse of notation, we refer to the k-means algorithm here as an ideal algorithm that solves the k-means problem, even though in practice an approximation algorithm may be used. 5 finding a set, and is akin to how the fact that every optimal quantizer is a nearest-neighbor quantizer (see [15], [12, p. 350], and [14, p. 37–38]) reduces the problem of finding an optimal quantizer to that of finding an optimal quantizing set. Clearly, the minimum of equation 2 over sets of size k (the output of k-means) is monotonically ˆ ˆ non-increasing with k. In particular, since Sn = Xn and πSn ρn = ρn , it is Ex∼ρn d(x, Sn )2 = ˆ ˆ ˆ ˆ 2 W2 (ˆn , πSn ρn ) = 0. That is, we can always make the learned measure arbitrarily close to ρn ρ ˆ ˆ ˆ by increasing k. However, as pointed out in Sec. 2, the problem of measure learning is concerned with minimizing the 2-Wasserstein distance W2 (ρ, πSk ρn ) to the data-generating measure. The ˆ ˆ actual performance of k-means is thus not necessarily guaranteed to behave in the same way as the empirical one, and the question of characterizing its behavior as a function of k and n naturally arises. ˆ Finally, we note that, while it is Ex∼ρn d(x, Sk )2 = W2 (ˆn , πSk ρn )2 (the empirical performances ρ ˆ ˆ ˆ are the same in the optimal quantization, and measure learning problem formulations), the actual performances satisfy ˆ Ex∼ρ d(x, Sk )2 = W2 (ρ, π ˆ ρ)2 ≤ W2 (ρ, π ˆ ρn )2 , 1 ≤ k ≤ n. ˆ lemma 3.1 Sk lemma 3.2 Sk Consequently, with the identification between sets S and measures πS ρn , the measure learning ˆ problem is, in general, harder than the set-approximation problem (for example, if M = Rd and ρ is absolutely continuous over a set of non-null volume, it is not hard to show that the inequality is ˆ almost surely strict: Ex∼ρ d(x, Sk )2 < W2 (ρ, πSk ρn )2 for 1 < k < n.) ˆ ˆ In the remainder, we characterize the performance of k-means on the measure learning problem, for varying k, n. Although other unsupervised learning algorithms could have been chosen as basis for our analysis, k-means is one of the oldest and most widely used, and the one for which the deep connection between optimal quantization and measure approximation is most clearly manifested. Note that, by setting k = n, our analysis includes the problem of characterizing the behavior of the distance W2 (ρ, ρn ) between empirical and population measures which, as indicated in Sec. 2.1, ˆ is a fundamental question in statistics (i.e. the speed of convergence of empirical to population measures.) 5 Learning rates In order to analyze the performance of k-means as a measure learning algorithm, and the convergence of empirical to population measures, we propose the decomposition shown in fig. 2. The diagram includes all the measures considered in the paper, and shows the two decompositions used to prove upper bounds. The upper arrow (green), illustrates the decomposition used to bound the distance W2 (ρ, ρn ). This decomposition uses the measures πSk ρ and πSk ρn as intermediates to arrive ˆ ˆ at ρn , where Sk is a k-point optimal quantizer of ρ, that is, a set Sk minimizing Ex∼ρ d(x, S)2 over ˆ all sets of size |S| = k. The lower arrow (blue) corresponds to the decomposition of W2 (ρ, πSk ρn ) ˆ ˆ (the performance of k-means), whereas the labelled black arrows correspond to individual terms in the bounds. We begin with the (slightly) simpler of the two results. 5.1 Convergence rates for the empirical to population measures Let Sk be the optimal k-point quantizer of ρ of order two [14, p. 31]. By the triangle inequality and the identity (a + b + c)2 ≤ 3(a2 + b2 + c2 ), it follows that W2 (ρ, ρn )2 ≤ 3 W2 (ρ, πSk ρ)2 + W2 (πSk ρ, πSk ρn )2 + W2 (πSk ρn , ρn )2 . ˆ ˆ ˆ ˆ (3) This is the decomposition depicted in the upper arrow of fig. 2. By lemma 3.1, the first term in the sum of equation 3 is the optimal k-point quantization error of ρ over a d-manifold M which, using recent techniques from [16] (see also [17, p. 491]), is shown in the proof of theorem 5.1 (part a) to be of order Θ(k −2/d ). The remaining terms, b) and c), are slightly more technical and are bounded in the proof of theorem 5.1. Since equation 3 holds for all 1 ≤ k ≤ n, the best bound on W2 (ρ, ρn ) can be obtained by optimizˆ ing the right-hand side over all possible values of k, resulting in the following probabilistic bound for the rate of convergence of the empirical to population measures. 6 x2 x W2 (ρ, ρn ) ˆ supp ρ x1 π{x1 ,x2 ,x3 ,x4 } ρ a) x3 πSk ρ b) πSk ρn ˆ c) d) ρn ˆ πSk ρn ˆ ˆ W2 (ρ, πSk ρn ) ˆ ˆ x4 Figure 1: A sample {x1 , x2 , x3 , x4 } is drawn from a distribution ρ with support in supp ρ. The projection map π{x1 ,x2 ,x3 ,x4 } sends points x to their closest one in the sample. The induced Voronoi tiling is shown in shades of blue. Figure 2: The measures considered in this paper are linked by arrows for which upper bounds for their distance are derived. Bounds for the quantities of interest W2 (ρ, ρn )2 , and W2 (ρ, πSk ρn )2 , ˆ ˆ ˆ are decomposed by following the top and bottom colored arrows. Theorem 5.1. Given ρ ∈ Pp (M) with absolutely continuous part ρA = 0, sufficiently large n, and τ > 0, it holds W2 (ρ, ρn ) ≤ C · m(ρA ) · n−1/(2d+4) · τ, ˆ where m(ρA ) := 5.2 M 2 with probability 1 − e−τ . ρA (x)d/(d+2) dλM (x), and C depends only on d. Learning rates of k-means The key element in the proof of theorem 5.1 is that the distance between population and empirical measures can be bounded by choosing an intermediate optimal quantizing measure of an appropriate size k. In the analysis, the best bounds are obtained for k smaller than n. If the output of k-means is close to an optimal quantizer (for instance if sufficient data is available), then we would similarly expect that the best bounds for k-means correspond to a choice of k < n. The decomposition of the bottom (blue) arrow in figure 2 leads to the following bound in probability. Theorem 5.2. Given ρ ∈ Pp (M) with absolutely continuous part ρA = 0, and τ > 0, then for all sufficiently large n, and letting k = C · m(ρA ) · nd/(2d+4) , it holds W2 (ρ, πSk ρn ) ≤ C · m(ρA ) · n−1/(2d+4) · τ, ˆ ˆ where m(ρA ) := M 2 with probability 1 − e−τ . ρA (x)d/(d+2) dλM (x), and C depends only on d. Note that the upper bounds in theorem 5.1 and 5.2 are exactly the same. Although this may appear ˆ surprising, it stems from the following fact. Since S = Sk is a minimizer of W2 (πS ρn , ρn )2 , the ˆ ˆ bound d) of figure 2 satisfies: W2 (πSk ρn , ρn )2 ≤ W2 (πSk ρn , ρn )2 ˆ ˆ ˆ ˆ ˆ and therefore (by the definition of c), the term d) is of the same order as c). It follows then that adding term d) to the bound only affects the constants, but otherwise leaves it unchanged. Since d) is the term that takes the output measure of k-means to the empirical measure, this implies that the rate of convergence of k-means (for suitably chosen k) cannot be worse than that of ρn → ρ. ˆ Conversely, bounds for ρn → ρ are obtained from best rates of convergence of optimal quantizers, ˆ whose convergence to ρ cannot be slower than that of k-means (since the quantizers that k-means produces are suboptimal.) 7 Since the bounds obtained for the convergence of ρn → ρ are the same as those for k-means with ˆ k of order k = Θ(nd/(2d+4) ), this suggests that estimates of ρ that are as accurate as those derived from an n point-mass measure ρn can be derived from k point-mass measures with k ˆ n. Finally, we note that the introduced bounds are currently limited by the statistical bound sup |W2 (πS ρn , ρn )2 − W2 (πS ρ, ρ)2 | ˆ ˆ |S|=k = sup |Ex∼ρn d(x, S)2 − Ex∼ρ d(x, S)2 | ˆ lemma 3.1 |S|=k (4) (see for instance [21]), for which non-matching lower bounds are known. This means that, if better upper bounds can be obtained for equation 4, then both bounds in theorems 5.1 and 5.2 would automatically improve (would become closer to the lower bound.) References [1] M. Ajtai, J. Komls, and G. Tusndy. On optimal matchings. Combinatorica, 4:259–264, 1984. [2] Franck Barthe and Charles Bordenave. Combinatorial optimization over two random point sets. Technical Report arXiv:1103.2734, Mar 2011. [3] Gordon Blower. The Gaussian isoperimetric inequality and transportation. Positivity, 7:203–224, 2003. [4] S. G. Bobkov and F. G¨ tze. Exponential integrability and transportation cost related to logarithmic o Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28, April 1999. [5] Emmanuel Boissard. Simple bounds for the convergence of empirical and occupation measures in 1wasserstein distance. Electron. J. Probab., 16(83):2296–2333, 2011. [6] F. Bolley, A. Guillin, and C. Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137(3):541–593, 2007. [7] F. Bolley and C. Villani. Weighted Csisz´ r-Kullback-Pinsker inequalities and applications to transportaa tion inequalities. Annales de la Faculte des Sciences de Toulouse, 14(3):331–352, 2005. [8] Claire Caillerie, Fr´ d´ ric Chazal, J´ rˆ me Dedecker, and Bertrand Michel. Deconvolution for the Wassere e eo stein metric and geometric inference. Rapport de recherche RR-7678, INRIA, July 2011. [9] Kenneth L. Clarkson. Building triangulations using -nets. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC ’06, pages 326–335, New York, NY, USA, 2006. ACM. [10] Luc Devroye and G´ bor Lugosi. Combinatorial methods in density estimation. Springer Series in Statisa tics. Springer-Verlag, New York, 2001. [11] V. Dobri and J. Yukich. Asymptotics for transportation cost in high dimensions. Journal of Theoretical Probability, 8:97–118, 1995. [12] A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, 1992. [13] Alison L. Gibbs and Francis E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002. [14] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. SpringerVerlag New York, Inc., Secaucus, NJ, USA, 2000. [15] Siegfried Graf, Harald Luschgy, and Gilles Page`. Distortion mismatch in the quantization of probability s measures. Esaim: Probability and Statistics, 12:127–153, 2008. [16] Peter M. Gruber. Optimum quantization and its applications. Adv. Math, 186:2004, 2002. [17] P.M. Gruber. Convex and discrete geometry. Grundlehren der mathematischen Wissenschaften. Springer, 2007. [18] Guillermo Henry and Daniela Rodriguez. Kernel density estimation on riemannian manifolds: Asymptotic results. J. Math. Imaging Vis., 34(3):235–239, July 2009. [19] Joseph Horowitz and Rajeeva L. Karandikar. Mean rates of convergence of empirical measures in the Wasserstein metric. J. Comput. Appl. Math., 55(3):261–273, November 1994. [20] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001. [21] A. Maurer and M. Pontil. K–dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839 –5846, nov. 2010. [22] Yann Ollivier. Ricci curvature of markov chains on metric spaces. J. Funct. Anal., 256(3):810–864, 2009. 8 [23] Arkadas Ozakin and Alexander Gray. Submanifold density estimation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1375–1382. 2009. [24] C. Papadimitriou. The probabilistic analysis of matching heuristics. In Proc. of the 15th Allerton Conf. on Communication, Control and Computing, pages 368–378, 1978. [25] Bruno Pelletier. Kernel density estimation on Riemannian manifolds. Statist. Probab. Lett., 73(3):297– 304, 2005. [26] Xavier Pennec. Intrinsic statistics on riemannian manifolds: Basic tools for geometric measurements. J. Math. Imaging Vis., 25(1):127–154, July 2006. [27] M. S. Pinsker. Information and information stability of random variables and processes. San Francisco: Holden-Day, 1964. [28] David Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199–204, 1982. [29] S.T. Rachev. Probability metrics and the stability of stochastic models. Wiley series in probability and mathematical statistics: Applied probability and statistics. Wiley, 1991. [30] J.M. Steele. Probability Theory and Combinatorial Optimization. Cbms-Nsf Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1997. [31] M. Talagrand. Transportation cost for Gaussian and other product measures. Geometric And Functional Analysis, 6:587–600, 1996. [32] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, Translated by Vladimir Zaiats. [33] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, 1996. [34] V. S. Varadarajan. On the convergence of sample probability distributions. Sankhy¯ : The Indian Journal a of Statistics, 19(1/2):23–26, Feb. 1958. [35] C. Villani. Optimal Transport: Old and New. Grundlehren der Mathematischen Wissenschaften. Springer, 2009. [36] P. Vincent and Y. Bengio. Manifold Parzen Windows. In Advances in Neural Information Processing Systems 22, pages 849–856. 2003. 9

5 0.50383288 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into lowdimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that proposed methods are very effective in subspace clustering and identifying outliers. 1

6 0.4423382 201 nips-2012-Localizing 3D cuboids in single-view images

7 0.43181735 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity

8 0.4310908 322 nips-2012-Spiking and saturating dendrites differentially expand single neuron computation capacity

9 0.43000543 210 nips-2012-Memorability of Image Regions

10 0.42921504 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

11 0.4289771 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

12 0.4266555 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

13 0.42599604 303 nips-2012-Searching for objects driven by context

14 0.42439976 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

15 0.4223288 193 nips-2012-Learning to Align from Scratch

16 0.4220283 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

17 0.4213416 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

18 0.42110509 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

19 0.42038077 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

20 0.41918287 256 nips-2012-On the connections between saliency and tracking