nips nips2009 nips2009-188 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samuel Gershman, Ed Vul, Joshua B. Tenenbaum
Abstract: While many perceptual and cognitive phenomena are well described in terms of Bayesian inference, the necessary computations are intractable at the scale of real-world tasks, and it remains unclear how the human mind approximates Bayesian computations algorithmically. We explore the proposal that for some tasks, humans use a form of Markov Chain Monte Carlo to approximate the posterior distribution over hidden variables. As a case study, we show how several phenomena of perceptual multistability can be explained as MCMC inference in simple graphical models for low-level vision.
Reference: text
sentIndex sentText sentNum sentScore
1 While many perceptual and cognitive phenomena are well described in terms of Bayesian inference, the necessary computations are intractable at the scale of real-world tasks, and it remains unclear how the human mind approximates Bayesian computations algorithmically. [sent-5, score-0.361]
2 We explore the proposal that for some tasks, humans use a form of Markov Chain Monte Carlo to approximate the posterior distribution over hidden variables. [sent-6, score-0.221]
3 As a case study, we show how several phenomena of perceptual multistability can be explained as MCMC inference in simple graphical models for low-level vision. [sent-7, score-0.675]
4 1 Introduction. People appear to make rational statistical inferences from noisy, uncertain input in a wide variety of perceptual and cognitive domains [1, 9]. [sent-8, score-0.423]
5 A variety of psychological phenomena have natural interpretations in terms of Monte Carlo methods, such as resource limitations [4], stochastic responding [6, 23] and order effects [21, 14]. [sent-16, score-0.143]
6 Many problems in perception and cognition, however, require inference in high-dimensional models with sparse and noisy observations, where the correct global interpretation can only be achieved by propagating constraints from the ambiguous local information across the model. [sent-18, score-0.238]
7 Our goal in this paper is to explore the prospects for rational process models of perceptual inference based on MCMC. [sent-20, score-0.411]
8 MCMC algorithms are quite flexible, suitable for a wide range of approximate inference problems that arise in cognition, but with a particularly long history of application in visual inference problems ([8] and many subsequent papers). [sent-22, score-0.225]
9 Here we show that the characteristic dynamics of MCMC inference in high-dimensional, sparsely coupled spatial models correspond to several well-known phenomena in visual perception, specifically the dynamics of multistable percepts. [sent-24, score-0.653]
10 Perceptual multistability [13] has long been of interest both phenomenologically and theoretically for models of perception as Bayesian inference [7, 20, 22, 10]. [sent-25, score-0.459]
11 The classic example of perceptual multistability is the Necker cube, a 2D line drawing of a cube perceived to alternate between two different depth configurations (Figure 1A). [sent-26, score-0.62]
12 Bayesian modelers [7, 20, 22, 10] have interpreted these multistability phenomena as reflections of the shape of the posterior distribution arising from ambiguous observations, images that could have plausibly been generated by two or more distinct scenes. [sent-30, score-0.483]
13 For the Necker cube, two plausible depth configurations have indistinguishable 2D projections; with binocular rivalry, two mutually exclusive visual inputs have equal perceptual fidelity. [sent-31, score-0.576]
14 Under these conditions, the posterior over scene interpretations is bimodal, and rivalry is thought to reflect periodic switching between the modes. [sent-32, score-0.631]
15 Exactly how this “mode-switching” relates to the mechanisms by which the brain implements Bayesian perceptual inference is less clear, however. [sent-33, score-0.333]
16 Here we explore the hypothesis that the dynamics of multistability can be understood in terms of the output of an MCMC algorithm, drawing posterior samples in spatially structured probabilistic models for image interpretation. [sent-34, score-0.562]
17 Our proposal is most closely related to the work of Sundareswara and Schrater [20, 22], who suggested that mode-switching in Necker cube-type images reflects a rational sampling-based algorithm for approximate Bayesian inference and decision making. [sent-38, score-0.262]
18 As in most applications of Bayesian inference in machine vision [8, 16], we do not assume that the visual system has access to the full posterior distribution over scene interpretations. Figure 1: (A) Necker cube. [sent-41, score-0.309]
19 (C) Markov random field image model with lattice and (D) ring topologies. [sent-43, score-0.136]
20 Shaded nodes correspond to observed variables; unshaded nodes correspond to hidden variables. [sent-44, score-0.162]
21 The visual system might be able to evaluate only relative probabilities of two similar hypotheses (as in Metropolis-Hastings), or to compute local conditional posteriors of one scene variable conditioned on its neighbors (as in Gibbs sampling). [sent-46, score-0.176]
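To make the second option concrete, here is a minimal sketch (Python with numpy) of a single Gibbs-style update that resamples one hidden node from its local conditional, using only its neighbors and its own observation. The binary +/-1 field, the coupling strength J, and the evidence weight ell are illustrative assumptions of this sketch, not quantities specified in the paper.

```python
import numpy as np

def gibbs_update_node(z, x, i, neighbors, J=1.0, ell=1.0):
    """Resample hidden node i from P(z_i | z_neighbors, x_i).

    z, x      : arrays of hidden (+1/-1) and observed values
    neighbors : dict mapping node index -> list of neighboring node indices
    J         : pairwise coupling strength (prior smoothness)
    ell       : weight of the local evidence term (likelihood strength)
    """
    log_p = {}
    for cand in (+1, -1):
        coupling = J * cand * sum(z[j] for j in neighbors[i])  # agreement with neighbors
        evidence = ell * cand * x[i]                           # agreement with observation
        log_p[cand] = coupling + evidence
    # Only the two local posteriors are needed; no global normalization is required.
    p_plus = 1.0 / (1.0 + np.exp(log_p[-1] - log_p[+1]))
    z[i] = +1 if np.random.rand() < p_plus else -1
    return z
```

The point of the sketch is that each update touches only a node's Markov blanket, the kind of local computation the sentence above attributes to the visual system.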
22 We also do not make extra assumptions about weighting samples based on memory decay, or require that conscious perceptual decisions be based on a memory for samples; consciousness has access to only the current state of the Markov chain, reflecting the observer’s current brain state. [sent-47, score-0.287]
23 Here we show that several characteristic phenomena of multistability derive naturally from applying standard MCMC inference to Markov random fields (MRFs) – high dimensional, loosely coupled graphical models with spatial structure characteristic of many low-level and mid-level vision problems. [sent-48, score-0.58]
24 2 Markov random field image model. Our starting point is a simple and schematic model of vision problems embodying the idea that images are generated by a set of hidden variables with local dependencies. [sent-51, score-0.155]
25 Different graph topologies (e.g., lattice or ring) can be used to capture the structure of different visual objects (Figure 1C,D). [sent-57, score-0.134]
26 We construct the likelihood potential R(xi | zi) to express the ambiguity of the image by making it multimodal: several different hidden causes are equally likely to have generated the image. [sent-61, score-0.135]
27 The computational problem for vision (as we are framing it) is to infer the hidden causes of an observed image. [sent-63, score-0.15]
28 Given an observed image x, the posterior distribution over hidden causes z is P(z|x) = P(x|z)P(z) / P(x). (4) [sent-64, score-0.21]
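As a hedged illustration of the kind of model being described, the sketch below (Python/numpy; all parameter names and values are illustrative, not taken from the paper) writes down an unnormalized log-posterior for a binary hidden field on a 4-connected lattice, with a deliberately weak evidence term so that the observed image is ambiguous between the two global interpretations z and -z.

```python
import itertools
import numpy as np

def lattice_edges(n):
    """Edges of an n x n 4-connected lattice; nodes are indexed 0 .. n*n - 1."""
    edges = []
    for r, c in itertools.product(range(n), range(n)):
        i = r * n + c
        if c + 1 < n:
            edges.append((i, i + 1))      # right neighbor
        if r + 1 < n:
            edges.append((i, i + n))      # bottom neighbor
    return edges

def log_posterior(z, x, edges, J=0.5, ell=0.1):
    """Unnormalized log P(z | x) for a binary field z in {-1, +1}^N.

    The coupling term rewards neighboring nodes that agree (the prior); the
    evidence term is weak (small ell), so the posterior has two near-equal
    modes, mirroring an ambiguous stimulus.
    """
    coupling = J * sum(z[i] * z[j] for i, j in edges)
    evidence = ell * float(np.dot(z, x))
    return coupling + evidence
```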
29 3 Markov chain Monte Carlo. The basic idea behind Monte Carlo methods is to approximate a distribution with a set of samples drawn from that distribution. [sent-69, score-0.198]
30 MCMC methods address this problem by drawing samples from a Markov chain that converges to the posterior distribution [16]. [sent-71, score-0.157]
31 In the proposal stage, a new state z′ is proposed by generating a random sample from a proposal density Q(z′; z(l)) that depends on the current state. [sent-76, score-0.152]
32 In the acceptance stage, this proposal is accepted with probability P(z(l+1) = z′ | z(l)) = min{1, P(z′|x) / P(z(l)|x)}, (5) where we have assumed for simplicity that the proposal is symmetric: Q(z′; z) = Q(z; z′). [sent-77, score-0.152]
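Below is a minimal single-site Metropolis sampler for the sketch above, implementing the accept rule in Equation 5 with a symmetric proposal (flip one randomly chosen node). It reuses log_posterior and lattice_edges from the previous block; step counts and parameter values are illustrative.

```python
import numpy as np

def metropolis_step(z, x, edges, rng, J=0.5, ell=0.1):
    """One Metropolis update: propose flipping a random node, accept per Eq. 5."""
    i = rng.integers(len(z))
    z_prop = z.copy()
    z_prop[i] *= -1                                   # symmetric proposal
    log_ratio = (log_posterior(z_prop, x, edges, J, ell)
                 - log_posterior(z, x, edges, J, ell))
    if np.log(rng.random()) < min(0.0, log_ratio):    # accept with prob min(1, ratio)
        return z_prop
    return z

def run_chain(x, edges, n_steps=200000, seed=0, J=0.5, ell=0.1):
    """Run the chain and record the mean field state at every step."""
    rng = np.random.default_rng(seed)
    z = rng.choice([-1, 1], size=len(x))
    trace = np.empty(n_steps)
    for t in range(n_steps):
        z = metropolis_step(z, x, edges, rng, J, ell)
        trace[t] = z.mean()
    return trace
```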
33 4 Results. We now show how the Metropolis algorithm applied to the MRF image model gives rise to a number of phenomena in binocular rivalry experiments. [sent-79, score-0.781]
34 2 to compensate for the fewer neighbors around each node as compared to the lattice topology. [sent-84, score-0.185]
35 4.1 Distribution of dominance durations. One of the most robust findings in the literature on perceptual multistability is that switching times in binocular rivalry between different stable percepts tend to follow a Gamma-like distribution. [sent-91, score-1.786]
36 In other words, the “dominance” durations of stability in one mode tend to be neither overwhelmingly short nor long. [sent-92, score-0.188]
37 This effect is so characteristic of binocular rivalry that there have been countless psychophysical experiments measuring differences in the Gamma switching-time parameters across manipulations, and testing whether Gamma or log-normal distributions fit best [2]. [sent-93, score-0.749]
38 To account for this characteristic behavior, many papers have described neural circuits that could produce switching oscillations with the right stochastic dynamics (e. [sent-94, score-0.223]
39 Existing rational process models of multistability [7, 20, 22] likewise appeal to specific implementational-level constraints to produce this effect. Figure 2: (A) Simulated timecourse of bistability in the lattice MRF. [sent-97, score-0.603]
40 The horizontal lines show the thresholds for a perceptual switch. [sent-99, score-0.225]
41 (B) Distribution of simulated dominance durations (mean-normalized) for MRF with lattice topology. [sent-100, score-0.479]
42 Curves show Gamma distributions fitted to simulated data (with parameter values shown on the right) and to empirical data, replotted from [17]. [sent-101, score-0.137]
43 In contrast, here we show how Gamma-distributed dominance durations fall naturally out of MCMC operating on an MRF. [sent-102, score-0.375]
44 We constructed a 4 × 4 grid to model a typical binocular rivalry grating. [sent-103, score-0.66]
45 In the typical experiment reporting a Gamma distribution of dominance durations, subjects are asked to say which of two images corresponds to their “global” percept. [sent-104, score-0.232]
46 To make the same query of the current state of our simulated MCMC chain, we defined a perceptual switch to occur when at least 2/3 of the hidden nodes turn positive or negative. [sent-105, score-0.546]
47 Figure 2A shows a sample of the timecourse; the distribution of dominance durations and the maximum-likelihood estimates for the Gamma parameters α (shape) and β (scale) demonstrate that the durations produced by MCMC are well described by a Gamma distribution (Figure 2B). [sent-106, score-0.518]
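One way to read dominance durations off such a chain, and to fit them with a Gamma distribution, is sketched below. The 2/3 threshold follows the convention stated above; the fully ambiguous stimulus (x = 0), the step count, and the use of scipy's Gamma fit are assumptions of this sketch, and the coupling values may need tuning before the chain shows clear mode switching.

```python
import numpy as np
from scipy import stats

def dominance_durations(trace, frac=2.0 / 3.0):
    """Durations between perceptual switches in a trace of the mean +/-1 field.

    A switch is registered when the fraction of positive nodes crosses frac,
    i.e. when the mean field crosses +/- (2 * frac - 1).
    """
    thresh = 2.0 * frac - 1.0
    percept, switch_times = 0, []
    for t, m in enumerate(trace):
        new = +1 if m >= thresh else (-1 if m <= -thresh else percept)
        if percept != 0 and new != percept:
            switch_times.append(t)
        percept = new
    return np.diff(switch_times)

# Fully ambiguous stimulus: both global interpretations are equally supported.
trace = run_chain(x=np.zeros(16), edges=lattice_edges(4))
durations = dominance_durations(trace)
alpha, _, beta = stats.gamma.fit(durations, floc=0)   # MLE shape (alpha) and scale (beta)
```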
48 The Gamma distribution may arise in MCMC on an MRF because each hidden node takes an exponentially-distributed amount of time to switch (and these switches follow roughly one after another). [sent-108, score-0.323]
49 In these settings, the total amount of time until enough nodes switch to one mode will be Gamma-distributed (i.e., approximately a sum of exponentially distributed waiting times). [sent-109, score-0.206]
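This argument can be checked numerically: if each of k nodes waits an approximately exponential time to flip and the flips follow one another, the total switch time is a sum of exponentials and hence Gamma distributed. The node count and rate below are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, rate = 10, 0.05                                    # 10 nodes, mean wait of 20 steps each
totals = rng.exponential(1.0 / rate, size=(100000, k)).sum(axis=1)
shape, _, scale = stats.gamma.fit(totals, floc=0)
print(shape, scale)                                   # shape ~ k, scale ~ 1/rate
```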
50 In the model of [20, 22], since multiple samples may correspond to the same percept, a particular percept will lose dominance only when the weights on all such samples decrease below the weights on samples of the non-dominant percept. [sent-114, score-0.526]
51 By assuming an exponential decay on the weights, the time it takes for a single sample to lose dominance will be approximately exponentially distributed, leading to a Gamma distribution on the time it takes for multiple samples of the same percept to lose dominance. [sent-115, score-0.532]
52 Here we have attempted to capture this effect within a rational inference procedure by attributing the exponential dynamics to the operation of MCMC on individual nodes in the MRF, rather than a memory decay process on individual samples. [sent-116, score-0.397]
53 4.2 Contextual biases. Much discussion in research on multistability revolves around the extent to which it is influenced by top-down processes like prior knowledge and attention [2]. [sent-118, score-0.331]
54 This is not the phenomenology of bistability in a Necker cube, but it is the phenomenology of binocular rivalry with grating-like stimuli, where experiments have shown that substantial time is spent in transition periods [3]. [sent-120, score-0.501]
55 To capture the perception of depth in the Necker cube, or rivalry with more complex higher-level stimuli (like natural scenes), a more complex and densely interconnected graphical model would be required — in such cases the perceptual switching dynamics will be different. [sent-122, score-1.04]
56 Figure 3: (A) On the top are the standard tilted grating patches presented dichoptically. [sent-124, score-0.136]
57 On the bottom are the tilted grating patches superimposed on a background of rightward-tilting gratings, a contextual cue that biases dominance towards the rightward-tilting grating patch. [sent-125, score-0.49]
58 (B) Simulated timecourse of transient preference for a lattice-topology MRF with and without a contextual cue (averaged over 100 runs of the sampler). [sent-126, score-0.195]
59 (C) Empirical timecourse of transient preference fitted with a scaled cumulative Gaussian function, reprinted with permission from [17]. [sent-127, score-0.229]
60 With respect to such top-down influences, several studies have shown that contextual cues can bias the relative dominance of rival stimuli. [sent-128, score-0.332]
61 For example, [5] superimposed rivalrous tilted grating patches on a background of either rightward or leftward tilting gratings (Figure 3A) and showed that the direction of background tilt shifted dominance towards the monocular stimulus with context-compatible tilt. [sent-129, score-0.578]
62 Thus, human perceptual inference may similarly require an initial burn-in period to reach the stationary distribution. [sent-136, score-0.3]
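One simple way to add such a contextual cue to the MRF sketch above is as a weak uniform external field favoring the context-compatible interpretation; the bias term and its magnitude are assumptions of this sketch. Running the Metropolis chain against this biased target should shift the fraction of time spent in the favored mode, and the gradual build-up of that preference early in the run corresponds to the burn-in reading suggested above.

```python
import numpy as np

def log_posterior_with_context(z, x, edges, J=0.5, ell=0.1, bias=0.05):
    """Same MRF as before, plus a weak field pulling every node toward +1,
    standing in for the context-compatible background gratings."""
    return log_posterior(z, x, edges, J, ell) + bias * float(np.sum(z))
```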
63 4.3 Deviations from stable rivalry: fusion. Most models have focused on the “stable” portions of the bistable dynamics of rivalry; however, in addition to the mode-hopping behavior that characterizes this phenomenon, bistable percepts often produce other states. [sent-138, score-0.495]
64 In some conditions the two percepts are known to fuse, rather than rival: the percept then becomes a composite or superposition of the two stimuli (and hence no alternation is perceived). [sent-139, score-0.406]
65 This fused perceptual state can be induced most reliably by decreasing the distance in feature space between the two stimuli [11] (Figure 4B) or decreasing the contrast of both stimuli [15]. [sent-140, score-0.533]
66 Neither neural, nor algorithmic, nor computational models of rivalry have thus far attempted to explain these findings. [sent-142, score-0.416]
67 We define such a fused percept as a perceptual state lying between the two “bistable” modes — that is, an interpretation between the two rivalrous, high-probability interpretations. [sent-144, score-0.559]
68 By making the modes closer together or increasing the variance around the modes, greater probability mass is assigned to an intermediate zone between the modes—a fused percept. [sent-147, score-0.202]
69 Thus, manipulating stimulus separation (feature distance) or stimulus fidelity (contrast) changes the parameterizations of the likelihood function, and these manipulations produce systematically increasing odds of fused percepts, matching the phenomenology of these stimuli (Figure 4B). [sent-148, score-0.38]
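The claim can be illustrated with a one-dimensional caricature: a two-mode posterior over a single scene variable, where feature-space distance maps onto the separation between the modes, stimulus contrast onto their sharpness, and a fused percept corresponds to mass in the zone between the modes. All parameter values below are illustrative.

```python
from scipy import stats

def fused_mass(separation, sigma, width=0.5):
    """Probability mass in a central 'fused' zone of a symmetric two-mode posterior.

    separation : distance between the two modes (feature-space distance)
    sigma      : spread around each mode (roughly inverse stimulus contrast)
    width      : half-width of the zone counted as a fused percept
    """
    mass = 0.0
    for mu in (-separation / 2.0, separation / 2.0):   # equal-weight mixture of two modes
        mass += 0.5 * (stats.norm.cdf(width, mu, sigma)
                       - stats.norm.cdf(-width, mu, sigma))
    return mass

print(fused_mass(separation=1.0, sigma=0.5))   # close modes: substantial fused mass
print(fused_mass(separation=3.0, sigma=0.5))   # well-separated modes: little fused mass
```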
70 Figure 4: (A) Schematic illustration of manipulating orientation (feature space distance) and contrast in binocular rivalry stimuli. [sent-149, score-0.66]
71 (B) Experimental effects of increasing feature space distance (depth and color difference) between rivalrous gratings on exclusivity of monocular percepts, reprinted with permission from [11]. [sent-151, score-0.29]
72 Increasing the distance in feature space between rivalrous stimuli (C) or the contrast of both stimuli (D), modeled as increasing the separation between the modes or decreasing the variance around them, increases the probability of observing an exclusive percept in simulations. [sent-152, score-0.506]
73 4.4 Traveling waves. Fused percepts are not the only deviations from bistability. [sent-154, score-0.274]
74 In other circumstances, particularly in binocular rivalry, stability is often incomplete across the visual field, producing “piecemeal” rivalry, in which one portion of the visual field looks like the image in one eye, while another portion looks like the image in the other eye. [sent-155, score-0.464]
75 These traveling waves reveal interesting local dynamics during an individual switch itself, rather than just the Gamma-distributed dynamics of the time between complete switches of dominance. [sent-157, score-0.746]
76 Like fused percepts, these intraswitch dynamics have been generally ignored by models of multistability. [sent-158, score-0.224]
77 Demonstrating the dynamics of traveling waves within patches of the percept requires a different method of probing perception. [sent-159, score-0.612]
78 They found that the temporal delay in the response increased as a function of cortical distance from the V1 representation of the top of the annulus (Figure 5C). [sent-166, score-0.149]
79 To simulate such traveling waves within the percept of a stimulus, we constructed an MRF with ring topology and measured the propagation time (the time at which a mode-switch occurs) at different hidden nodes along the ring. [sent-167, score-0.646]
80 Consistent with the idea of wave propagation, Figure 5D shows the average time for a simulated node to switch modes as a function of distance around the ring. [sent-169, score-0.443]
81 Intuitively, nodes will tend to switch in a kind of “domino effect” around the ring; the local dependencies in the MRF ensure that nodes will be more likely to switch modes once their neighbors have switched. [sent-170, score-0.604]
82 Thus, once a switch at one node has been accepted by the Metropolis algorithm, a switch at its neighbor is likely to follow. [sent-171, score-0.369]
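A hedged sketch of the ring simulation: start the chain fully in one mode, give a single probe node strong evidence for the other mode (standing in for the transient contrast increment), and record when each node first flips; averaged over runs, the first-flip time should grow with ring distance from the probe. It reuses metropolis_step from the sketch above; the ring size, step budget, and the stronger coupling (echoing the compensation for fewer neighbors mentioned earlier) are assumptions and may need tuning.

```python
import numpy as np

def ring_edges(n):
    """Edges of a ring with n nodes."""
    return [(i, (i + 1) % n) for i in range(n)]

def wave_propagation(n=24, n_runs=100, n_steps=20000, J=1.2, probe_bias=2.0, seed=0):
    """Average first-flip time at each node after a switch is seeded at node 0."""
    rng = np.random.default_rng(seed)
    edges = ring_edges(n)
    first_flip = np.zeros(n)
    for _ in range(n_runs):
        z = np.ones(n, dtype=int)          # start fully in the +1 mode
        x = np.zeros(n)
        x[0] = -probe_bias                 # transient evidence pushing node 0 toward -1
        flipped = np.full(n, np.nan)
        for t in range(n_steps):
            z = metropolis_step(z, x, edges, rng, J=J, ell=1.0)
            newly = np.isnan(flipped) & (z == -1)
            flipped[newly] = t
        first_flip += np.nan_to_num(flipped, nan=n_steps)
    return first_flip / n_runs             # expected to increase with distance from node 0
```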
83 5 Conclusion. We have proposed a “rational process” model of perceptual multistability based on the idea that humans approximate the posterior distribution over the hidden causes of their visual input with a set of samples. [sent-172, score-0.764]
84 In particular, the dynamics of the sample-generating process gives rise to much of the rich dynamics in multistable perception observed experimentally. Figure 5: Traveling waves in binocular rivalry. [sent-173, score-0.509]
85 (left and center panels) and the subject percept reported by observers (right panel), in which the low-contrast stimulus was seen to spread around the annulus, starting at the top. [sent-175, score-0.274]
86 (D) A transient increase in contrast of the suppressed stimulus induces a perceptual switch at the location of contrast change. [sent-181, score-0.551]
87 The propagation time for a switch at a probe location increases with distance (around the annulus) from the switch origin. [sent-182, score-0.4]
89 These dynamics may be an approximation to the MCMC algorithms standardly used to solve difficult inference problems in machine learning and statistics [16]. [sent-184, score-0.209]
90 The idea that perceptual multistability can be construed in terms of sampling in a Bayesian model was first proposed by [20, 22], and our work follows theirs closely in several respects. [sent-185, score-0.514]
91 Our goal here is to show how some of the basic phenomena of multistable perception can be understood straightforwardly as the output of familiar, simple and effective methods for approximate inference in Bayesian machine vision. [sent-187, score-0.365]
92 Thus, our contribution is to show how an inference algorithm widely used in statistics and computer science can give rise naturally to perceptual multistability phenomena. [sent-189, score-0.589]
93 Another important direction in which to extend this work is from rivalry with low-level stimuli to more complex vision problems that involve global coherence over the image (such as in natural scenes). [sent-193, score-0.59]
94 Although similar perceptual dynamics have been observed with a wide range of ambiguous stimuli, the absence of obvious transition periods with the Necker cube suggests that these dynamics may differ in important ways from perception of rivalry stimuli. [sent-194, score-1.111]
95 The time course of binocular rivalry reveals a fundamental role of noise. [sent-213, score-0.66]
96 Contradictory influence of context on predominance during binocular rivalry. [sent-229, score-0.244]
97 Traveling waves of activity in primary visual cortex during binocular rivalry. [sent-274, score-0.45]
98 Failure of rivalry at low contrast: evidence of a suprathreshold binocular summation process. [sent-298, score-0.66]
99 Perceptual multistability predicted by search model for Bayesian decisions. [sent-343, score-0.289]
100 Minimal physiological conditions for binocular rivalry and rivalry memory. [sent-364, score-1.076]
wordName wordTfidf (topN-words)
[('rivalry', 0.416), ('multistability', 0.289), ('binocular', 0.244), ('dominance', 0.232), ('perceptual', 0.225), ('percept', 0.174), ('mcmc', 0.17), ('switch', 0.16), ('necker', 0.145), ('traveling', 0.143), ('percepts', 0.143), ('durations', 0.143), ('mrf', 0.138), ('dynamics', 0.134), ('waves', 0.131), ('rational', 0.111), ('annulus', 0.109), ('bistable', 0.109), ('multistable', 0.109), ('perception', 0.095), ('fused', 0.09), ('stimuli', 0.089), ('phenomena', 0.086), ('proposal', 0.076), ('posterior', 0.075), ('inference', 0.075), ('visual', 0.075), ('cube', 0.074), ('bistability', 0.072), ('rivalrous', 0.072), ('timecourse', 0.072), ('hidden', 0.07), ('modes', 0.07), ('contextual', 0.064), ('gamma', 0.06), ('zi', 0.059), ('lattice', 0.059), ('transient', 0.059), ('grating', 0.058), ('stimulus', 0.058), ('carlo', 0.058), ('monte', 0.058), ('interpretations', 0.057), ('metropolis', 0.054), ('permission', 0.054), ('chain', 0.053), ('bayesian', 0.053), ('cognition', 0.052), ('gratings', 0.051), ('cognitive', 0.05), ('vision', 0.05), ('suppressed', 0.049), ('node', 0.049), ('switching', 0.049), ('phenomenology', 0.048), ('tilted', 0.048), ('nodes', 0.046), ('simulated', 0.045), ('tend', 0.045), ('markov', 0.045), ('switches', 0.044), ('reprinted', 0.044), ('around', 0.042), ('ring', 0.042), ('vul', 0.041), ('distance', 0.04), ('characteristic', 0.04), ('eld', 0.04), ('propagation', 0.04), ('inferences', 0.037), ('wave', 0.037), ('manipulations', 0.037), ('brascamp', 0.036), ('rival', 0.036), ('schrater', 0.036), ('sundareswara', 0.036), ('zci', 0.036), ('phenomenon', 0.036), ('propagating', 0.035), ('image', 0.035), ('patch', 0.035), ('neighbors', 0.035), ('grif', 0.034), ('wilson', 0.034), ('scene', 0.034), ('brain', 0.033), ('lose', 0.033), ('multimodal', 0.033), ('ambiguous', 0.033), ('hypotheses', 0.032), ('replotted', 0.032), ('piecemeal', 0.032), ('spends', 0.032), ('annular', 0.032), ('depth', 0.032), ('decay', 0.031), ('causes', 0.03), ('patches', 0.03), ('samples', 0.029), ('monocular', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference
Author: Samuel Gershman, Ed Vul, Joshua B. Tenenbaum
Abstract: While many perceptual and cognitive phenomena are well described in terms of Bayesian inference, the necessary computations are intractable at the scale of realworld tasks, and it remains unclear how the human mind approximates Bayesian computations algorithmically. We explore the proposal that for some tasks, humans use a form of Markov Chain Monte Carlo to approximate the posterior distribution over hidden variables. As a case study, we show how several phenomena of perceptual multistability can be explained as MCMC inference in simple graphical models for low-level vision. 1
2 0.10451233 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
Author: Peter Carbonetto, Matthew King, Firas Hamze
Abstract: We describe a new algorithmic framework for inference in probabilistic models, and apply it to inference for latent Dirichlet allocation (LDA). Our framework adopts the methodology of variational inference, but unlike existing variational methods such as mean field and expectation propagation it is not restricted to tractable classes of approximating distributions. Our approach can also be viewed as a “population-based” sequential Monte Carlo (SMC) method, but unlike existing SMC methods there is no need to design the artificial sequence of distributions. Significantly, our framework offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation. We conduct experiments on a difficult inference problem in population genetics, a problem that is related to inference for LDA. The results of these experiments suggest that our method can offer improvements in stability and accuracy over existing methods, and at a comparable cost. 1
3 0.0991605 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing
Author: Ruben Coen-cagli, Peter Dayan, Odelia Schwartz
Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor–like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non–linear correlations between the responses of spatially distributed Gaborlike receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. 1
4 0.088305965 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization
Author: Adam Sanborn, Nick Chater, Katherine A. Heller
Abstract: Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions – and have a strong preference to generalize along the axes of these dimensions, but not “diagonally”. What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter. The learning behaviour of the model captures the developmental shift from roughly “isotropic” for children to the axis-aligned generalization that adults show. 1
Author: Ed Vul, George Alvarez, Joshua B. Tenenbaum, Michael J. Black
Abstract: Multiple object tracking is a task commonly used to investigate the architecture of human visual attention. Human participants show a distinctive pattern of successes and failures in tracking experiments that is often attributed to limits on an object system, a tracking module, or other specialized cognitive structures. Here we use a computational analysis of the task of object tracking to ask which human failures arise from cognitive limitations and which are consequences of inevitable perceptual uncertainty in the tracking task. We find that many human performance phenomena, measured through novel behavioral experiments, are naturally produced by the operation of our ideal observer model (a Rao-Blackwelized particle filter). The tradeoff between the speed and number of objects being tracked, however, can only arise from the allocation of a flexible cognitive resource, which can be formalized as either memory or attention. 1
6 0.086674444 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
7 0.086400092 141 nips-2009-Local Rules for Global MAP: When Do They Work ?
8 0.083547555 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
9 0.07798367 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
10 0.070327029 235 nips-2009-Structural inference affects depth perception in the context of potential occlusion
11 0.064988106 133 nips-2009-Learning models of object structure
12 0.064747714 124 nips-2009-Lattice Regression
13 0.062180344 151 nips-2009-Measuring Invariances in Deep Networks
14 0.061902247 164 nips-2009-No evidence for active sparsification in the visual cortex
15 0.061861355 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection
16 0.061069079 100 nips-2009-Gaussian process regression with Student-t likelihood
17 0.060420047 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
18 0.060408421 43 nips-2009-Bayesian estimation of orientation preference maps
19 0.060400855 187 nips-2009-Particle-based Variational Inference for Continuous Systems
20 0.059426341 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities
topicId topicWeight
[(0, -0.184), (1, -0.14), (2, 0.029), (3, -0.054), (4, 0.025), (5, 0.004), (6, 0.074), (7, 0.038), (8, -0.044), (9, -0.096), (10, 0.002), (11, -0.003), (12, 0.051), (13, -0.042), (14, 0.096), (15, 0.007), (16, -0.014), (17, 0.036), (18, -0.044), (19, -0.042), (20, -0.07), (21, 0.017), (22, 0.044), (23, -0.029), (24, 0.08), (25, -0.01), (26, -0.021), (27, 0.094), (28, -0.076), (29, 0.027), (30, -0.004), (31, -0.042), (32, -0.104), (33, -0.034), (34, -0.09), (35, 0.012), (36, 0.119), (37, -0.1), (38, -0.084), (39, -0.035), (40, -0.066), (41, -0.004), (42, -0.031), (43, -0.054), (44, -0.116), (45, 0.04), (46, -0.055), (47, 0.046), (48, -0.055), (49, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.93715358 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference
Author: Samuel Gershman, Ed Vul, Joshua B. Tenenbaum
Abstract: While many perceptual and cognitive phenomena are well described in terms of Bayesian inference, the necessary computations are intractable at the scale of realworld tasks, and it remains unclear how the human mind approximates Bayesian computations algorithmically. We explore the proposal that for some tasks, humans use a form of Markov Chain Monte Carlo to approximate the posterior distribution over hidden variables. As a case study, we show how several phenomena of perceptual multistability can be explained as MCMC inference in simple graphical models for low-level vision. 1
2 0.63094175 152 nips-2009-Measuring model complexity with the prior predictive
Author: Wolf Vanpaemel
Abstract: In the last few decades, model complexity has received a lot of press. While many methods have been proposed that jointly measure a model’s descriptive adequacy and its complexity, few measures exist that measure complexity in itself. Moreover, existing measures ignore the parameter prior, which is an inherent part of the model and affects the complexity. This paper presents a stand alone measure for model complexity, that takes the number of parameters, the functional form, the range of the parameters and the parameter prior into account. This Prior Predictive Complexity (PPC) is an intuitive and easy to compute measure. It starts from the observation that model complexity is the property of the model that enables it to fit a wide range of outcomes. The PPC then measures how wide this range exactly is. keywords: Model Selection & Structure Learning; Model Comparison Methods; Perception The recent revolution in model selection methods in the cognitive sciences was driven to a large extent by the observation that computational models can differ in their complexity. Differences in complexity put models on unequal footing when their ability to approximate empirical data is assessed. Therefore, models should be penalized for their complexity when their adequacy is measured. The balance between descriptive adequacy and complexity has been termed generalizability [1, 2]. Much attention has been devoted to developing, advocating, and comparing different measures of generalizability (for a recent overview, see [3]). In contrast, measures of complexity have received relatively little attention. The aim of the current paper is to propose and illustrate a stand alone measure of model complexity, called the Prior Predictive Complexity (PPC). The PPC is based on the intuitive idea that a complex model can predict many outcomes and a simple model can predict a few outcomes only. First, I discuss existing approaches to measuring model complexity and note some of their limitations. In particular, I argue that currently existing measures ignore one important aspect of a model: the prior distribution it assumes over the parameters. I then introduce the PPC, which, unlike the existing measures, is sensitive to the parameter prior. Next, the PPC is illustrated by calculating the complexities of two popular models of information integration. 1 Previous approaches to measuring model complexity A first approach to assess the (relative) complexity of models relies on simulated data. Simulationbased methods differ in how these artificial data are generated. A first, atheoretical approach uses random data [4, 5]. In the semi-theoretical approach, the data are generated from some theoretically ∗ I am grateful to Michael Lee and Liz Bonawitz. 1 interesting functions, such as the exponential or the logistic function [4]. Using these approaches, the models under consideration are equally complex if each model provides the best optimal fit to roughly the same number of data sets. A final approach to generating artificial data is a theoretical one, in which the data are generated from the models of interest themselves [6, 7]. The parameter sets used in the generation can either be hand-picked by the researcher, estimated from empirical data or drawn from a previously specified distribution. If the models under consideration are equally complex, each model should provide the best optimal fit to self-generated data more often than the other models under consideration do. 
One problem with this simulation-based approach is that it is very labor intensive. It requires generating a large amount of artificial data sets, and fitting the models to all these data sets. Further, it relies on choices that are often made in an arbitrary fashion that nonetheless bias the results. For example, in the semi-theoretical approach, a crucial choice is which functions to use. Similarly, in the theoretical approach, results are heavily influenced by the parameter values used in generating the data. If they are fixed, on what basis? If they are estimated from empirical data, from which data? If they are drawn randomly, from which distribution? Further, a simulation study only gives a rough idea of complexity differences but provides no direct measure reflecting the complexity. A number of proposals have been made to measure model complexity more directly. Consider a model M with k parameters, summarized in the parameter vector θ = (θ1 , θ2 , . . . , θk , ) which has a range indicated by Ω. Let d denote the data and p(d|θ, M ) the likelihood. The most straightforward measure of model complexity is the parametric complexity (PC), which simply counts the number of parameters: PC = k. (1) PC is attractive as a measure of model complexity since it is very easy to calculate. Further, it has a direct and well understood relation toward complexity: the more parameters, the more complex the model. It is included as the complexity term of several generalizability measures such as AIC [8] and BIC [9], and it is at the heart of the Likelihood Ratio Test. Despite this intuitive appeal, PC is not free from problems. One problem with PC is that it reflects only a single aspect of complexity. Also the parameter range and the functional form (the way the parameters are combined in the model equation) influence a model’s complexity, but these dimensions of complexity are ignored in PC [2, 6]. A complexity measure that takes these three dimensions into account is provided by the geometric complexity (GC) measure, which is inspired by differential geometry [10]. In GC, complexity is conceptualized as the number of distinguishable probability distributions a model can generate. It is defined by GC = k n ln + ln 2 2π det I(θ|M )dθ, (2) Ω where n indicates the size of the data sample and I(θ) is the Fisher Information Matrix: Iij (θ|M ) = −Eθ ∂ 2 ln p(d|θ, M ) . ∂θi ∂θj (3) Note that I(θ|M ) is determined by the likelihood function p(d|θ, M ), which is in turn determined by the model equation. Hence GC is sensitive to the number of parameters (through k), the functional form (through I), and the range (through Ω). Quite surprisingly, GC turns out to be equal to the complexity term used in one version of Minimum Description Length (MDL), a measure of generalizability developed within the domain of information theory [2, 11, 12, 13]. GC contrasts favorably with PC, in the sense that it takes three dimensions of complexity into account rather than a single one. A major drawback of GC is that, unlike PC, it requires considerable technical sophistication to be computed, as it relies on the second derivative of the likelihood. A more important limitation of both PC and GC is that these measures are insensitive to yet another important dimension contributing to model complexity: the prior distribution over the model parameters. The relation between the parameter prior distribution and model complexity is discussed next. 
2 2 Model complexity and the parameter prior The growing popularity of Bayesian methods in psychology has not only raised awareness that model complexity should be taken into account when testing models [6], it has also drawn attention to the fact that in many occasions, relevant prior information is available [14]. In Bayesian methods, there is room to incorporate this information in two different flavors: as a prior distribution over the models, or as a prior distribution over the parameters. Specifying a model prior is a daunting task, so almost invariably, the model prior is taken to be uniform (but see [15] for an exception). In contrast, information regarding the parameter is much easier to include, although still challenging (e.g., [16]). There are two ways to formalize prior information about a model’s parameters: using the parameter prior range (often referred to as simply the range) and using the parameter prior distribution (often referred to as simply the prior). The prior range indicates which parameter values are allowed and which are forbidden. The prior distribution indicates which parameter values are likely and which are unlikely. Models that share the same equation and the same range but differ in the prior distribution can be considered different models (or at least different model versions), just like models that share the same equation but differ in range are different model versions. Like the parameter prior range, the parameter prior distribution influences the model complexity. In general, a model with a vague parameter prior distribution is more complex than a model with a sharply peaked parameter prior distribution, much as a model with a broad-ranged parameter is more complex than the same model where the parameter is heavily restricted. To drive home the point that the parameter prior should be considered when model complexity is assessed, consider the following “fair coin” model Mf and a “biased coin” model Mb . There is a clear intuitive complexity difference between these models: Mb is more complex than Mf . The most straightforward way to formalize these models is as follows, where ph denotes the probability of observing heads: ph = 1/2, (4) ph = θ 0≤θ≤1 p(θ) = 1, (5) for model Mf and the triplet of equations jointly define model Mb . The range forbids values smaller than 0 or greater than 1 because ph is a proportion. As Mf and Mb have a different number of parameters, both PC and GC, being sensitive to the number of parameters, pick up the difference in model complexity between the models. Alternatively, model Mf could be defined as follows: ph = θ 0≤θ≤1 1 p(θ) = δ(θ − ), 2 (6) where δ(x) is the Dirac delta. Note that the model formalized in Equation 6 is exactly identical the model formalized in Equation 4. However, relying on the formulation of model Mf in Equation 6, PC and GC now judge Mf and Mb to be equally complex: both models share the same model equation (which implies they have the same number of parameters and the same functional form) and the same range for the parameter. Hence, PC and GC make an incorrect judgement of the complexity difference between both models. This misjudgement is a direct result of the insensitivity of these measures to the parameter prior. As models Mf and Mb have different prior distributions over their parameter, a measure sensitive to the prior would pick up the complexity difference between these models. Such a measure is introduced next. 
3 The Prior Predictive Complexity Model complexity refers to the property of the model that enables it to predict a wide range of data patterns [2]. The idea of the PPC is to measure how wide this range exactly is. A complex model 3 can predict many outcomes, and a simple model can predict a few outcomes only. Model simplicity, then, refers to the property of placing restrictions on the possible outcomes: the greater restrictions, the greater the simplicity. To understand how model complexity is measured in the PPC, it is useful to think about the universal interval (UI) and the predicted interval (PI). The universal interval is the range of outcomes that could potentially be observed, irrespective of any model. For example, in an experiment with n binomial trials, it is impossible to observe less that zero successes, or more than n successes, so the range of possible outcomes is [0, n] . Similarly, the universal interval for a proportion is [0, 1]. The predicted interval is the interval containing all outcomes the model predicts. An intuitive way to gauge model complexity is then the cardinality of the predicted interval, relative to the cardinality of the universal interval, averaged over all m conditions or stimuli: PPC = 1 m m i=1 |PIi | . |UIi | (7) A key aspect of the PPC is deriving the predicted interval. For a parameterized likelihood-based model, prediction takes the form of a distribution over all possible outcomes for some future, yet-tobe-observed data d under some model M . This distribution is called the prior predictive distribution (ppd) and can be calculated using the law of total probability: p(d|M ) = p(d|θ, M )p(θ|M )dθ. (8) Ω Predicting the probability of unseen future data d arising under the assumption that model M is true involves integrating the probability of the data for each of the possible parameter values, p(d|θ, M ), as weighted by the prior probability of each of these values, p(θ|M ). Note that the ppd relies on the number of parameters (through the number of integrals and the likelihood), the model equation (through the likelihood), and the parameter range (through Ω). Therefore, as GC, the PPC is sensitive to all these aspects. In contrast to GC, however, the ppd, and hence the PPC, also relies on the parameter prior. Since predictions are made probabilistically, virtually all outcomes will be assigned some prior weight. This implies that, in principle, the predicted interval equals the universal interval. However, for some outcomes the assigned weight will be extremely small. Therefore, it seems reasonable to restrict the predicted interval to the smallest interval that includes some predetermined amount of the prior mass. For example, the 95% predictive interval is defined by those outcomes with the highest prior mass that together make up 95% of the prior mass. Analytical solutions to the integral defining the ppd are rarely available. Instead, one should rely on approximations to the ppd by drawing samples from it. In the current study, sampling was performed using WinBUGS [17, 18], a highly versatile, user friendly, and freely available software package. It contains sophisticated and relatively general-purpose Markov Chain Monte Carlo (MCMC) algorithms to sample from any distribution of interest. 
4 An application example The PPC is illustrated by comparing the complexity of two popular models of information integration, which attempt to account for how people merge potentially ambiguous or conflicting information from various sensorial sources to create subjective experience. These models either assume that the sources of information are combined additively (the Linear Integration Model; LIM; [19]) or multiplicatively (the Fuzzy Logical Model of Perception; FLMP; [20, 21]). 4.1 Information integration tasks A typical information integration task exposes participants simultaneously to different sources of information and requires this combined experience to be identified in a forced-choice identification task. The presented stimuli are generated from a factorial manipulation of the sources of information by systematically varying the ambiguity of each of the sources. The relevant empirical data consist 4 of, for each of the presented stimuli, the counts km of the number of times the mth stimulus was identified as one of the response alternatives, out of the tm trials on which it was presented. For example, an experiment in phonemic identification could involve two phonemes to be identified, /ba/ and /da/ and two sources of information, auditory and visual. Stimuli are created by crossing different levels of audible speech, varying between /ba/ and /da/, with different levels of visible speech, also varying between these alternatives. The resulting set of stimuli spans a continuum between the two syllables. The participant is then asked to listen and to watch the speaker, and based on this combined audiovisual experience, to identify the syllable as being either /ba/ or /da/. In the so-called expanded factorial design, not only bimodal stimuli (containing both auditory and visual information) but also unimodal stimuli (providing only a single source of information) are presented. 4.2 Information integration models In what follows, the formal description of the LIM and the FLMP is outlined for a design with two response alternatives (/da/ or /ba/) and two sources (auditory and visual), with I and J levels, respectively. In such a two-choice identification task, the counts km follow a Binomial distribution: km ∼ Binomial(pm , tm ), (9) where pm indicates the probability that the mth stimulus is identified as /da/. 4.2.1 Model equation The probability for the stimulus constructed with the ith level of the first source and the jth level of the second being identified as /da/ is computed according to the choice rule: pij = s (ij, /da/) , s (ij, /da/) + s (ij, /ba/) (10) where s (ij, /da/) represents the overall degree of support for the stimulus to be /da/. The sources of information are assumed to be evaluated independently, implying that different parameters are used for the different modalities. In the present example, the degree of auditory support for /da/ is denoted by ai (i = 1, . . . , I) and the degree of visual support for /da/ by bj (j = 1, . . . , J). When a unimodal stimulus is presented, the overall degree of support for each alternative is given by s (i∗, /da/) = ai and s (∗j, /da/) = bj , where the asterisk (*) indicates the absence of information, implying that Equation 10 reduces to pi∗ = ai and p∗j = bj . (11) When a bimodal stimulus is presented, the overall degree of support for each alternative is based on the integration or blending of both these sources. Hence, for bimodal stimuli, s (ij, /da/) = ai bj , where the operator denotes the combination of both sources. 
Hence, Equation 10 reduces to ai bj . (12) pij = ai bj + (1 − ai ) (1 − bj ) = +, so Equation 12 becomes The LIM assumes an additive combination, i.e., pij = ai + bj . 2 (13) The FLMP, in contrast, assumes a multiplicative combination, i.e., = ×, so Equation 12 becomes ai bj . ai bj + (1 − ai )(1 − bj ) (14) pij = 5 4.2.2 Parameter prior range and distribution Each level of auditory and visual support for /da/ (i.e., ai and bj , respectively) is associated with a free parameter, which implies that the FLMP and the LIM have an equal number of free parameters, I + J. Each of these parameters is constrained to satisfy 0 ≤ ai , bj ≤ 1. The original formulations of the LIM and FLMP unfortunately left the parameter priors unspecified. However, an implicit assumption that has been commonly used is a uniform prior for each of the parameters. This assumption implicitly underlies classical and widely adopted methods for model evaluation using accounted percentage of variance or maximum likelihood. ai ∼ Uniform(0, 1) and bi ∼ Uniform(0, 1) for i = 1, . . . , I; j = 1, . . . , J. (15) The models relying on this set of uniform priors will be referred to as LIMu and FLMPu . Note that LIMu and FLMPu treat the different parameters as independent. This approach misses important information. In particular, the experimental design is such that the amount of support for each level i + 1 is always higher than for level i. Because parameter ai (or bi ) corresponds to the degree of auditory (or visual) support for a unimodal stimulus at the ith level, it seems reasonable to expect the following orderings among the parameters to hold (see also [6]): aj > ai and bj > bi for j > i. (16) The models relying on this set of ordered priors will be referred to as LIMo and FLMPo . 4.3 Complexity and experimental design It is tempting to consider model complexity as an inherent characteristic of a model. For some models and for some measures of complexity this is clearly the case. Consider, for example, model Mb . In any experimental design (i.e., a number of coin tosses), PCMb = 1. However, more generally, this is not the case. Focusing on the FLMP and the LIM, it is clear that even a simple measure as PC depends crucially on (some aspects of) the experimental design. In particular, every level corresponds to a new parameter, so PC = I + J . Similarly, GC is dependent on design choices. The PPC is not different in this respect. The design sensitivity implies that one can only make sensible conclusions about differences in model complexity by using different designs. In an information integration task, the design decisions include the type of design (expanded or not), the number of sources, the number of response alternatives, the number of levels for each source, and the number of observations for each stimulus (sample size). The present study focuses on the expanded factorial designs with two sources and two response alternatives. The additional design features were varied: both a 5 × 5 and a 8 × 2 design were considered, using three different sample sizes (20, 60 and 150, following [2]). 4.4 Results Figure 1 shows the 99% predicted interval in the 8×2 design with n = 150. Each panel corresponds to a different model. In each panel, each of the 26 stimuli is displayed on the x-axis. The first eight stimuli correspond to the stimuli with the lowest level of visual support, and are ordered in increasing order of auditory support. The next eight stimuli correspond to the stimuli with the highest level of visual support. 
The next eight stimuli correspond to the unimodal stimuli where only auditory information is provided (again ranked in increasing order). The final two stimuli are the unimodal visual stimuli. Panel A shows that the predicted interval of LIMu nearly equals the universal interval, ranging between 0 and 1. This indicates that almost all outcomes are given a non-negligible prior mass by LIMu , making it almost maximally complex. FLMPu is even more complex. The predicted interval, shown in Panel B, virtually equals the universal interval, indicating that the model predicts virtually every possible outcome. Panels C and D show the dramatic effect of incorporating relevant prior information in the models. The predicted intervals of both LIMo and FLMPo are much smaller than their counterparts using the uniform priors. Focusing on the comparison between LIM and FLMP, the PPC indicates that the latter is more complex than the former. This observation holds irrespective of the model version (assuming uniform 6 0.9 0.8 0.8 Proportion of /da/ responses 1 0.9 Proportion of /da/ responses 1 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.7 0.6 0.5 0.4 0.3 0.2 0.1 11 21 A 1* 0 *1 11 21 B 1* *1 1* *1 0.8 Proportion of /da/ responses 0.9 0.8 21 1 0.9 Proportion of /da/ responses 1 11 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.7 0.6 0.5 0.4 0.3 0.2 0.1 11 21 C 1* 0 *1 D Figure 1: The 99% predicted interval for each of the 26 stimuli (x-axis) according to LIMu (Panel A), FLMPu (Panel B), LIMo (Panel C), and FLMPo (Panel D). Table 1: PPC, based on the 99% predicted interval, for four models across six different designs. 20 LIMu FLMPu LIMo FLMPo 5×5 60 150 20 8×2 60 150 0.97 1 0.75 0.83 0.94 1 0.67 0.80 .97 1 0.77 0.86 0.95 1 0.69 0.82 0.93 0.99 0.64 0.78 7 0.94 0.99 0.66 0.81 vs. ordered priors). The smaller complexity of LIM is in line with previous attempts to measure the relative complexities of LIM and FLMP, such as the atheoretical simulation-based approach ([4] but see [5]), the semi-theoretical simulation-based approach [4], the theoretical simulation-based approach [2, 6, 22], and a direct computation of the GC [2]. The PPC’s for all six designs considered are displayed in Table 1. It shows that the observations made for the 8 × 2, n = 150 design holds across the five remaining designs as well: LIM is simpler than FLMP; and models assuming ordered priors are simpler than models assuming uniform priors. Note that these conclusions would not have been possible based on PC or GC. For PC, all four models have the same complexity. GC, in contrast, would detect complexity differences between LIM and FLMP (i.e., the first conclusion), but due to its insensitivity to the parameter prior, the complexity differences between LIMu and LIMo on the one hand, and FLMPu and FLMPo on the other hand (i.e., the second conclusion) would have gone unnoticed. 5 Discussion A theorist defining a model should clearly and explicitly specify at least the three following pieces of information: the model equation, the parameter prior range, and the parameter prior distribution. If any of these pieces is missing, the model should be regarded as incomplete, and therefore untestable. Consequently, any measure of generalizability should be sensitive to all three aspects of the model definition. Many currently popular generalizability measures do not satisfy this criterion, including AIC, BIC and MDL. A measure of generalizability that does take these three aspects of a model into account is the marginal likelihood [6, 7, 14, 23]. 
Often, the marginal likelihood is criticized exactly for its sensitivity to the prior range and distribution (e.g., [24]). However, in the light of the fact that the prior is a part of the model definition, I see the sensitivity of the marginal likelihood to the prior as an asset rather than a nuisance. It is precisely the measures of generalizability that are insensitive to the prior that miss an important aspect of the model. Similarly, any stand alone measure of model complexity should be sensitive to all three aspects of the model definition, as all three aspects contribute to the model’s complexity (with the model equation contributing two factors: the number of parameters and the functional form). Existing measures of complexity do not satisfy this requirement and are therefore incomplete. PC takes only part of the model equation into account, whereas GC takes only the model equation and the range into account. In contrast, the PPC currently proposed is sensitive to all these three aspects. It assesses model complexity using the predicted interval which contains all possible outcomes a model can generate. A narrow predicted interval (relative to the universal interval) indicates a simple model; a complex model is characterized by a wide predicted interval. There is a tight coupling between the notions of information, knowledge and uncertainty, and the notion of model complexity. As parameters correspond to unknown variables, having more information available leads to fewer parameters and hence to a simpler model. Similarly, the more information there is available, the sharper the parameter prior, implying a simpler model. To put it differently, the less uncertainty present in a model, the narrower its predicted interval, and the simpler the model. For example, in model Mb , there is maximal uncertainty. Nothing but the range is known about θ, so all values of θ are equally likely. In contrast, in model Mf , there is minimal uncertainty. In fact, ph is known for sure, so only a single value of θ is possible. This difference in uncertainty is translated in a difference in complexity. The same is true for the information integration models. Incorporating the order constraints in the priors reduces the uncertainty compared to the models without these constraints (it tells you, for example, that parameter a1 is smaller than a2 ). This reduction in uncertainty is reflected by a smaller complexity. There are many different sources of prior information that can be translated in a range or distribution. The illustration using the information integration models highlighted that prior information can reflect meaningful information in the design. Alternatively, priors can be informed by previous applications of similar models in similar settings. Probably the purest form of priors are those that translate theoretical assumptions made by a model (see [16]). The fact that it is often difficult to formalize this prior information may not be used as an excuse to leave the prior unspecified. Sure it is a challenging task, but so is translating theoretical assumptions into the model equation. Formalizing theory, intuitions, and information is what model building is all about. 8 References [1] Myung, I. J. (2000) The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190–204. [2] Pitt, M. A., Myung, I. J., and Zhang, S. (2002) Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491. [3] Shiffrin, R. M., Lee, M. 
D., Kim, W., and Wagenmakers, E. J. (2008) A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, 1248–1284. [4] Cutting, J. E., Bruno, N., Brady, N. P., and Moore, C. (1992) Selectivity, scope, and simplicity of models: A lesson from fitting judgments of perceived depth. Journal of Experimental Psychology: General, 121, 364–381. [5] Dunn, J. (2000) Model complexity: The fit to random data reconsidered. Psychological Research, 63, 174–182. [6] Myung, I. J. and Pitt, M. A. (1997) Applying Occam’s razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79–95. [7] Vanpaemel, W. and Storms, G. (in press) Abstraction and model evaluation in category learning. Behavior Research Methods. [8] Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Petrov, B. and Csaki, F. (eds.), Second International Symposium on Information Theory, pp. 267–281, Akademiai Kiado. [9] Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464. [10] Myung, I. J., Balasubramanian, V., and Pitt, M. A. (2000) Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences, 97, 11170–11175. [11] Lee, M. D. (2002) Generating additive clustering models with minimal stochastic complexity. Journal of Classification, 19, 69–85. [12] Rissanen, J. (1996) Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47. [13] Grünwald, P. (2000) Model selection based on minimum description length. Journal of Mathematical Psychology, 44, 133–152. [14] Lee, M. D. and Wagenmakers, E. J. (2005) Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668. [15] Lee, M. D. and Vanpaemel, W. (2008) Exemplars, prototypes, similarities and rules in category representation: An example of hierarchical Bayesian analysis. Cognitive Science, 32, 1403–1424. [16] Vanpaemel, W. and Lee, M. D. (submitted) Using priors to formalize theory: Optimal attention and the generalized context model. [17] Lee, M. D. (2008) Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 15, 1–15. [18] Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2004) WinBUGS User Manual Version 2.0. Medical Research Council Biostatistics Unit, Institute of Public Health, Cambridge. [19] Anderson, N. H. (1981) Foundations of information integration theory. Academic Press. [20] Oden, G. C. and Massaro, D. W. (1978) Integration of featural information in speech perception. Psychological Review, 85, 172–191. [21] Massaro, D. W. (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press. [22] Massaro, D. W., Cohen, M. M., Campbell, C. S., and Rodriguez, T. (2001) Bayes factor of model selection validates FLMP. Psychonomic Bulletin & Review, 8, 1–17. [23] Kass, R. E. and Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–795. [24] Liu, C. C. and Aitkin, M. (2008) Bayes factors: Prior sensitivity and model generalizability. Journal of Mathematical Psychology, 53, 362–375.
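As a concrete illustration of the predicted-interval computation underlying the PPC discussed above, the following Python sketch estimates a 99% predicted interval by prior-predictive simulation. It uses a toy one-parameter binomial response model with an assumed design size, in the spirit of the contrast between a maximally uncertain model (uniform prior on θ, like Mb) and a minimally uncertain one (θ fixed, like Mf); it is not the LIM or FLMP model equations.

import numpy as np

def predicted_interval(sample_theta, n_trials, n_sims=20000, coverage=0.99):
    # Draw theta from the prior, simulate a response proportion for each draw,
    # and return the central 99% interval of the prior predictive distribution.
    rng = np.random.default_rng(0)
    props = np.array([rng.binomial(n_trials, sample_theta(rng)) / n_trials
                      for _ in range(n_sims)])
    alpha = (1.0 - coverage) / 2.0
    return np.quantile(props, [alpha, 1.0 - alpha])

n = 150  # assumed number of observations per stimulus
wide = predicted_interval(lambda rng: rng.uniform(0.0, 1.0), n)   # maximal uncertainty about theta
narrow = predicted_interval(lambda rng: 0.5, n)                   # theta known exactly (assumed value)
print("uniform prior:", wide, "width:", wide[1] - wide[0])
print("point prior:  ", narrow, "width:", narrow[1] - narrow[0])
# Dividing each width by the length of the universal interval (here 1) gives a PPC-style complexity measure.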
3 0.60772026 39 nips-2009-Bayesian Belief Polarization
Author: Alan Jern, Kai-min Chang, Charles Kemp
Abstract: Empirical studies have documented cases of belief polarization, where two people with opposing prior beliefs both strengthen their beliefs after observing the same evidence. Belief polarization is frequently offered as evidence of human irrationality, but we demonstrate that this phenomenon is consistent with a fully Bayesian approach to belief revision. Simulation results indicate that belief polarization is not only possible but relatively common within the set of Bayesian models that we consider. Suppose that Carol has requested a promotion at her company and has received a score of 50 on an aptitude test. Alice, one of the company’s managers, began with a high opinion of Carol and became even more confident of her abilities after seeing her test score. Bob, another manager, began with a low opinion of Carol and became even less confident about her qualifications after seeing her score. On the surface, it may appear that either Alice or Bob is behaving irrationally, since the same piece of evidence has led them to update their beliefs about Carol in opposite directions. This situation is an example of belief polarization [1, 2], a widely studied phenomenon that is often taken as evidence of human irrationality [3, 4]. In some cases, however, belief polarization may appear much more sensible when all the relevant information is taken into account. Suppose, for instance, that Alice was familiar with the aptitude test and knew that it was scored out of 60, but that Bob was less familiar with the test and assumed that the score was a percentage. Even though only one interpretation of the score can be correct, Alice and Bob have both made rational inferences given their assumptions about the test. Some instances of belief polarization are almost certain to qualify as genuine departures from rational inference, but we argue in this paper that others will be entirely compatible with a rational approach. Distinguishing between these cases requires a precise normative standard against which human inferences can be compared. We suggest that Bayesian inference provides this normative standard, and present a set of Bayesian models that includes cases where polarization can and cannot emerge. Our work is in the spirit of previous studies that use careful rational analyses in order to illuminate apparently irrational human behavior (e.g. [5, 6, 7]). Previous studies of belief polarization have occasionally taken a Bayesian approach, but often the goal is to show how belief polarization can emerge as a consequence of approximate inference in a Bayesian model that is subject to memory constraints or processing limitations [8]. In contrast, we demonstrate that some examples of polarization are compatible with a fully Bayesian approach. Other formal accounts of belief polarization have relied on complex versions of utility theory [9], or have focused on continuous hypothesis spaces [10] unlike the discrete hypothesis spaces usually considered by psychological studies of belief polarization. We focus on discrete hypothesis spaces and require no additional machinery beyond the basics of Bayesian inference. We begin by introducing the belief revision phenomena considered in this paper and developing a Bayesian approach that clarifies whether and when these phenomena should be considered irrational. We then consider several Bayesian models that are capable of producing belief polarization and illustrate them with concrete examples. 
Having demonstrated that belief polarization is compatible with a Bayesian approach, we present simulations suggesting that this phenomenon is relatively generic within the space of models that we consider. We finish with some general comments on human rationality and normative models.

Figure 1: Examples of belief updating behaviors for two individuals, A (solid line) and B (dashed line). The individuals begin with different beliefs about hypothesis h1. After observing the same set of evidence, their beliefs may (a) move in opposite directions (contrary updating, with divergence shown in (i) and convergence in (ii)) or (b) move in the same direction (parallel updating).

1 Belief revision phenomena The term “belief polarization” is generally used to describe situations in which two people observe the same evidence and update their respective beliefs in the directions of their priors. A study by Lord et al. [1] provides one classic example in which participants read about two studies, one of which concluded that the death penalty deters crime and another which concluded that the death penalty has no effect on crime. After exposure to this mixed evidence, supporters of the death penalty strengthened their support and opponents strengthened their opposition. We will treat belief polarization as a special case of contrary updating, a phenomenon where two people update their beliefs in opposite directions after observing the same evidence (Figure 1a). We distinguish between two types of contrary updating. Belief divergence refers to cases in which the person with the stronger belief in some hypothesis increases the strength of his or her belief and the person with the weaker belief in the hypothesis decreases the strength of his or her belief (Figure 1a(i)). Divergence therefore includes cases of traditional belief polarization. The opposite of divergence is belief convergence (Figure 1a(ii)), in which the person with the stronger belief decreases the strength of his or her belief and the person with the weaker belief increases the strength of his or her belief. Contrary updating may be contrasted with parallel updating (Figure 1b), in which the two people update their beliefs in the same direction. Throughout this paper, we consider only situations in which both people change their beliefs after observing some evidence. All such situations can be unambiguously classified as instances of parallel or contrary updating. Parallel updating is clearly compatible with a normative approach, but the normative status of divergence and convergence is less clear. Many authors argue that divergence is irrational, and many of the same authors also propose that convergence is rational [2, 3]. For example, Baron [3] writes that “Normatively, we might expect that beliefs move toward the middle of the range when people are presented with mixed evidence” (p. 210). The next section presents a formal analysis that challenges the conventional wisdom about these phenomena and clarifies the cases where they can be considered rational. 2 A Bayesian approach to belief revision Since belief revision involves inference under uncertainty, Bayesian inference provides the appropriate normative standard. Consider a problem where two people observe data d that bear on some hypothesis h1. Let P1(·) and P2(·) be distributions that capture the two people’s respective beliefs. 
Contrary updating occurs whenever one person’s belief in h1 increases and the other person’s belief in h1 decreases, or when

[P1(h1|d) − P1(h1)] [P2(h1|d) − P2(h1)] < 0.    (1)

Figure 2: (a) A simple Bayesian network that cannot produce either belief divergence or belief convergence. (b)–(h) All possible three-node Bayes nets subject to the constraints described in the text. Networks in Family 1 can produce only parallel updating, but networks in Family 2 can produce both parallel and contrary updating.

We will use Bayesian networks to capture the relationships between H, D, and any other variables that are relevant to the situation under consideration. For example, Figure 2a captures the idea that the data D are probabilistically generated from hypothesis H. The remaining networks in Figure 2 show several other ways in which D and H may be related, and will be discussed later. We assume that the two individuals agree on the variables that are relevant to a problem and agree about the relationships between these variables. We can formalize this idea by requiring that both people agree on the structure and the conditional probability distributions (CPDs) of a network N that captures relationships between the relevant variables, and that they differ only in the priors they assign to the root nodes of N. If N is the Bayes net in Figure 2a, then we assume that the two people must agree on the distribution P(D|H), although they may have different priors P1(H) and P2(H). If two people agree on network N but have different priors on the root nodes, we can create a single expanded Bayes net to simulate the inferences of both individuals. The expanded network is created by adding a background knowledge node B that sends directed edges to all root nodes in N, and acts as a switch that sets different root node priors for the two different individuals. Given this expanded network, distributions P1 and P2 in Equation 1 can be recovered by conditioning on the value of the background knowledge node, and Equation 1 can be rewritten as

[P(h1|d, b1) − P(h1|b1)] [P(h1|d, b2) − P(h1|b2)] < 0    (2)

where P(·) represents the probability distribution captured by the expanded network. Suppose that there are exactly two mutually exclusive hypotheses. For example, h1 and h0 might state that the death penalty does or does not deter crime. In this case Equation 2 implies that contrary updating occurs when

[P(d|h1, b1) − P(d|h0, b1)] [P(d|h1, b2) − P(d|h0, b2)] < 0.    (3)

Equation 3 is derived in the supporting material, and leads immediately to the following result: R1: If H is a binary variable and D and B are conditionally independent given H, then contrary updating is impossible. Result R1 follows from the observation that if D and B are conditionally independent given H, then the product in Equation 3 is equal to (P(d|h1) − P(d|h0))², which cannot be less than zero. R1 implies that the simple Bayes net in Figure 2a is incapable of producing contrary updating, an observation previously made by Lopes [11]. Our analysis may help to explain the common intuition that belief divergence is irrational, since many researchers seem to implicitly adopt a model in which H and D are the only relevant variables. Network 2a, however, is too simple to capture the causal relationships that are present in many real world situations. 
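A quick numerical check of R1 and Equation 3 for the simple network of Figure 2a: the sketch below gives two observers the same likelihood P(D|H) but different priors (the particular numbers are arbitrary assumptions) and confirms that their belief changes share the same sign, since the product in Equation 3 reduces to (P(d|h1) − P(d|h0))².

def posterior_h1(prior_h1, p_d_given_h1, p_d_given_h0):
    # Bayes' rule for a binary hypothesis in the network H -> D.
    joint1 = prior_h1 * p_d_given_h1
    joint0 = (1.0 - prior_h1) * p_d_given_h0
    return joint1 / (joint1 + joint0)

p_d_h1, p_d_h0 = 0.7, 0.3                     # shared likelihood (assumed values)
priors = {"person 1": 0.9, "person 2": 0.2}   # different priors on h1

changes = []
for name, prior in priors.items():
    post = posterior_h1(prior, p_d_h1, p_d_h0)
    changes.append(post - prior)
    print(name, round(prior, 3), "->", round(post, 3))

# With D independent of B given H, Equation 3 reduces to (P(d|h1) - P(d|h0))**2 >= 0,
# so the product of the two belief changes can never be negative.
print("product of belief changes:", changes[0] * changes[1])

Producing divergence therefore requires richer structure than network 2a, which is exactly what the three-node networks considered next provide.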
For example, the promotion example at the beginning of this paper is best captured using a network with an additional node that represents the grading scale for the aptitude test. Networks with many nodes may be needed for some real world problems, but here we explore the space of three-node networks. We restrict our attention to connected graphs in which D has no outgoing edges, motivated by the idea that the three variables should be linked and that the data are the final result of some generative process. The seven graphs that meet these conditions are shown in Figures 2b–h, where the additional variable has been labeled V. These Bayes nets illustrate cases in which (b) V is an additional piece of evidence that bears on H, (c) V informs the prior probability of H, (d)–(e) D is generated by an intervening variable V, (f) V is an additional generating factor of D, (g) V informs both the prior probability of H and the likelihood of D, and (h) H and D are both effects of V. The graphs in Figure 2 have been organized into two families. R1 implies that none of the graphs in Family 1 is capable of producing contrary updating. The next section demonstrates by example that all three of the graphs in Family 2 are capable of producing contrary updating.

Table 1: The first column represents the conventional wisdom about which belief revision phenomena are normative. The models in the remaining columns include all three-node Bayes nets. This set of models can be partitioned into those that support both belief divergence and convergence (Family 2) and those that support neither (Family 1). (Rows: belief divergence, belief convergence, parallel updating; columns: conventional wisdom, Family 1, Family 2.)

Table 1 compares the two families of Bayes nets to the informal conclusions about normative approaches that are often found in the psychological literature. As previously noted, the conventional wisdom holds that belief divergence is irrational but that convergence and parallel updating are both rational. Our analysis suggests that this position has little support. Depending on the causal structure of the problem under consideration, a rational approach should allow both divergence and convergence or neither. Although we focus in this paper on Bayes nets with no more than three nodes, the class of all network structures can be partitioned into those that can (Family 2) and cannot (Family 1) produce contrary updating. R1 is true for Bayes nets of any size and characterizes one group of networks that belong to Family 1. Networks where the data provide no information about the hypotheses must also fail to produce contrary updating. Note that if D and H are conditionally independent given B, then the left side of Equation 3 is equal to zero, meaning contrary updating cannot occur. We conjecture that all remaining networks can produce contrary updating if the cardinalities of the nodes and the CPDs are chosen appropriately. Future studies can attempt to verify this conjecture and to precisely characterize the CPDs that lead to contrary updating. 3 Examples of rational belief divergence We now present four scenarios that can be modeled by the three-node Bayes nets in Family 2. Our purpose in developing these examples is to demonstrate that these networks can produce belief divergence and to provide some everyday examples in which this behavior is both normative and intuitive. 3.1 Example 1: Promotion We first consider a scenario that can be captured by Bayes net 2f, in which the data depend on two independent factors. 
Recall the scenario described at the beginning of this paper: Alice and Bob are responsible for deciding whether to promote Carol. For simplicity, we consider a case where the data represent a binary outcome—whether or not Carol’s résumé indicates that she is included in The Directory of Notable People—rather than her score on an aptitude test. Alice believes that The Directory is a reputable publication but Bob believes it is illegitimate. This situation is represented by the Bayes net and associated CPDs in Figure 3a. In the tables, the hypothesis space H = {‘Unqualified’ = 0, ‘Qualified’ = 1} represents whether or not Carol is qualified for the promotion, the additional factor V = {‘Disreputable’ = 0, ‘Reputable’ = 1} represents whether The Directory is a reputable publication, and the data variable D = {‘Not included’ = 0, ‘Included’ = 1} represents whether Carol is featured in it. The actual probabilities were chosen to reflect the fact that only an unqualified person is likely to pad their résumé by mentioning a disreputable publication, but that only a qualified person is likely to be included in The Directory if it is reputable.

Figure 3: The Bayes nets and conditional probability distributions used in (a) Example 1: Promotion, (b) Example 2: Religious belief, (c) Example 3: Election polls, (d) Example 4: Political belief.

Note that Alice and Bob agree on the conditional probability distribution for D, but assign different priors to V and H. Alice and Bob therefore interpret the meaning of Carol’s presence in The Directory differently, resulting in the belief divergence shown in Figure 4a. This scenario is one instance of a large number of belief divergence cases that can be attributed to two individuals possessing different mental models of how the observed evidence was generated. For instance, suppose now that Alice and Bob are both on an admissions committee and are evaluating a recommendation letter for an applicant. Although the letter is positive, it is not enthusiastic. Alice, who has less experience reading recommendation letters, interprets the letter as a strong endorsement. Bob, however, takes the lack of enthusiasm as an indication that the author has some misgivings [12]. As in the promotion scenario, the differences in Alice’s and Bob’s experience can be effectively represented by the priors they assign to the H and V nodes in a Bayes net of the form in Figure 2f. 3.2 Example 2: Religious belief We now consider a scenario captured by Bayes net 2g. In our example for Bayes net 2f, the status of an additional factor V affected how Alice and Bob interpreted the data D, but did not shape their prior beliefs about H. In many cases, however, the additional factor V will influence both people’s prior beliefs about H as well as their interpretation of the relationship between D and H. Bayes net 2g captures this situation, and we provide a concrete example inspired by an experiment conducted by Batson [13]. 
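A small enumeration makes the promotion example concrete. Because Figure 3a is not reproduced here, the CPD values below are an illustrative assignment consistent with the verbal description (Alice trusts The Directory, Bob does not; inclusion is likely only for a qualified person in a reputable directory or for an unqualified person padding a résumé with a disreputable one), and the sketch verifies that the same observation pushes the two posteriors in opposite directions.

from itertools import product

# P(D=1 | V, H): V = Directory reputable, H = Carol qualified, D = Carol included (assumed values).
p_d1 = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}

# Assumed priors: Alice trusts The Directory, Bob does not.
observers = {
    "Alice": {"pV1": 0.9, "pH1": 0.6},
    "Bob":   {"pV1": 0.01, "pH1": 0.4},
}

def posterior_h1_given_d1(pV1, pH1):
    # Enumerate V and H and condition on D = 1 (Carol is listed in The Directory).
    joint = {0: 0.0, 1: 0.0}
    for v, h in product((0, 1), repeat=2):
        pv = pV1 if v else 1.0 - pV1
        ph = pH1 if h else 1.0 - pH1
        joint[h] += pv * ph * p_d1[(v, h)]
    return joint[1] / (joint[0] + joint[1])

for name, prm in observers.items():
    post = posterior_h1_given_d1(prm["pV1"], prm["pH1"])
    print(name, prm["pH1"], "->", round(post, 3))
# Alice's belief that Carol is qualified rises while Bob's falls: belief divergence.

The same machinery applies to the religious-belief scenario introduced next.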
Suppose that Alice believes in a “Christian universe”: she believes in the divinity of Jesus Christ and expects that followers of Christ will be persecuted. Bob, on the other hand, believes in a “secular universe.” This belief leads him to doubt Christ’s divinity, but to believe that if Christ were divine, his followers would likely be protected rather than persecuted. Now suppose that both Alice and Bob observe that Christians are, in fact, persecuted, and reassess the probability of Christ’s divinity. This situation is represented by the Bayes net and associated CPDs in Figure 3b. In the tables, the hypothesis space H = {‘Human’ = 0, ‘Divine’ = 1} represents the divinity of Jesus Christ, the additional factor V = {‘Secular’ = 0, ‘Christian’ = 1} represents the nature of the universe, and the data variable D = {‘Not persecuted’ = 0, ‘Persecuted’ = 1} represents whether Christians are subject to persecution. The exact probabilities were chosen to reflect the fact that, regardless of worldview, people will agree on a “base rate” of persecution given that Christ is not divine, but that more persecution is expected if the Christian worldview is correct than if the secular worldview is correct. Unlike in the previous scenario, Alice and Bob agree on the CPDs for both D and H, but differ in the priors they assign to V.

Figure 4: Belief revision outcomes for (a) Example 1: Promotion, (b) Example 2: Religious belief, (c) Example 3: Election polls, and (d) Example 4: Political belief. The y-axis shows P(H = 1). In all four plots, the updated beliefs for Alice (solid line) and Bob (dashed line) are computed after observing the data described in the text. The plots confirm that all four of our example networks can lead to belief divergence.

As a result, Alice and Bob disagree about whether persecution supports or undermines a Christian worldview, which leads to the divergence shown in Figure 4b. This scenario is analogous to many real world situations in which one person has knowledge that the other does not. For instance, in a police interrogation, someone with little knowledge of the case (V) might take a suspect’s alibi (D) as strong evidence of their innocence (H). However, a detective with detailed knowledge of the case may assign a higher prior probability to the suspect’s guilt based on other circumstantial evidence, and may also notice a detail in the suspect’s alibi that only the culprit would know, thus making the statement strong evidence of guilt. In all situations of this kind, although two people possess different background knowledge, their inferences are normative given that knowledge, consistent with the Bayes net in Figure 2g. 3.3 Example 3: Election polls We now consider two qualitatively different cases that are both captured by Bayes net 2h. The networks considered so far have all included a direct link between H and D. In our next two examples, we consider cases where the hypotheses and observed data are not directly linked, but are coupled by means of one or more unobserved causal factors. Suppose that an upcoming election will be contested by two Republican candidates, Rogers and Rudolph, and two Democratic candidates, Davis and Daly. Alice and Bob disagree about the various candidates’ chances of winning, with Alice favoring the two Republicans and Bob favoring the two Democrats. 
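The religious-belief example (network 2g) admits the same kind of check. Again the numbers are assumed values consistent with the verbal description rather than the entries of Figure 3b: a shared base rate of persecution when Christ is not divine, high expected persecution of a divine Christ's followers under the Christian worldview, and low expected persecution under the secular one.

from itertools import product

p_h1_given_v = {0: 0.1, 1: 0.9}  # P(Divine | worldview): assumed values
p_d1 = {(0, 0): 0.4, (0, 1): 0.01, (1, 0): 0.4, (1, 1): 0.6}  # P(Persecuted | V, H): assumed values
priors_v1 = {"Alice": 0.9, "Bob": 0.1}  # assumed P(Christian universe) for each observer

for name, pV1 in priors_v1.items():
    prior_h1 = sum((pV1 if v else 1.0 - pV1) * p_h1_given_v[v] for v in (0, 1))
    joint = {0: 0.0, 1: 0.0}
    for v, h in product((0, 1), repeat=2):
        pv = pV1 if v else 1.0 - pV1
        ph = p_h1_given_v[v] if h else 1.0 - p_h1_given_v[v]
        joint[h] += pv * ph * p_d1[(v, h)]
    post_h1 = joint[1] / (joint[0] + joint[1])
    print(name, round(prior_h1, 3), "->", round(post_h1, 3))
# Observing persecution raises Alice's belief in divinity and lowers Bob's, as in Figure 4b.

The election-poll scenario described next can be treated the same way, with a four-valued V representing the winning candidate.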
Two polls were recently released, one indicating that Rogers was most likely to win the election and the other indicating that Daly was most likely to win. After considering these polls, they both assess the likelihood that a Republican will win the election. This situation is represented by the Bayes net and associated CPDs in Figure 3c. In the tables, the hypothesis space H = {‘Democrat wins’ = 0, ‘Republican wins’ = 1} represents the winning party, the variable V = {‘Rogers’ = 0, ‘Rudolph’ = 1, ‘Davis’ = 2, ‘Daly’ = 3} represents the winning candidate, and the data variables D1 = D2 = {‘Rogers’ = 0, ‘Rudolph’ = 1, ‘Davis’ = 2, ‘Daly’ = 3} represent the results of the two polls. The exact probabilities were chosen to reflect the fact that the polls are likely to reflect the truth with some noise, but whether a Democrat or Republican wins is completely determined by the winning candidate V . In Figure 3c, only a single D node is shown because D1 and D2 have identical CPDs. The resulting belief divergence is shown in Figure 4c. Note that in this scenario, Alice’s and Bob’s different priors cause them to discount the poll that disagrees with their existing beliefs as noise, thus causing their prior beliefs to be reinforced by the mixed data. This scenario was inspired by the death penalty study [1] alluded to earlier, in which a set of mixed results caused supporters and opponents of the death penalty to strengthen their existing beliefs. We do not claim that people’s behavior in this study can be explained with exactly the model employed here, but our analysis does show that selective interpretation of evidence is sometimes consistent with a rational approach. 6 3.4 Example 4: Political belief We conclude with a second illustration of Bayes net 2h in which two people agree on the interpretation of an observed piece of evidence but disagree about the implications of that evidence. In this scenario, Alice and Bob are two economists with different philosophies about how the federal government should approach a major recession. Alice believes that the federal government should increase its own spending to stimulate economic activity; Bob believes that the government should decrease its spending and reduce taxes instead, providing taxpayers with more spending money. A new bill has just been proposed and an independent study found that the bill was likely to increase federal spending. Alice and Bob now assess the likelihood that this piece of legislation will improve the economic climate. This scenario can be modeled by the Bayes net and associated CPDs in Figure 3d. In the tables, the hypothesis space H = {‘Bad policy’ = 0, ‘Good policy’ = 1} represents whether the new bill is good for the economy and the data variable D = {‘No spending’ = 0, ‘Spending increase’ = 1} represents the conclusions of the independent study. Unlike in previous scenarios, we introduce two additional factors, V 1 = {‘Fiscally conservative’ = 0, ‘Fiscally liberal’ = 1}, which represents the optimal economic philosophy, and V 2 = {‘No spending’ = 0, ‘Spending increase’ = 1}, which represents the spending policy of the new bill. The exact probabilities in the tables were chosen to reflect the fact that if the bill does not increase spending, the policy it enacts may still be good for other reasons. A uniform prior was placed on V 2 for both people, reflecting the fact that they have no prior expectations about the spending in the bill. 
However, the priors placed on V 1 for Alice and Bob reflect their different beliefs about the best economic policy. The resulting belief divergence behavior is shown in Figure 4d. The model used in this scenario bears a strong resemblance to the probabilogical model of attitude change developed by McGuire [14], in which V 1 and V 2 might be logical “premises” that entail the “conclusion” H. 4 How common is contrary updating? We have now described four concrete cases where belief divergence is captured by a normative approach. It is possible, however, that belief divergence is relatively rare within the Bayes nets of Family 2, and that our four examples are exotic special cases that depend on carefully selected CPDs. To rule out this possibility, we ran simulations to explore the space of all possible CPDs for the three networks in Family 2. We initially considered cases where H, D, and V were binary variables, and ran two simulations for each model. In one simulation, the priors and each row of each CPD were sampled from a symmetric Beta distribution with parameter 0.1, resulting in probabilities highly biased toward 0 and 1. In the second simulation, the probabilities were sampled from a uniform distribution. In each trial, a single set of CPDs was generated and then two different priors were generated for each root node in the graph to simulate two individuals, consistent with our assumption that two individuals may have different priors but must agree about the conditional probabilities. 20,000 trials were carried out in each simulation, and the proportion of trials that led to convergence and divergence was computed. Trials were only counted as instances of convergence or divergence if |P(H = 1|D = 1) − P(H = 1)| > ε for both individuals, with ε = 1 × 10^−5. The results of these simulations are shown in Table 2. The supporting material proves that divergence and convergence are equally common, and therefore the percentages in the table show the frequencies for contrary updating of either type. Our primary question was whether contrary updating is rare or anomalous. In all but the third simulation, contrary updating constituted a substantial proportion of trials, suggesting that the phenomenon is relatively generic. We were also interested in whether this behavior relied on particular settings of the CPDs. The fact that percentages for the uniform distribution are approximately the same or greater than for the biased distribution indicates that contrary updating appears to be a relatively generic behavior for the Bayes nets we considered. More generally, these results directly challenge the suggestion that normative accounts are not suited for modeling belief divergence. The last two columns of Table 2 show results for two simulations with the same Bayes net, the only difference being whether V was treated as 2-valued (binary) or 4-valued. The 4-valued case is included because both Examples 3 and 4 considered multi-valued additional factor variables V.

          2-valued V   2-valued V   2-valued V   4-valued V
Biased    9.6%         12.7%        0%           23.3%
Uniform   18.2%        16.0%        0%           20.0%

Table 2: Simulation results. The percentages indicate the proportion of trials that produced contrary updating using the specified Bayes net (column) and probability distributions (row). The prior and conditional probabilities were either sampled from a Beta(0.1, 0.1) distribution (biased) or a Beta(1, 1) distribution (uniform). 
The probabilities for the simulation results shown in the last column were sampled from a Dirichlet([0.1, 0.1, 0.1, 0.1]) distribution (biased) or a Dirichlet([1, 1, 1, 1]) distribution (uniform). In Example 4, we used two binary variables, but we could have equivalently used a single 4-valued variable. Belief convergence and divergence are not possible in the binary case, a result that is proved in the supporting material. We believe, however, that convergence and divergence are fairly common whenever V takes three or more values, and the simulation in the last column of the table confirms this claim for the 4-valued case. Given that belief divergence seems relatively common in the space of all Bayes nets, it is natural to explore whether cases of rational divergence are regularly encountered in the real world. One possible approach is to analyze a large database of networks that capture everyday belief revision problems, and to determine what proportion of networks lead to rational divergence. Future studies can explore this issue, but our simulations suggest that contrary updating is likely to arise in cases where it is necessary to move beyond a simple model like the one in Figure 2a and consider several causal factors. 5 Conclusion This paper presented a family of Bayes nets that can account for belief divergence, a phenomenon that is typically considered to be incompatible with normative accounts. We provided four concrete examples that illustrate how this family of networks can capture a variety of settings where belief divergence can emerge from rational statistical inference. We also described a series of simulations that suggest that belief divergence is not only possible but relatively common within the family of networks that we considered. Our work suggests that belief polarization should not always be taken as evidence of irrationality, and that researchers who aim to document departures from rationality may wish to consider alternative phenomena instead. One such phenomenon might be called “inevitable belief reinforcement” and occurs when supporters of a hypothesis update their belief in the same direction for all possible data sets d. For example, a gambler will demonstrate inevitable belief reinforcement if he or she becomes increasingly convinced that a roulette wheel is biased towards red regardless of whether the next spin produces red, black, or green. This phenomenon is provably inconsistent with any fully Bayesian approach, and therefore provides strong evidence of irrationality. Although we propose that some instances of polarization are compatible with a Bayesian approach, we do not claim that human inferences are always or even mostly rational. We suggest, however, that characterizing normative behavior can require careful thought, and that formal analyses are invaluable for assessing the rationality of human inferences. In some cases, a formal analysis will provide an appropriate baseline for understanding how human inferences depart from rational norms. In other cases, a formal analysis will suggest that an apparently irrational inference makes sense once all of the relevant information is taken into account. 8 References [1] C. G. Lord, L. Ross, and M. R. Lepper. Biased assimilation and attitude polarization: The effects of prior theories on subsequently considered evidence. Journal of Personality and Social Psychology, 37(1):2098–2109, 1979. [2] L. Ross and M. R. Lepper. The perseverance of beliefs: Empirical and normative considerations. 
In New directions for methodology of social and behavioral science: Fallible judgment in behavioral research. Jossey-Bass, San Francisco, 1980. [3] J. Baron. Thinking and Deciding. Cambridge University Press, Cambridge, 4th edition, 2008. [4] A. Gerber and D. Green. Misperceptions about perceptual bias. Annual Review of Political Science, 2:189–210, 1999. [5] M. Oaksford and N. Chater. A rational analysis of the selection task as optimal data selection. Psychological Review, 101(4):608–631, 1994. [6] U. Hahn and M. Oaksford. The rationality of informal argumentation: A Bayesian approach to reasoning fallacies. Psychological Review, 114(3):704–732, 2007. [7] S. Sher and C. R. M. McKenzie. Framing effects and rationality. In N. Chater and M. Oaksford, editors, The probablistic mind: Prospects for Bayesian cognitive science. Oxford University Press, Oxford, 2008. [8] B. O’Connor. Biased evidence assimilation under bounded Bayesian rationality. Master’s thesis, Stanford University, 2006. [9] A. Zimper and A. Ludwig. Attitude polarization. Technical report, Mannheim Research Institute for the Economics of Aging, 2007. [10] A. K. Dixit and J. W. Weibull. Political polarization. Proceedings of the National Academy of Sciences, 104(18):7351–7356, 2007. [11] L. L. Lopes. Averaging rules and adjustment processes in Bayesian inference. Bulletin of the Psychonomic Society, 23(6):509–512, 1985. [12] A. Harris, A. Corner, and U. Hahn. “Damned by faint praise”: A Bayesian account. In A. D. De Groot and G. Heymans, editors, Proceedings of the 31th Annual Conference of the Cognitive Science Society, Austin, TX, 2009. Cognitive Science Society. [13] C. D. Batson. Rational processing or rationalization? The effect of disconfirming information on a stated religious belief. Journal of Personality and Social Psychology, 32(1):176–184, 1975. [14] W. J. McGuire. The probabilogical model of cognitive structure and attitude change. In R. E. Petty, T. M. Ostrom, and T. C. Brock, editors, Cognitive Responses in Persuasion. Lawrence Erlbaum Associates, 1981. 9
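To make the Section 4 simulation above concrete, here is a minimal sketch that samples shared CPDs and person-specific priors for one Family 2 structure (network 2f, where V and H are independent causes of D, all binary) and counts how often the contrary updating condition of Equation 1 is satisfied. The network choice, trial count, and threshold follow the description only loosely and are assumptions rather than the authors' exact procedure.

import numpy as np

rng = np.random.default_rng(0)

def run_trials(beta_param, n_trials=20000, eps=1e-5):
    contrary = 0
    for _ in range(n_trials):
        # Shared CPD P(D=1 | V, H): one Bernoulli parameter per parent configuration.
        p_d1 = {(v, h): rng.beta(beta_param, beta_param) for v in (0, 1) for h in (0, 1)}
        changes = []
        for _person in range(2):
            pV1 = rng.beta(beta_param, beta_param)  # person-specific priors on the root nodes
            pH1 = rng.beta(beta_param, beta_param)
            joint = {0: 0.0, 1: 0.0}
            for v in (0, 1):
                for h in (0, 1):
                    pv = pV1 if v else 1.0 - pV1
                    ph = pH1 if h else 1.0 - pH1
                    joint[h] += pv * ph * p_d1[(v, h)]
            post = joint[1] / (joint[0] + joint[1])
            changes.append(post - pH1)
        if abs(changes[0]) > eps and abs(changes[1]) > eps and changes[0] * changes[1] < 0:
            contrary += 1
    return contrary / n_trials

print("biased (Beta(0.1, 0.1)):", run_trials(0.1))
print("uniform (Beta(1, 1))   :", run_trials(1.0))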
4 0.60469848 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing
Author: Ruben Coen-cagli, Peter Dayan, Odelia Schwartz
Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor–like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non–linear correlations between the responses of spatially distributed Gaborlike receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. 1
5 0.5989759 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
Author: Ruslan Salakhutdinov
Abstract: Markov random fields (MRF’s), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRF’s is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large, densely-connected MRF’s. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data that perform well on digit and object recognition tasks.
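As a rough illustration of the stochastic approximation idea summarized in this abstract, the sketch below performs maximum likelihood learning in a tiny fully visible Boltzmann machine, estimating the model expectations from persistent Gibbs chains; plain Gibbs sweeps stand in for the tempered transitions proposed in the paper, and the data, model size, and learning schedule are all assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_chains, lr = 5, 20, 0.05

# Toy binary training data (assumed): make units 0 and 1 perfectly correlated.
data = (rng.random((500, n_vis)) < 0.5).astype(float)
data[:, 1] = data[:, 0]

W = np.zeros((n_vis, n_vis))                                  # symmetric couplings, zero diagonal
chains = (rng.random((n_chains, n_vis)) < 0.5).astype(float)  # persistent Markov chains

def gibbs_sweep(states, W):
    # One full sweep of single-site Gibbs updates on every chain.
    for i in range(n_vis):
        p = 1.0 / (1.0 + np.exp(-(states @ W[:, i])))
        states[:, i] = (rng.random(len(states)) < p).astype(float)
    return states

for step in range(2000):
    batch = data[rng.integers(0, len(data), size=64)]
    pos = batch.T @ batch / len(batch)       # pairwise statistics under the data
    chains = gibbs_sweep(chains, W)
    neg = chains.T @ chains / n_chains       # pairwise statistics under the model (persistent chains)
    grad = pos - neg
    np.fill_diagonal(grad, 0.0)
    W += lr * grad                           # Robbins-Monro style parameter update
    lr *= 0.999                              # slowly decreasing step size

print(W[0, 1], W[0, 3])  # the coupling for the correlated pair should be clearly larger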
6 0.54025042 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities
7 0.53314948 25 nips-2009-Adaptive Design Optimization in Experiments with People
8 0.53130579 235 nips-2009-Structural inference affects depth perception in the context of potential occlusion
9 0.52319312 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
10 0.52171046 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
11 0.5168972 115 nips-2009-Individuation, Identification and Object Discovery
12 0.50501573 187 nips-2009-Particle-based Variational Inference for Continuous Systems
14 0.49219397 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information
15 0.48922357 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
16 0.480537 164 nips-2009-No evidence for active sparsification in the visual cortex
17 0.47940892 194 nips-2009-Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory
18 0.47123677 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization
19 0.46432966 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis
20 0.45571417 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
topicId topicWeight
[(21, 0.018), (24, 0.022), (25, 0.08), (32, 0.014), (35, 0.054), (36, 0.082), (37, 0.013), (39, 0.103), (55, 0.012), (58, 0.067), (68, 0.228), (71, 0.071), (81, 0.028), (86, 0.065), (91, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.82720822 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference
Author: Samuel Gershman, Ed Vul, Joshua B. Tenenbaum
Abstract: While many perceptual and cognitive phenomena are well described in terms of Bayesian inference, the necessary computations are intractable at the scale of realworld tasks, and it remains unclear how the human mind approximates Bayesian computations algorithmically. We explore the proposal that for some tasks, humans use a form of Markov Chain Monte Carlo to approximate the posterior distribution over hidden variables. As a case study, we show how several phenomena of perceptual multistability can be explained as MCMC inference in simple graphical models for low-level vision. 1
2 0.75493765 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs
Author: Bernhard Nessler, Michael Pfeiffer, Wolfgang Maass
Abstract: The principles by which spiking neurons contribute to the astounding computational power of generic cortical microcircuits, and how spike-timing-dependent plasticity (STDP) of synaptic weights could generate and maintain this computational function, are unknown. We show here that STDP, in conjunction with a stochastic soft winner-take-all (WTA) circuit, induces spiking neurons to generate through their synaptic weights implicit internal models for subclasses (or “causes”) of the high-dimensional spike patterns of hundreds of pre-synaptic neurons. Hence these neurons will fire after learning whenever the current input best matches their internal model. The resulting computational function of soft WTA circuits, a common network motif of cortical microcircuits, could therefore be a drastic dimensionality reduction of information streams, together with the autonomous creation of internal models for the probability distributions of their input patterns. We show that the autonomous generation and maintenance of this computational function can be explained on the basis of rigorous mathematical principles. In particular, we show that STDP is able to approximate a stochastic online Expectation-Maximization (EM) algorithm for modeling the input data. A corresponding result is shown for Hebbian learning in artificial neural networks. 1
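A crude, non-spiking caricature of the learning principle described in this abstract: a soft winner-take-all layer whose Hebbian-style weight updates implement online EM for a mixture of Bernoullis, so each unit comes to model one hidden cause of its inputs. The input generator, number of units, and learning rate are assumptions; this is not the paper's STDP rule or its spiking circuit.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_units, lr = 20, 3, 0.05

# Assumed data: binary inputs generated from three hidden "causes" (noisy prototype patterns).
prototypes = rng.random((3, n_in)) < 0.5
def sample_input():
    k = rng.integers(3)
    flip = rng.random(n_in) < 0.1
    return np.where(flip, ~prototypes[k], prototypes[k]).astype(float)

p = np.full((n_units, n_in), 0.5)      # each unit's Bernoulli model of its cause
pi = np.full(n_units, 1.0 / n_units)   # mixing weights (unit excitabilities)

for _ in range(5000):
    x = sample_input()
    # Soft WTA: posterior responsibility of each unit for the current input.
    log_r = np.log(pi) + x @ np.log(p).T + (1.0 - x) @ np.log(1.0 - p).T
    r = np.exp(log_r - log_r.max())
    r /= r.sum()
    # Hebbian-flavoured online EM updates.
    p += lr * r[:, None] * (x[None, :] - p)
    pi += lr * (r - pi)

print(np.round(p, 2))  # rows should approximate the three prototypes (up to permutation)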
3 0.67569453 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions
Author: Matthias Bethge, Eero P. Simoncelli, Fabian H. Sinz
Abstract: We introduce a new family of distributions, called Lp -nested symmetric distributions, whose densities are expressed in terms of a hierarchical cascade of Lp norms. This class generalizes the family of spherically and Lp -spherically symmetric distributions which have recently been successfully used for natural image modeling. Similar to those distributions it allows for a nonlinear mechanism to reduce the dependencies between its variables. With suitable choices of the parameters and norms, this family includes the Independent Subspace Analysis (ISA) model as a special case, which has been proposed as a means of deriving filters that mimic complex cells found in mammalian primary visual cortex. Lp -nested distributions are relatively easy to estimate and allow us to explore the variety of models between ISA and the Lp -spherically symmetric models. By fitting the generalized Lp -nested model to 8 × 8 image patches, we show that the subspaces obtained from ISA are in fact more dependent than the individual filter coefficients within a subspace. When first applying contrast gain control as preprocessing, however, there are no dependencies left that could be exploited by ISA. This suggests that complex cell modeling can only be useful for redundancy reduction in larger image patches. 1
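To illustrate the cascaded norm this abstract refers to, the sketch below evaluates an Lp-nested function on a hand-picked two-level tree; the tree shape and exponents are arbitrary assumptions rather than values fitted to image patches.

import numpy as np

def lp_norm(values, p):
    values = np.asarray(values, dtype=float)
    return float((np.abs(values) ** p).sum() ** (1.0 / p))

def lp_nested(x, tree):
    # tree = (p, children); a child is either an index into x or another (p, children) node.
    p, children = tree
    parts = [abs(x[c]) if isinstance(c, int) else lp_nested(x, c) for c in children]
    return lp_norm(parts, p)

# f(x) = ( |x0|^2 + ( |x1|^1.3 + |x2|^1.3 )^(2/1.3) )^(1/2)
tree = (2.0, [0, (1.3, [1, 2])])
x = np.array([1.0, -2.0, 0.5])
print(lp_nested(x, tree))
# With all exponents equal, the cascade collapses to an ordinary Lp norm:
print(lp_nested(x, (2.0, [0, (2.0, [1, 2])])), lp_norm(x, 2.0))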
4 0.63438886 154 nips-2009-Modeling the spacing effect in sequential category learning
Author: Hongjing Lu, Matthew Weiden, Alan L. Yuille
Abstract: We develop a Bayesian sequential model for category learning. The sequential model updates two category parameters, the mean and the variance, over time. We define conjugate temporal priors to enable closed form solutions to be obtained. This model can be easily extended to supervised and unsupervised learning involving multiple categories. To model the spacing effect, we introduce a generic prior in the temporal updating stage to capture a learning preference, namely, less change for repetition and more change for variation. Finally, we show how this approach can be generalized to efficiently perform model selection to decide whether observations are from one or multiple categories.
5 0.63134426 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization
Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov
Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.
6 0.62830877 133 nips-2009-Learning models of object structure
8 0.6213854 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
9 0.62067103 110 nips-2009-Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions
10 0.61937422 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships
11 0.61893415 112 nips-2009-Human Rademacher Complexity
12 0.61851519 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
13 0.61812526 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
14 0.61786491 226 nips-2009-Spatial Normalized Gamma Processes
15 0.61608493 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA
16 0.61443555 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition
17 0.61413026 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
18 0.61356878 251 nips-2009-Unsupervised Detection of Regions of Interest Using Iterative Link Analysis
19 0.61255437 115 nips-2009-Individuation, Identification and Object Discovery
20 0.61090559 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity