nips nips2005 nips2005-35 knowledge-graph by maker-knowledge-mining

35 nips-2005-Bayesian model learning in human visual perception


Source: pdf

Author: Gergő Orbán, Jozsef Fiser, Richard N. Aslin, Máté Lengyel

Abstract: Humans make optimal perceptual decisions in noisy and ambiguous conditions. Computations underlying such optimal behavior have been shown to rely on probabilistic inference according to generative models whose structure is usually taken to be known a priori. We argue that Bayesian model selection is ideal for inferring similar and even more complex model structures from experience. We find in experiments that humans learn subtle statistical properties of visual scenes in a completely unsupervised manner. We show that these findings are well captured by Bayesian model learning within a class of models that seek to explain observed variables by independent hidden causes. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Bayesian model learning in human visual perception Gergő Orbán, Collegium Budapest Institute for Advanced Study, 2 Szentháromság utca, Budapest, 1014 Hungary, ogergo@colbud. [sent-1, score-0.195]

2 Computations underlying such optimal behavior have been shown to rely on probabilistic inference according to generative models whose structure is usually taken to be known a priori. [sent-10, score-0.117]

3 We argue that Bayesian model selection is ideal for inferring similar and even more complex model structures from experience. [sent-11, score-0.068]

4 We find in experiments that humans learn subtle statistical properties of visual scenes in a completely unsupervised manner. [sent-12, score-0.254]

5 We show that these findings are well captured by Bayesian model learning within a class of models that seek to explain observed variables by independent hidden causes. [sent-13, score-0.117]

6 These studies demonstrated that human observers parse sensory scenes by performing optimal estimation of the parameters of the objects involved [3, 4, 5]. [sent-15, score-0.26]

7 A core element of this Bayesian probabilistic framework is an internal model of the world, the generative model, that serves as a basis for inference. [sent-17, score-0.12]

8 In principle, inference can be performed on several levels: the generative model can be used for inferring the values of hidden variables from observed information, but also the model itself may be inferred from previous experience [7]. [sent-18, score-0.215]

9 Most previous studies testing the Bayesian framework in human psychophysical experiments used highly restricted generative models of perception, usually consisting of a few observed and latent variables, of which only a limited number of parameters needed to be adjusted by experience. [sent-19, score-0.261]

10 More importantly, the generative models considered in these studies were tailor-made to the specific psychophysical task presented in the experiment. [sent-20, score-0.12]

11 Thus, it remains to be shown whether more flexible, ‘open-ended’ generative models are used and learned by humans during perception. [sent-21, score-0.189]

12 This process leads to the extraction of independent causes that efficiently and sufficiently account for sensory experience, without a pre-specification of the number or complexity of potential causes. [sent-24, score-0.105]

13 Next, the mathematical framework is presented that is used to study model learning in SBNs (Section 3). [sent-26, score-0.092]

14 In Section 4, experimental results on human performance are compared to the prediction of our Bayes-optimal model learning in the SBN framework. [sent-27, score-0.136]

15 All the presented human experimental results were reproduced and had identical roots in our simulations: the modal model developed latent variables corresponding to the unknown underlying causes that generated the training scenes. [sent-28, score-0.391]

16 2 Experimental paradigm Human adult subjects were trained and then tested in an unsupervised learning paradigm with a set of complex visual scenes consisting of 6 of 12 abstract unfamiliar black shapes arranged on a 3x3 (Exp 1) or 5x5 (Exps 2-4) white grid (Fig. [sent-32, score-0.531]

17 Unbeknownst to subjects, various subsets of the shapes were arranged into fixed spatial combinations (combos) (doublets, triplets, quadruplets, depending on the experiment). [sent-34, score-0.273]

18 Whenever a combo appeared on a training scene, its constituent shapes were presented in an invariant spatial arrangement, and in no scene could elements of a combo appear without all the other elements of the same combo also appearing. [sent-35, score-1.24]

19 Subjects were presented with 100–200 training scenes; each scene was presented for 2 seconds with a 1-second pause between scenes. [sent-36, score-0.225]

20 No specific instructions were given to subjects prior to training; they were only asked to pay attention to the continuous sequence of scenes. [sent-37, score-0.089]

21 The test phase consisted of 2AFC trials, in which two arrangements of shapes were shown sequentially in the same grid that was used in the training, and subjects were asked which of the two scenes was more familiar based on the training. [sent-38, score-0.495]

22 One of the presented scenes was either a combo that was actually used for constructing the training set (true combo), or a part of it (embedded combo) (e. [sent-39, score-0.443]

23 , a pair of adjacent shapes from a triplet or quadruplet combo). [sent-41, score-0.404]

24 The other scene consisted of the same number of shapes as the first scene in an arrangement that might or might not have occurred during training, but was in fact a mixture of shapes from different true combos (mixture combo). [sent-42, score-1.263]
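
The construction of such scenes can be illustrated with a short sketch. Everything below (the combo definitions, grid size, and placement loop) is a hypothetical reconstruction of the paradigm described above, not the authors' stimulus-generation code:

```python
import random

GRID = 5  # 5x5 grid as in Experiments 2-4 (3x3 in Experiment 1)

# Hypothetical combos: each shape has a fixed position relative to the combo's anchor,
# so whenever the combo appears, its internal spatial arrangement is invariant.
COMBOS = {
    "A": [("a1", (0, 0)), ("a2", (0, 1))],                   # a doublet
    "B": [("b1", (0, 0)), ("b2", (1, 0)), ("b3", (1, 1))],   # a triplet
}

def place_combo(name, occupied):
    """Pick a random anchor such that the whole combo fits on the grid
    without overlapping already-occupied cells."""
    shapes = COMBOS[name]
    for _ in range(100):
        r0, c0 = random.randrange(GRID), random.randrange(GRID)
        cells = [(r0 + dr, c0 + dc) for _, (dr, dc) in shapes]
        if all(0 <= r < GRID and 0 <= c < GRID and (r, c) not in occupied
               for r, c in cells):
            return {shape: cell for (shape, _), cell in zip(shapes, cells)}
    return None

def make_scene(combo_names):
    """A training scene: all constituent shapes of every chosen combo appear,
    and no combo ever appears partially."""
    scene, occupied = {}, set()
    for name in combo_names:
        placed = place_combo(name, occupied)
        if placed is None:
            return None  # placement failed; a caller would simply retry
        scene.update(placed)
        occupied.update(placed.values())
    return scene

print(make_scene(["A", "B"]))  # e.g. {'a1': (2, 3), 'a2': (2, 4), 'b1': ...}
```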

25 Here four experiments are considered that assess various aspects of human observational learning. Figure 1: Experimental design (left panel) and explanation of graphical model parameters (right panel). [sent-43, score-0.42]

26 Our first goal was to establish that humans are sensitive to the statistical structure of visual experience, and use this experience for judging familiarity. [sent-47, score-0.193]

27 In the baseline experiment 6 doublet combos were defined, three of which were presented simultaneously in any given training scene, allowing 144 possible scenes [8]. [sent-48, score-0.992]

28 Because the doublets were not marked in any way, subjects saw only a group of random shapes arranged on a grid. [sent-49, score-0.815]

29 The occurrence frequency of doublets and individual elements was equal across the set of scenes, allowing no obvious bias to remember any element more than others. [sent-50, score-0.56]

30 In the test phase a true and a mixture doublet were presented sequentially in each 2AFC trial. [sent-51, score-0.401]

31 The mixture combo was presented in a spatial position that had never appeared before. [sent-52, score-0.391]

32 In the previous experiment the elements of mixture doublets occurred together fewer times than elements of real doublets; thus a simple strategy based on tracking co-occurrence frequencies of shape-pairs would be sufficient to distinguish between them. [sent-54, score-0.704]

33 The second, frequency-balanced experiment tested whether humans are sensitive to higher-order statistics (at least cross-correlations, which are co-occurrence frequencies normalized by the respective individual occurrence frequencies). [sent-55, score-0.244]

34 The structure of Experiment 1 was changed so that while the 6 doublet combo architecture remained, their appearance frequency became non-uniform, introducing frequent and rare combos. [sent-56, score-0.701]

35 Frequent doublets were presented twice as often as rare ones, so that certain mixture doublets consisting of shapes from frequent doublets appeared just as often as rare doublets. [sent-57, score-2.211]

36 Note that the frequency of the constituent shapes of these mixture doublets was higher than that of rare doublets. [sent-58, score-1.029]

37 The training session consisted of 212 scenes, each scene being presented twice. [sent-59, score-0.221]

38 In the test phase, the familiarity of both single shapes and doublet combos was tested. [sent-60, score-0.968]

39 In the doublet trials, rare combos with low appearance frequency but high correlations between elements were compared to mixed combos with higher element and equal pair appearance frequency, but lower correlations between elements. [sent-61, score-1.357]

40 The third experiment tested whether human performance in this paradigm can be fully accounted for by learning cross-correlations. [sent-63, score-0.229]

41 Here, four triplet combos were formed and presented with equal occurrence frequencies. [sent-64, score-0.671]

42 In the first type, the familiarity of a true triplet and a mixture triplet was compared, while in the second type doublets consisting of adjacent shapes embedded in a triplet combo (embedded doublet) were tested against mixture doublets. [sent-67, score-1.755]

43 The fourth experiment compared directly how humans treat embedded and independent (non-embedded) combos of the same spatial dimensions. [sent-69, score-0.839]

44 Here two quadruplet combos and two doublet combos were defined and presented with equal frequency. [sent-70, score-1.258]

45 Each training scene consisted of six shapes, one quadruplet and one doublet. [sent-71, score-0.238]

46 First, true quadruplets were compared to mixture quadruplets; next, embedded doublets were compared to mixture doublets; finally, true doublets were compared to mixture doublets. [sent-74, score-1.479]

47 3 Modeling framework The goal of Bayesian learning is to ‘reverse-engineer’ the generative model that could have generated the training data. [sent-75, score-0.21]

48 Because of inherent ambiguity and stochasticity assumed by the generative model itself, the objective is to establish a probability distribution over possible models. [sent-76, score-0.12]

49 3) will prefer the simplest model (in our case, the one with fewest parameters) that can effectively account for (generate) the training data due to the automatic Occam's razor (AOR) effect in Bayesian model comparison [7]. [sent-78, score-0.134]

50 Sigmoid belief networks The class of generative models we consider is that of two-layer sigmoid belief networks (SBNs, Fig. [sent-79, score-0.226]

51 Observed variables are independent conditioned on the latents (i. [sent-83, score-0.161]
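
A minimal sketch of the generative process in such a two-layer sigmoid belief network; the parameter names mirror the wx, wy, wij of Figure 1, but the specific sizes and values below are illustrative placeholders rather than the models fitted in the paper:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sample_sbn(wy, W, wx, rng):
    """One scene from a two-layer sigmoid belief network.
    wy : (K,)   biases of the K latent 'causes'
    W  : (K, N) weights from latents to the N observed shape variables
    wx : (N,)   biases of the observed shape variables
    Latents are sampled independently; observeds are independent given the latents."""
    y = (rng.random(len(wy)) < sigmoid(wy)).astype(float)        # causes on/off
    x = (rng.random(len(wx)) < sigmoid(y @ W + wx)).astype(int)  # shapes present/absent
    return y.astype(int), x

rng = np.random.default_rng(0)
K, N = 3, 8                  # e.g. 3 hidden causes over 8 shapes, as in the pilot runs
wy = np.full(K, -0.5)        # hypothetical latent biases
wx = np.full(N, -4.0)        # shapes rarely appear without an active cause
W = np.zeros((K, N))
W[0, :3] = 8.0               # cause 0 drives shapes 0-2
W[1, 3:5] = 8.0              # cause 1 drives shapes 3-4
W[2, 5:] = 8.0               # cause 2 drives shapes 5-7
y, x = sample_sbn(wy, W, wx, rng)
print("latents:", y, "observed shapes:", x)
```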

52 Although space is certainly not a negligible factor in vision in general, human behavior in the present experiments depended sufficiently strongly on which shapes appeared that this simplification did not cause major confounds in our results. [sent-89, score-0.078]

53 A second difference between the model and the human experiments was that in the experiments, combos were not presented completely randomly, because the number of combos per scene was fixed (and not binomially distributed as implied by the model, Eq. [sent-90, score-1.189]

54 Nevertheless, our goal was to demonstrate the use of a general-purpose class of generative models, and although truly independent causes are rare in natural circumstances, a fixed number of them always being present is even more so. [sent-92, score-0.267]

55 Clearly, humans are able to capture dependences between latent variables, and these should be modeled as well ([12]). [sent-93, score-0.178]

56 Similarly, for simplicity we also ignored that subsequent scenes are rarely independent (Eq. [sent-94, score-0.15]

57 Training: Establishing the posterior probability of any given model is straightforward using Bayes' rule: P(wm, m | D) ∝ P(D | wm, m) P(wm, m) (Eq. 4), where the first term is the likelihood of the model (Eq. [sent-96, score-0.091]

58 The prior over model structure preferred simple models and was such that the distributions of the number of latents and of the number of links conditioned on the number of latents were both Geometric (0. [sent-99, score-0.312]
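
A sketch of such a structure prior; the Geometric parameter used in the paper is truncated in the extracted sentence above, so the value below is only a placeholder:

```python
import math

def log_geometric(k, p):
    """log P(k) of a Geometric distribution over k = 0, 1, 2, ... with success parameter p."""
    return k * math.log(1.0 - p) + math.log(p)

def log_structure_prior(n_latents, n_links, p=0.5):
    """Prior over model structure: few latents are preferred, and given the number of
    latents, few latent-to-observed links are preferred.  p is a placeholder value."""
    return log_geometric(n_latents, p) + log_geometric(n_links, p)

print(log_structure_prior(n_latents=3, n_links=8))  # more complex structures score lower
```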

59 The effect of this preference is ‘washed out’ with increasing training length as the likelihood term (Eq. [sent-101, score-0.095]
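
This wash-out can be illustrated numerically. The per-scene log likelihoods and prior values below are invented for illustration only: the likelihood term grows with the number of training scenes, so a better-fitting but a-priori less favoured structure eventually wins.

```python
def log_posterior(n_scenes, avg_log_lik_per_scene, log_structure_prior):
    """Unnormalized log posterior (cf. Eq. 4): the likelihood term grows linearly with
    the amount of training data, so it eventually dominates the fixed structure prior."""
    return n_scenes * avg_log_lik_per_scene + log_structure_prior

# simple model: favoured prior (-2) but poorer fit; complex model: worse prior (-6), better fit
for n in (5, 30, 200):
    simple = log_posterior(n, -3.0, -2.0)
    complex_ = log_posterior(n, -2.8, -6.0)
    print(n, "scenes:", "complex wins" if complex_ > simple else "simple wins")
```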

60 Note that when computing the probability of a test scene, we seek the probability that exactly the given scene was generated by the learned model. [sent-105, score-0.091]

61 This means that we require not only that all the shapes that are present in the test scene are present in the generated data, but also that all the shapes that are absent from the test scene are absent from the generated data. [sent-106, score-0.704]

62 A different scheme, in which only the presence but not the absence of the shapes need to be matched (i. [sent-107, score-0.22]

63 absent observeds are marginalized out just as latents are in Eq. [sent-109, score-0.194]
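
How the probability of a single test scene could be evaluated under one candidate network is sketched below: all 2^K latent configurations are enumerated, and every observed shape must match exactly (present shapes present, absent shapes absent), as in the default scheme described above. The brute-force enumeration and the toy parameters are illustrative assumptions, feasible only for small networks:

```python
import itertools
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def scene_prob(scene, wy, W, wx):
    """P(scene | model) for a two-layer SBN, marginalizing over all latent states.
    scene : (N,) binary vector; both present (1) and absent (0) shapes must be matched."""
    K = len(wy)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=K):
        y = np.array(bits, dtype=float)
        p_y = np.prod(np.where(y == 1, sigmoid(wy), 1.0 - sigmoid(wy)))
        p_x = sigmoid(y @ W + wx)                     # P(shape present | latents)
        total += p_y * np.prod(np.where(scene == 1, p_x, 1.0 - p_x))
    return total

# toy example: one latent cause per shape pair
wy = np.array([-0.5, -0.5])
wx = np.full(4, -4.0)
W = np.array([[8.0, 8.0, 0.0, 0.0],
              [0.0, 0.0, 8.0, 8.0]])
print(scene_prob(np.array([1, 1, 0, 0]), wy, W, wx))  # a scene showing only the first pair
```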

64 4 Results Pilot studies were performed with reduced training datasets in order to test the performance of the model learning framework. [sent-115, score-0.124]

65 First, we trained the model on data consisting of 8 observed variables (‘shapes’). [sent-116, score-0.087]

66 Figure 2: Bayesian learning in sigmoid belief networks. (Figure axes: average latent # − 1 vs. training length.) [sent-119, score-0.194]

67 Left panel: MAP model of a 30-trial-long training with 8 observed variables and 3 combos. [sent-120, score-0.131]

68 Latent variables of the MAP model reflect the relationships defined by the combos. [sent-121, score-0.065]

69 Right panel: Increasing model complexity with increasing training experience. [sent-122, score-0.1]

70 Average number of latent variables (±SD) in the model posterior distribution as a function of the length of training data was obtained by marginalizing Eq. [sent-123, score-0.229]
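
Such a posterior average can be computed with a small sketch; the dictionary of unnormalized log posterior values below is invented purely for illustration:

```python
import math

def expected_latent_count(log_post):
    """Average number of latents under the model posterior.
    log_post : dict mapping (n_latents, model_id) -> unnormalized log posterior value."""
    m = max(log_post.values())                               # for numerical stability
    weights = {k: math.exp(v - m) for k, v in log_post.items()}
    z = sum(weights.values())
    return sum(n * w / z for (n, _), w in weights.items())

print(expected_latent_count({(2, "m1"): -10.0, (3, "m2"): -8.0, (4, "m3"): -12.0}))
```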

71 sizes (5, 2, 1), two of which were presented simultaneously in each training trial. [sent-125, score-0.1]

72 The AOR effect in Bayesian model learning should select the model structure that is of just the right complexity for describing the data. [sent-126, score-0.123]

73 Accordingly, after 30 trials, the maximum a posteriori (MAP) model had three latents corresponding to the underlying ‘combos’ (Fig. [sent-127, score-0.136]

74 Early on in training simpler model structures dominated because of the prior preference for low latent and link numbers, but due to the simple structure of the training data the likelihood term won over in as few as 10 trials, and the model posterior converged to the true generative model (Fig. [sent-129, score-0.508]

75 On the other hand, if data was generated by using more ‘combos’ (4 ‘doublets’), model learning converged to a model with a correspondingly higher number of latents (Fig. [sent-132, score-0.194]

76 In the baseline experiment (Experiment 1) human subjects were trained with six equalsized doublet combos and were shown to recognize true doublets over mixture doublets (Fig. [sent-134, score-2.039]

77 When the same training data was used to compute the choice probability in 2AFC tests with model learning, true doublets were reliably preferred over mixture doublets. [sent-136, score-0.789]
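
The paper's exact choice-probability equation is not reproduced in the extracted sentences (its equation number is truncated), so the sketch below uses a simple Luce-choice assumption in which a scene is chosen in proportion to the probability the learned model assigns to it:

```python
def choice_prob(p_true, p_mixture):
    """Hypothetical 2AFC rule: the 'correct' (true-combo) scene is chosen with
    probability proportional to its probability under the learned model.
    This stands in for the paper's (truncated) choice-probability equation."""
    return p_true / (p_true + p_mixture)

# e.g. a true doublet scene vs. a mixture doublet scene scored by the learned model
print(choice_prob(p_true=2e-3, p_mixture=5e-4))  # -> 0.8, i.e. 80% 'correct'
```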

78 Also, the MAP model showed that the discovered latent variables corresponded to the combos generating the training data (data not shown). [sent-137, score-0.682]

79 Bayesian model learning, as well as humans, could distinguish between rare doublet combos and mixtures from frequent doublets (Fig. [sent-139, score-1.423]

80 Furthermore, although in this comparison rare doublet combos were preferred, both humans and the model learned about the frequencies of their constituent shapes and preferred constituent single shapes of frequent doublets over those of rare doublets. [sent-141, score-2.324]

81 Nevertheless, it should be noted that while humans showed greater preference for frequent singlets than for rare doublets, our simulations predicted an opposite trend (see footnote 1). [sent-142, score-0.824]

82 We were interested in whether the performance of humans could be fully accounted for by the learning of cross-correlations, or whether they demonstrated more sophisticated computations. [sent-143, score-0.174]

83 1 This discrepancy between theory and experiments may be explained by Gestalt effects in human vision that would strongly prefer the independent processing of constituent shapes due to their clear spatial separation in the training scenes. [sent-144, score-0.5]

84 Bars show percent ‘correct’ values (choosing a true or embedded combo over a mixture combo, or a frequent singlet over a rare singlet) for human experiments (average over subjects ±SEM), and ‘correct’ choice probabilities (Eq. [sent-147, score-0.82]

85 Sngls: Single shapes; dbls: Doublet combos; trpls: triplet combos; e’d dbls: embedded doublet combos; qpls: quadruple combos; idbls: independent doublet combos. [sent-149, score-0.746]

86 In Experiment 3, training data was composed of triplet combos, and besides testing true triplets against mixture triplets, we also tested embedded doublets (pairs of shapes from the same triplet) against mixture doublets (pairs of shapes from different triplets). [sent-150, score-2.095]

87 In contrast, human performance was significantly different for triplets (true triplets were preferred) and doublets (embedded and mixture doublets were not distinguished) (Fig. [sent-152, score-1.388]

88 This may be seen as Gestalt effects being at work: once the ‘whole’ triplet is learned, its constituent parts (the embedded doublets) lose their significance. [sent-154, score-0.323]

89 In other words, doublets were seen as mere noise under the MAP model. [sent-156, score-0.51]

90 The fourth experiment explicitly tested whether embedded combos and equal-sized independent real combos are distinguished, and whether it was only size effects that prevented the recognition of embedded small structures in the previous experiment. [sent-157, score-1.323]

91 Both human experiments and Bayesian model selection demonstrated that quadruple combos as well as stand-alone doublets were reliably recognized (Fig. [sent-158, score-1.154]

92 5 Discussion We demonstrated that humans flexibly yet automatically learn complex generative models in visual perception. [sent-160, score-0.24]

93 Bayesian model learning has been implicated in several domains of high level human cognition, from causal reasoning [15] to concept learning [16]. [sent-161, score-0.16]

94 We emphasized the importance of learning the structure of the generative model, not only its parameters, even though it is quite clear that the two cannot be formally distinguished. [sent-163, score-0.141]

95 (1) Sigmoid belief networks identical to ours but without structure learning have been shown to perform poorly on a task closely related to ours [17], Földiák’s bar test [18]. [sent-165, score-0.1]

96 More complicated models will of course be able to produce identical results, but we think our model framework has the advantage of being intuitively simple: it seeks to find the simplest possible explanation for the data, assuming that it was generated by independent causes. [sent-166, score-0.062]

97 This is computationally expensive, but together with the generative model class we use provides a neat and highly efficient way to discover ‘independent components’ in the data. [sent-168, score-0.12]

98 Our approach is very much in the tradition that sees the finding of independent causes behind sensory data as one of the major goals of perception [2]. [sent-170, score-0.135]

99 Very recently, several models have been proposed for doing inference in belief networks [20, 21], but parameter learning, let alone structure learning, proved to be non-trivial in them. [sent-172, score-0.124]

100 Our results highlight the importance of considering model structure learning in neural models of Bayesian inference. [sent-173, score-0.089]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('doublets', 0.51), ('combos', 0.476), ('combo', 0.221), ('doublet', 0.221), ('shapes', 0.22), ('dbls', 0.17), ('wm', 0.162), ('triplet', 0.133), ('scenes', 0.122), ('rare', 0.114), ('embedded', 0.109), ('triplets', 0.104), ('humans', 0.103), ('latents', 0.102), ('scene', 0.091), ('generative', 0.086), ('mixture', 0.082), ('constituent', 0.081), ('human', 0.078), ('latent', 0.075), ('experiment', 0.073), ('aslin', 0.068), ('fiser', 0.068), ('frequent', 0.068), ('training', 0.066), ('subjects', 0.059), ('comput', 0.059), ('wy', 0.059), ('bayesian', 0.052), ('panel', 0.052), ('aor', 0.051), ('familiarity', 0.051), ('idbls', 0.051), ('observeds', 0.051), ('qpls', 0.051), ('quadruplet', 0.051), ('sbns', 0.051), ('sngls', 0.051), ('trpls', 0.051), ('sigmoid', 0.05), ('belief', 0.045), ('gestalt', 0.044), ('quadruplets', 0.044), ('wij', 0.044), ('preferred', 0.043), ('absent', 0.041), ('courville', 0.04), ('causes', 0.039), ('frequencies', 0.039), ('yj', 0.039), ('sensory', 0.038), ('wx', 0.036), ('phase', 0.034), ('budapest', 0.034), ('quadruple', 0.034), ('rochester', 0.034), ('sbn', 0.034), ('singlet', 0.034), ('wxi', 0.034), ('wyj', 0.034), ('yb', 0.034), ('yz', 0.034), ('reproduced', 0.034), ('presented', 0.034), ('model', 0.034), ('structure', 0.031), ('variables', 0.031), ('consisted', 0.03), ('perception', 0.03), ('asked', 0.03), ('true', 0.03), ('experience', 0.03), ('razor', 0.03), ('preference', 0.029), ('tested', 0.029), ('visual', 0.029), ('independent', 0.028), ('occurrence', 0.028), ('dayan', 0.027), ('appeared', 0.027), ('occam', 0.027), ('spatial', 0.027), ('arranged', 0.026), ('ac', 0.025), ('psychol', 0.025), ('accounted', 0.025), ('percent', 0.025), ('map', 0.024), ('learning', 0.024), ('appearance', 0.024), ('tests', 0.024), ('ya', 0.024), ('ma', 0.023), ('fourth', 0.023), ('posterior', 0.023), ('arrangement', 0.023), ('demonstrated', 0.022), ('frequency', 0.022), ('consisting', 0.022), ('importantly', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999875 35 nips-2005-Bayesian model learning in human visual perception

Author: Gergő Orbán, Jozsef Fiser, Richard N. Aslin, Máté Lengyel

Abstract: Humans make optimal perceptual decisions in noisy and ambiguous conditions. Computations underlying such optimal behavior have been shown to rely on probabilistic inference according to generative models whose structure is usually taken to be known a priori. We argue that Bayesian model selection is ideal for inferring similar and even more complex model structures from experience. We find in experiments that humans learn subtle statistical properties of visual scenes in a completely unsupervised manner. We show that these findings are well captured by Bayesian model learning within a class of models that seek to explain observed variables by independent hidden causes. 1

2 0.077734232 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection

Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky

Abstract: We present a computational model of human eye movements in an object class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.

3 0.072636113 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?

Author: Yan Karklin, Michael S. Lewicki

Abstract: Linear implementations of the efficient coding hypothesis, such as independent component analysis (ICA) and sparse coding models, have provided functional explanations for properties of simple cells in V1 [1, 2]. These models, however, ignore the non-linear behavior of neurons and fail to match individual and population properties of neural receptive fields in subtle but important ways. Hierarchical models, including Gaussian Scale Mixtures [3, 4] and other generative statistical models [5, 6], can capture higher-order regularities in natural images and explain nonlinear aspects of neural processing such as normalization and context effects [6,7]. Previously, it had been assumed that the lower level representation is independent of the hierarchy, and had been fixed when training these models. Here we examine the optimal lower-level representations derived in the context of a hierarchical model and find that the resulting representations are strikingly different from those based on linear models. Unlike the the basis functions and filters learned by ICA or sparse coding, these functions individually more closely resemble simple cell receptive fields and collectively span a broad range of spatial scales. Our work unifies several related approaches and observations about natural image structure and suggests that hierarchical models might yield better representations of image structure throughout the hierarchy.

4 0.06833525 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

Author: Aaron Shon, Keith Grochow, Aaron Hertzmann, Rajesh P. Rao

Abstract: We propose an algorithm that uses Gaussian process regression to learn common hidden structure shared between corresponding sets of heterogenous observations. The observation spaces are linked via a single, reduced-dimensionality latent variable space. We present results from two datasets demonstrating the algorithms’s ability to synthesize novel data from learned correspondences. We first show that the method can learn the nonlinear mapping between corresponding views of objects, filling in missing data as needed to synthesize novel views. We then show that the method can learn a mapping between human degrees of freedom and robotic degrees of freedom for a humanoid robot, allowing robotic imitation of human poses from motion capture data. 1

5 0.067672819 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects

Author: Jean-pascal Pfister, Wulfram Gerstner

Abstract: While classical experiments on spike-timing dependent plasticity analyzed synaptic changes as a function of the timing of pairs of pre- and postsynaptic spikes, more recent experiments also point to the effect of spike triplets. Here we develop a mathematical framework that allows us to characterize timing based learning rules. Moreover, we identify a candidate learning rule with five variables (and 5 free parameters) that captures a variety of experimental data, including the dependence of potentiation and depression upon pre- and postsynaptic firing frequencies. The relation to the Bienenstock-Cooper-Munro rule as well as to some timing-based rules is discussed. 1

6 0.062939234 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

7 0.062799327 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise

8 0.051727939 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

9 0.050896626 80 nips-2005-Gaussian Process Dynamical Models

10 0.050246585 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

11 0.050191741 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

12 0.049379718 34 nips-2005-Bayesian Surprise Attracts Human Attention

13 0.048138693 45 nips-2005-Conditional Visual Tracking in Kernel Space

14 0.047863707 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

15 0.047078755 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception

16 0.045570318 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

17 0.045567997 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

18 0.045384247 98 nips-2005-Infinite latent feature models and the Indian buffet process

19 0.044883624 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

20 0.044083666 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.14), (1, -0.03), (2, 0.014), (3, 0.124), (4, 0.041), (5, -0.068), (6, 0.037), (7, -0.009), (8, -0.058), (9, -0.033), (10, 0.024), (11, 0.007), (12, 0.0), (13, 0.009), (14, 0.02), (15, 0.002), (16, -0.001), (17, 0.046), (18, 0.058), (19, 0.09), (20, -0.068), (21, -0.028), (22, 0.046), (23, -0.022), (24, 0.003), (25, 0.027), (26, -0.07), (27, -0.048), (28, 0.028), (29, 0.018), (30, -0.043), (31, 0.019), (32, 0.015), (33, -0.065), (34, -0.046), (35, 0.008), (36, -0.04), (37, -0.043), (38, 0.003), (39, -0.055), (40, 0.061), (41, 0.084), (42, -0.065), (43, 0.191), (44, 0.077), (45, -0.095), (46, 0.054), (47, -0.097), (48, -0.032), (49, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90170187 35 nips-2005-Bayesian model learning in human visual perception

Author: Gergő Orbán, Jozsef Fiser, Richard N. Aslin, Máté Lengyel

Abstract: Humans make optimal perceptual decisions in noisy and ambiguous conditions. Computations underlying such optimal behavior have been shown to rely on probabilistic inference according to generative models whose structure is usually taken to be known a priori. We argue that Bayesian model selection is ideal for inferring similar and even more complex model structures from experience. We find in experiments that humans learn subtle statistical properties of visual scenes in a completely unsupervised manner. We show that these findings are well captured by Bayesian model learning within a class of models that seek to explain observed variables by independent hidden causes. 1

2 0.60231429 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise

Author: Hongjing Lu, Alan L. Yuille

Abstract: We derive a Bayesian Ideal Observer (BIO) for detecting motion and solving the correspondence problem. We obtain Barlow and Tripathy’s classic model as an approximation. Our psychophysical experiments show that the trends of human performance are similar to the Bayesian Ideal, but overall human performance is far worse. We investigate ways to degrade the Bayesian Ideal but show that even extreme degradations do not approach human performance. Instead we propose that humans perform motion tasks using generic, general purpose, models of motion. We perform more psychophysical experiments which are consistent with humans using a Slow-and-Smooth model and which rule out an alternative model using Slowness. 1

3 0.55152911 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception

Author: Alan Stocker, Eero P. Simoncelli

Abstract: We extend a previously developed Bayesian framework for perception to account for sensory adaptation. We first note that the perceptual effects of adaptation seems inconsistent with an adjustment of the internally represented prior distribution. Instead, we postulate that adaptation increases the signal-to-noise ratio of the measurements by adapting the operational range of the measurement stage to the input range. We show that this changes the likelihood function in such a way that the Bayesian estimator model can account for reported perceptual behavior. In particular, we compare the model’s predictions to human motion discrimination data and demonstrate that the model accounts for the commonly observed perceptual adaptation effects of repulsion and enhanced discriminability. 1

4 0.50844121 4 nips-2005-A Bayesian Spatial Scan Statistic

Author: Daniel B. Neill, Andrew W. Moore, Gregory F. Cooper

Abstract: We propose a new Bayesian method for spatial cluster detection, the “Bayesian spatial scan statistic,” and compare this method to the standard (frequentist) scan statistic approach. We demonstrate that the Bayesian statistic has several advantages over the frequentist approach, including increased power to detect clusters and (since randomization testing is unnecessary) much faster runtime. We evaluate the Bayesian and frequentist methods on the task of prospective disease surveillance: detecting spatial clusters of disease cases resulting from emerging disease outbreaks. We demonstrate that our Bayesian methods are successful in rapidly detecting outbreaks while keeping number of false positives low. 1

5 0.507631 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

Author: Aaron Shon, Keith Grochow, Aaron Hertzmann, Rajesh P. Rao

Abstract: We propose an algorithm that uses Gaussian process regression to learn common hidden structure shared between corresponding sets of heterogenous observations. The observation spaces are linked via a single, reduced-dimensionality latent variable space. We present results from two datasets demonstrating the algorithms’s ability to synthesize novel data from learned correspondences. We first show that the method can learn the nonlinear mapping between corresponding views of objects, filling in missing data as needed to synthesize novel views. We then show that the method can learn a mapping between human degrees of freedom and robotic degrees of freedom for a humanoid robot, allowing robotic imitation of human poses from motion capture data. 1

6 0.50383806 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence

7 0.50168139 34 nips-2005-Bayesian Surprise Attracts Human Attention

8 0.47135496 156 nips-2005-Prediction and Change Detection

9 0.47075689 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

10 0.42459035 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

11 0.40635651 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?

12 0.39555565 80 nips-2005-Gaussian Process Dynamical Models

13 0.39282942 45 nips-2005-Conditional Visual Tracking in Kernel Space

14 0.3754493 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

15 0.37298426 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

16 0.37276676 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

17 0.35943997 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects

18 0.3573592 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

19 0.34873053 98 nips-2005-Infinite latent feature models and the Indian buffet process

20 0.34484774 37 nips-2005-Benchmarking Non-Parametric Statistical Tests


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.035), (10, 0.031), (11, 0.021), (27, 0.053), (31, 0.049), (34, 0.06), (39, 0.041), (41, 0.011), (55, 0.014), (65, 0.022), (69, 0.056), (73, 0.041), (80, 0.337), (88, 0.095), (91, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94907516 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes

Author: Xuerui Wang, Natasha Mohanty, Andrew McCallum

Abstract: We present a probabilistic generative model of entity relationships and their attributes that simultaneously discovers groups among the entities and topics among the corresponding textual attributes. Block-models of relationship data have been studied in social network analysis for some time. Here we simultaneously cluster in several modalities at once, incorporating the attributes (here, words) associated with certain relationships. Significantly, joint inference allows the discovery of topics to be guided by the emerging groups, and vice-versa. We present experimental results on two large data sets: sixteen years of bills put before the U.S. Senate, comprising their corresponding text and voting records, and thirteen years of similar data from the United Nations. We show that in comparison with traditional, separate latent-variable models for words, or Blockstructures for votes, the Group-Topic model’s joint inference discovers more cohesive groups and improved topics. 1

2 0.74404013 52 nips-2005-Correlated Topic Models

Author: John D. Lafferty, David M. Blei

Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1

same-paper 3 0.72502041 35 nips-2005-Bayesian model learning in human visual perception

Author: Gergő Orbán, Jozsef Fiser, Richard N. Aslin, Máté Lengyel

Abstract: Humans make optimal perceptual decisions in noisy and ambiguous conditions. Computations underlying such optimal behavior have been shown to rely on probabilistic inference according to generative models whose structure is usually taken to be known a priori. We argue that Bayesian model selection is ideal for inferring similar and even more complex model structures from experience. We find in experiments that humans learn subtle statistical properties of visual scenes in a completely unsupervised manner. We show that these findings are well captured by Bayesian model learning within a class of models that seek to explain observed variables by independent hidden causes. 1
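The abstract's claim that Bayesian model selection can recover structure from experience can be illustrated with a toy example outside the paper's own model class: for two binary scene features, compare an "independent causes" model against a "single shared cause" model through their closed-form marginal likelihoods. The data below are simulated under perfect co-occurrence purely for illustration.

```python
import numpy as np
from scipy.special import betaln, gammaln

rng = np.random.default_rng(1)

# Toy scenes: two binary features that in truth always co-occur (one shared cause).
z = rng.random(200) < 0.5
scenes = np.stack([z, z], axis=1).astype(int)

def log_evidence_independent(X, a=1.0, b=1.0):
    """Each feature gets its own Beta-Bernoulli evidence; features are independent."""
    n1 = X.sum(axis=0)
    n0 = X.shape[0] - n1
    return float(np.sum(betaln(a + n1, b + n0) - betaln(a, b)))

def log_evidence_joint(X, alpha=1.0):
    """One Dirichlet-multinomial evidence over the 4 joint configurations."""
    counts = np.bincount(2 * X[:, 0] + X[:, 1], minlength=4)
    N, K = counts.sum(), 4
    return float(gammaln(K * alpha) - gammaln(K * alpha + N)
                 + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

log_bayes_factor = log_evidence_joint(scenes) - log_evidence_independent(scenes)
print(f"log Bayes factor (joint vs independent): {log_bayes_factor:.1f}")
```

With perfectly co-occurring features the shared-cause model wins by a wide margin; with independently generated features the sign flips, so the comparison recovers the structure that produced the data.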

4 0.44744039 48 nips-2005-Context as Filtering

Author: Daichi Mochihashi, Yuji Matsumoto

Abstract: Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has almost been neglected so far and a naïve bag-of-words history has been employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order. 1
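A generic bootstrap particle filter over a discrete latent "topic" state gives the flavour of the online inference described above. The transition model and word distributions below are invented; this is not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, P = 3, 8, 500          # topics, vocabulary size, particles
stay = 0.9                   # probability the topic does not shift between words

beta = rng.dirichlet(np.ones(V) * 0.5, size=K)   # made-up per-topic word distributions
words = rng.choice(V, size=30, p=beta[1])        # pretend topic 1 generated the text

particles = rng.integers(K, size=P)              # initial topic guesses
for w in words:
    # Propagate: keep the current topic with prob `stay`, otherwise jump uniformly.
    jump = rng.random(P) >= stay
    particles = np.where(jump, rng.integers(K, size=P), particles)
    # Weight each particle by the likelihood of the observed word, then resample.
    weights = beta[particles, w]
    particles = rng.choice(particles, size=P, p=weights / weights.sum())

posterior = np.bincount(particles, minlength=K) / P
print("posterior over the current topic:", posterior.round(2))
```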

5 0.41184443 30 nips-2005-Assessing Approximations for Gaussian Process Classification

Author: Malte Kuss, Carl E. Rasmussen

Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace's method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.
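The contrast drawn above between Laplace's method and EP can be seen already in one dimension with assumed numbers: a single latent value with a Gaussian prior and probit likelihood, comparing the mode-plus-curvature Gaussian (Laplace) with the moment-matched Gaussian that EP aims for.

```python
import numpy as np
from scipy.stats import norm

# One latent value f with prior N(0, 5^2) and probit likelihood Phi(f) for label y = 1.
f = np.linspace(-25.0, 25.0, 20001)
h = f[1] - f[0]
log_post = norm.logpdf(f, 0.0, 5.0) + norm.logcdf(f)   # unnormalised log posterior
post = np.exp(log_post - log_post.max())
post /= post.sum() * h                                 # normalise on the grid

# Laplace: Gaussian at the mode with variance set by the curvature there.
i = int(np.argmax(log_post))
curvature = (log_post[i + 1] - 2 * log_post[i] + log_post[i - 1]) / h**2
laplace_mean, laplace_var = f[i], -1.0 / curvature

# Moment matching (what EP targets): exact mean and variance of the skewed posterior.
mm_mean = (f * post).sum() * h
mm_var = ((f - mm_mean) ** 2 * post).sum() * h

print(f"Laplace:        mean {laplace_mean:5.2f}, var {laplace_var:6.2f}")
print(f"Moment matched: mean {mm_mean:5.2f}, var {mm_var:6.2f}")
```

The moment-matched mean sits well above the mode, which is the skew effect the abstract alludes to.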

6 0.41056201 45 nips-2005-Conditional Visual Tracking in Kernel Space

7 0.40625459 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

8 0.40428817 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

9 0.40409601 144 nips-2005-Off-policy Learning with Options and Recognizers

10 0.40248865 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

11 0.40247655 136 nips-2005-Noise and the two-thirds power Law

12 0.39805621 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

13 0.39785999 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

14 0.39728037 169 nips-2005-Saliency Based on Information Maximization

15 0.39715335 112 nips-2005-Learning Minimum Volume Sets

16 0.3968949 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface

17 0.39645544 171 nips-2005-Searching for Character Models

18 0.39640692 74 nips-2005-Faster Rates in Regression via Active Learning

19 0.3959322 100 nips-2005-Interpolating between types and tokens by estimating power-law generators

20 0.39548436 23 nips-2005-An Application of Markov Random Fields to Range Sensing