nips nips2007 nips2007-145 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. [sent-2, score-0.362]
2 Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. [sent-4, score-0.553]
3 1 Introduction Computational models of visual cortex, and in particular those based on sparse coding, have recently enjoyed much attention. [sent-5, score-0.285]
4 The basic assumption behind sparse coding is that natural scenes are composed of structural primitives (edges or lines, for example) and, although there are a potentially large number of these primitives, typically only a few are active in a single natural scene (hence the term sparse, [1, 2]). [sent-6, score-1.305]
5 The claim is that cortical processing uses these statistical regularities to shape a representation of natural scenes, and in particular converts the pixel-based representation at the retina to a higher-level representation in terms of these structural primitives. [sent-7, score-0.241]
6 Traditionally, research has focused on determining the characteristics of the structural primitives and comparing their representational properties with those of V1. [sent-8, score-0.482]
7 The two we focus on here are: How large is the set of structural primitives best suited to describe all natural scenes (how over-complete), and how many primitives are active in a single scene (how sparse)? [sent-10, score-1.159]
8 We will also be interested in the coupling between sparseness and over-completeness. [sent-11, score-0.4]
9 The intuition is that, if there are a great number of structural primitives, they can be very specific and only a small number will be active in a visual scene. [sent-12, score-0.204]
10 We attempt to map this coupling by evaluating models with different over-completenesses and sparsenesses and discover where natural scenes live along this trade-off (see Fig. [sent-14, score-0.504]
11 In order to test the sparse coding hypothesis it is necessary to build algorithms that both learn the primitives and decompose natural scenes in terms of them. [sent-16, score-0.944]
12 There have been many ways to derive such algorithms, but one of the more successful is to regard the task of building a representation of natural scenes as one of probabilistic inference. [sent-17, score-0.287]
13 More specifically, the unknown activities of the structural primitives are viewed as latent variables that must be inferred from the natural scene data. [sent-18, score-0.786]
14 Commonly the inference is carried out by writing down a generative model (although see [3] for an alternative), which formalises the assumptions made about the data and latent variables. [sent-19, score-0.196]
15 Unfortunately the assumption that natural scenes are composed of a small number of structural primitives is not sufficient to build a meaningful generative model. [sent-21, score-0.82]
16 Other assumptions must therefore be made and typically these are that the primitives occur independently, and combine linearly. [sent-22, score-0.321]
17 These 1 10 overcompleteness 8 6 4 2 0 sparsity Figure 1: Schematic showing the space of possible sparse coding models in terms of sparseness (increasing in the direction of the arrow) and over-completeness. [sent-23, score-0.966]
18 via their marginal likelihood or cross-validation) and the grey contours illustrate what we might expect to discover if this were possible: The solid black line illustrates the hypothesised trade-off between over-completeness and sparsity, whilst the star shows the optimal point in this trade-off. [sent-27, score-0.261]
19 are drastic approximations and it is an open question to what extent this affects the results of sparse coding. [sent-28, score-0.181]
20 The distribution over the latent variables xt,k is chosen to be sparse and typical choices are Student-t, a Mixture of Gaussians (with zero means), and the Generalised Gaussian (which includes the Laplace distribution). [sent-29, score-0.275]
21 (2) The goal of this paper will be to learn the optimal dimensionality of the latent variables (K) and the optimal sparseness of the prior (α). [sent-31, score-0.644]
22 In fact, if the hypothesis is that the visual system is implementing an optimal generative model, then questions of over-completeness and sparsity should be addressed in this context. [sent-37, score-0.303]
23 Finally, we present results concerning the optimal sparseness and over-completeness for natural image patches in the case that the prior is a Student-t distribution. [sent-40, score-0.725]
24 Here, we focus on the Student-t prior for the latent variables xt,k : Γ α+1 p(xt,k |α, λ) = √ 2 λ απ Γ α 2 1 1+ α xt,k λ 2 − α+1 2 (3) There are two main reasons for this choice: The first is that this is a widely used model [1]. [sent-42, score-0.203]
25 The second is that by implementing the Student-t prior using an auxiliary variable, all the distributions in the generative model become members of the exponential family [5]. [sent-43, score-0.171]
26 This means it is easy to derive efficient approximate inference schemes like variational Bayes and Gibbs sampling. [sent-44, score-0.133]
27 These nonzero elements form star-like patterns, where the points of the star are determined by the direction of the weights (e. [sent-48, score-0.129]
28 One of the major technical difficulties posed by sparse-coding is that, in the over-complete regime, the posterior distribution of the latent variables p(X|Y, θ) is often complex and multi-modal. [sent-52, score-0.144]
29 Approximation schemes are therefore required, but we must be careful to ensure that the scheme we choose does not bias the conclusions we are trying to draw. [sent-53, score-0.137]
30 This is true for any application of sparse coding, but is particularly pertinent for our problem as we will be quantitatively comparing different sparse-coding models. [sent-54, score-0.272]
31 However, an alternative is to learn the optimal overcompleteness at a given sparseness using automatic relevance determination (ARD) [7, 8]. [sent-67, score-0.662]
32 In a nutshell, the idea is to equip the model with many more components than are believed to be present in the data, and to let it prune out the weights which are unnecessary. [sent-71, score-0.15]
33 Practically this involves placing a (Gaussian) prior over the components which favours small weights, and then inferring the scale of this prior. [sent-72, score-0.176]
34 4 (13) Determining the over-completeness: Variational Bayes In the previous two sections we described a generative model for sparse coding that is theoretically able to learn the optimal over-completeness of natural scenes. [sent-75, score-0.621]
35 We have two distinct uses for this model: The first, and computationally more demanding task, is to learn the over-completeness at a variety of different, fixed, sparsenesses (that is, to find the optimal over-completeness in a vertical slice through Fig. [sent-76, score-0.148]
36 1); The second is to determine the optimal point on this trade-off by evaluating the (approximate) marginal-likelihood (that is, evaluating points along the trade-off line in Fig. [sent-77, score-0.148]
37 The first approximation scheme is Variational Bayes (VB), and it excels at the first task, but is severely biased in the case of the second. [sent-80, score-0.109]
38 In particular if we assume that the set of parameters and set of latent variables are independent in the posterior, so that q(V, Θ) = q(V )q(Θ) then we can sequentially optimise the free-energy with respect to each of these distributions. [sent-88, score-0.139]
39 The difference between the free energy and the true likelihood is given by the KL divergence between the approximate and true posterior. [sent-95, score-0.278]
40 At lowsparsities the prior becomes Gaussian-like and the posterior also becomes a uni-modal Gaussian. [sent-99, score-0.119]
41 One might also be concerned with another potential source of bias: The number of modes in the posterior increases with the number of components in the model, which gives a worse match to the variational approximation for more over-complete models. [sent-102, score-0.218]
42 However, because of the sparseness of the prior distribution, most of the modes are going to be very shallow for typical inputs, so that this effect should be small. [sent-103, score-0.406]
43 One alternative might be to use a variational method which has a multi-modal approximating distribution (e. [sent-109, score-0.128]
44 Briefly, this inner loop starts by drawing samples from the model’s prior distribution and continues to sample as the prior is deformed into the posterior, according to an annealing schedule. [sent-115, score-0.261]
45 6 Results Before tackling natural images, it is necessary to verify that the approximations can discover the correct degree of over-completeness and sparsity in the case where the data are drawn from the forward model. [sent-118, score-0.36]
46 1 Verification using simple artifical data In the first experiment the training data are produced as follows: Two-dimensional observations are generated by three Student-t sources with degree of freedom chosen to be 2. [sent-121, score-0.116]
47 The generative weights are fixed to be 60 degrees apart from one another, as shown in Figure 2. [sent-123, score-0.132]
48 A series of VB simulations were then run, differing only in the sparseness level (as measured by the degrees of freedom of the Student-t distribution over xt ). [sent-124, score-0.56]
49 Each simulation consisted of 500 VB iterations performed on a set of 3000 data points randomly generated from the model. [sent-125, score-0.148]
50 We initialised the simulations with K = 7 components. [sent-126, score-0.182]
51 To improve convergence, we started the simulations with weights near the origin (drawn from a normal distribution with mean 0 and standard deviation 10−8 ) and a relatively large input noise variance, and annealed the noise variance between the iterations 2 of VBEM. [sent-127, score-0.438]
52 The annealing schedule was as following: we started with σy = 0. [sent-128, score-0.269]
53 During the annealing process, the weights typically grew from the origin and spread in all directions to cover the input space. [sent-132, score-0.311]
54 At the same time, the corresponding precision hyperparameters grew and effectively pruned the unnecessary components. [sent-134, score-0.171]
55 We performed 7 blocks of simulations at different sparseness levels. [sent-135, score-0.462]
56 In every block we performed 3 runs of the algorithm and retained the result with the highest free energy. [sent-136, score-0.241]
57 5 −8000 −6500 −8200 −6550 log P(Y) (NATS) free energy (NATS) 4 −8400 −8600 −8800 −9000 −9200 2 0 2 −9400 −6600 −6650 −6700 −6750 −6800 −9600 4 2. [sent-137, score-0.24]
58 We derived the importance weights using a fixed data set with 2500 data points, 250 samples, and 300 intermediate distributions. [sent-154, score-0.14]
59 Following the recommendations in [9], the annealing schedule was chosen to be linear initially (with 50 inverse temperatures spaced uniformly from 0 to 0. [sent-155, score-0.302]
60 The results indicate that the combination of the two methods is successful at learning both the overcompleteness and sparseness. [sent-159, score-0.235]
61 In particular the VBEM algorithm was able to recover the correct dimensionality for all sparseness levels, except for the sparsest case α = 2. [sent-160, score-0.337]
62 As expected, however, figure 2 shows that the maximum free energy is biased toward the more Gaussian models. [sent-162, score-0.297]
63 2), which is strictly greater than the free-energy as expected, favours sparseness levels close to the true value. [sent-164, score-0.52]
64 2 Verification using complex artificial data Although it is necessary that the inference scheme should pass simple tests like that in the previous section, they are not sufficient to give us confidence that it will perform successfully on natural data. [sent-166, score-0.181]
65 One pertinent criticism is that the regime in which we tested the algorithms in the previous section (two dimensional observations, and three hidden latents) is quite different from that required to model natural data. [sent-167, score-0.325]
66 To that end, in this section we first learn a sparse model for natural images with fixed over-completeness levels using a Maximum A Posteriori (MAP) algorithm [2] (degree of freedom 2. [sent-168, score-0.607]
67 The goal is to validate the model on data which has a content and scale similar to the natural images case, but with a controlled number of generative components. [sent-171, score-0.308]
68 The image data comprised patches of size 9 × 9 pixels, taken at random positions from 36 natural images randomly selected from the van Hateren database (preprocessed as described in [10]). [sent-172, score-0.389]
69 The MAP solution was trained for 500 iterations, with every iteration performed on a new batch of 1440 patches (100 patches per image). [sent-174, score-0.244]
70 The model was initialised with a 3-times over-complete number of components (K = 108). [sent-175, score-0.187]
71 As above, the weights were initialised near the origin, and the input noise was annealed linearly from σd = 0. [sent-176, score-0.301]
72 Every run consisted of 500 VBEM iterations, with every iteration performed on 3600 patches generated from the MAP solution. [sent-179, score-0.184]
73 We performed several simulations for over-completeness levels between 0. [sent-180, score-0.203]
74 5, and retained the solutions with the highest free energy. [sent-182, score-0.191]
75 5 True Overcompleteness Figure 3: True versus inferred over-completeness from data drawn from the forward model trained on natural images. [sent-195, score-0.26]
76 This straight line saturates when we hit the number of latent variables with which ARD was initialised (three times overcomplete). [sent-197, score-0.251]
77 The results using multiple runs of ARD are close to this line (open circles, simulations with the highest free-energy are shown as closed circles). [sent-198, score-0.115]
78 The maximal and best over-completeness inferred from natural scenes is shown by the dotted line, and lies well below the over-completenesses we are able to infer. [sent-199, score-0.34]
79 values are still far above the highest over-completeness learned from natural images (see section 6. [sent-200, score-0.246]
80 3 Natural images Having established that the model performs as expected, at least when the data is drawn from the forward model, we now turn to natural image data and examine the optimal over-completeness ratio and sparseness degree for natural scene statistics. [sent-203, score-0.97]
81 The image data for this simulation and the model initialisation and annealing procedure are identical to the ones in the experiments in the preceeding section. [sent-204, score-0.202]
82 We performed 20 simulations with different sparseness levels, especially concentrated on the more sparse values. [sent-205, score-0.643]
83 As shown in Figure 4, the free energy increased almost monotonically until α = 5 and then stabilised and started to decrease for more Gaussian models. [sent-207, score-0.259]
84 3, with a trend for being more over-complete at high sparseness levels (Fig. [sent-209, score-0.46]
85 Although this general trend accords with the intuition that sparseness and over-completeness are coupled, both the magnitude of the effect and the degree of over-completeness is smaller than might have been anticipated. [sent-211, score-0.469]
86 Finally we performed AIS using the same annealing schedule as in Section 6. [sent-213, score-0.262]
87 1, using 250 samples for the first 6 sparseness levels and 50 for the successive 14. [sent-214, score-0.415]
88 The estimates obtained for the log marginal likelihood, shown in Figure 4, were monotonically increasing with increasing sparseness (decreasing α). [sent-215, score-0.437]
89 This indicates that sparse models are indeed optimal for natural scenes. [sent-216, score-0.409]
90 Note that this is exactly the opposite trend to that of the free energy, indicating that it is also biased for natural scenes. [sent-217, score-0.339]
91 The weights resemble the Gabor wavelets, typical of sparse codes for natural images [1]. [sent-220, score-0.457]
92 7 Discussion Our results suggest that the optimal sparse-coding model for natural scenes is indeed one which is very sparse, but only modestly over-complete. [sent-221, score-0.493]
93 The anticipated coupling between the degree of sparsity and the over-completeness in the model is visible, but is weak. [sent-222, score-0.247]
94 One crucial question is how far these results will generalise to other prior distributions; and indeed, which of the various possible sparse-coding priors is best able to capture the structure of natural scenes. [sent-223, score-0.198]
95 a) Free energy b) Marginal likelihood c) Estimated over-completeness d) Basis vectors freedom parameter moves towards sparser values. [sent-239, score-0.16]
96 It is not clear that data with such extreme values will be encountered in typical data sets, and so the model may become distorted at high sparseness values. [sent-241, score-0.377]
97 Emergence of simple-cell receptive field properties by learning a sparse code for natural images. [sent-252, score-0.31]
98 Sparse coding with an overcomplete basis set: A strategy employed by V1? [sent-259, score-0.186]
99 The ‘independent components’ of natural scenes are edge filters. [sent-274, score-0.287]
100 Independent component filters of natural images compared with simple cells in primary visual cortex. [sent-308, score-0.257]
wordName wordTfidf (topN-words)
[('sparseness', 0.337), ('primitives', 0.321), ('ais', 0.235), ('overcompleteness', 0.235), ('sparse', 0.181), ('scenes', 0.158), ('vbem', 0.134), ('natural', 0.129), ('annealed', 0.124), ('annealing', 0.123), ('coding', 0.119), ('ard', 0.117), ('nats', 0.117), ('structural', 0.112), ('free', 0.108), ('initialised', 0.107), ('regime', 0.103), ('patches', 0.097), ('sparsity', 0.094), ('latent', 0.094), ('energy', 0.094), ('variational', 0.091), ('schedule', 0.089), ('dv', 0.085), ('levels', 0.078), ('scene', 0.077), ('images', 0.077), ('vb', 0.075), ('simulations', 0.075), ('importance', 0.07), ('weights', 0.07), ('prior', 0.069), ('favours', 0.067), ('grew', 0.067), ('gxt', 0.067), ('modestly', 0.067), ('nyt', 0.067), ('sparsities', 0.067), ('overcomplete', 0.067), ('freedom', 0.066), ('coupling', 0.063), ('marginal', 0.062), ('generative', 0.062), ('iterations', 0.061), ('precision', 0.06), ('star', 0.059), ('sparsenesses', 0.058), ('started', 0.057), ('biased', 0.057), ('optimal', 0.054), ('gatsby', 0.054), ('enjoyed', 0.053), ('pertinent', 0.053), ('hateren', 0.053), ('bayes', 0.053), ('inferred', 0.053), ('scheme', 0.052), ('origin', 0.051), ('visual', 0.051), ('posterior', 0.05), ('temperatures', 0.05), ('saturates', 0.05), ('degree', 0.05), ('performed', 0.05), ('determining', 0.049), ('discover', 0.049), ('olshausen', 0.047), ('comprised', 0.047), ('evaluating', 0.047), ('bayesian', 0.046), ('trend', 0.045), ('gamma', 0.045), ('indeed', 0.045), ('osindero', 0.045), ('optimise', 0.045), ('hyperparameters', 0.044), ('bias', 0.043), ('retained', 0.043), ('xt', 0.042), ('questions', 0.042), ('schemes', 0.042), ('active', 0.041), ('highest', 0.04), ('components', 0.04), ('differing', 0.04), ('spaced', 0.04), ('model', 0.04), ('image', 0.039), ('gk', 0.038), ('toward', 0.038), ('log', 0.038), ('true', 0.038), ('composed', 0.038), ('forward', 0.038), ('consisted', 0.037), ('laplace', 0.037), ('sampler', 0.037), ('might', 0.037), ('learn', 0.036), ('teh', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 145 nips-2007-On Sparsity and Overcompleteness in Image Models
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
2 0.21768571 133 nips-2007-Modelling motion primitives and their timing in biologically executed movements
Author: Ben Williams, Marc Toussaint, Amos J. Storkey
Abstract: Biological movement is built up of sub-blocks or motion primitives. Such primitives provide a compact representation of movement which is also desirable in robotic control applications. We analyse handwriting data to gain a better understanding of primitives and their timings in biological movements. Inference of the shape and the timing of primitives can be done using a factorial HMM based model, allowing the handwriting to be represented in primitive timing space. This representation provides a distribution of spikes corresponding to the primitive activations, which can also be modelled using HMM architectures. We show how the coupling of the low level primitive model, and the higher level timing model during inference can produce good reconstructions of handwriting, with shared primitives for all characters modelled. This coupled model also captures the variance profile of the dataset which is accounted for by spike timing jitter. The timing code provides a compact representation of the movement while generating a movement without an explicit timing model produces a scribbling style of output. 1
3 0.19276202 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.
4 0.18254393 8 nips-2007-A New View of Automatic Relevance Determination
Author: David P. Wipf, Srikantan S. Nagarajan
Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted 1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1
5 0.13959427 182 nips-2007-Sparse deep belief net model for visual area V2
Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng
Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1
6 0.13527985 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
7 0.11177319 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
8 0.10148083 213 nips-2007-Variational Inference for Diffusion Processes
9 0.095233545 47 nips-2007-Collapsed Variational Inference for HDP
10 0.088578537 195 nips-2007-The Generalized FITC Approximation
11 0.084203802 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
12 0.082762852 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
13 0.08223363 143 nips-2007-Object Recognition by Scene Alignment
14 0.079998292 96 nips-2007-Heterogeneous Component Analysis
15 0.078762829 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior
16 0.073530331 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
17 0.070220053 170 nips-2007-Robust Regression with Twinned Gaussian Processes
18 0.069826841 183 nips-2007-Spatial Latent Dirichlet Allocation
19 0.06951052 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
20 0.063789152 99 nips-2007-Hierarchical Penalization
topicId topicWeight
[(0, -0.26), (1, 0.087), (2, 0.033), (3, -0.124), (4, -0.002), (5, 0.031), (6, -0.17), (7, 0.056), (8, 0.001), (9, -0.071), (10, 0.147), (11, 0.03), (12, 0.079), (13, 0.013), (14, 0.055), (15, 0.085), (16, 0.016), (17, 0.187), (18, -0.029), (19, -0.212), (20, -0.126), (21, 0.055), (22, 0.134), (23, 0.022), (24, -0.079), (25, 0.01), (26, 0.01), (27, 0.081), (28, -0.056), (29, -0.092), (30, 0.137), (31, -0.125), (32, 0.111), (33, -0.046), (34, -0.066), (35, -0.068), (36, -0.121), (37, -0.068), (38, -0.082), (39, -0.024), (40, 0.132), (41, 0.095), (42, -0.036), (43, 0.049), (44, -0.023), (45, -0.041), (46, 0.037), (47, -0.005), (48, -0.011), (49, 0.17)]
simIndex simValue paperId paperTitle
same-paper 1 0.94131166 145 nips-2007-On Sparsity and Overcompleteness in Image Models
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
2 0.72378105 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.
3 0.69956678 133 nips-2007-Modelling motion primitives and their timing in biologically executed movements
Author: Ben Williams, Marc Toussaint, Amos J. Storkey
Abstract: Biological movement is built up of sub-blocks or motion primitives. Such primitives provide a compact representation of movement which is also desirable in robotic control applications. We analyse handwriting data to gain a better understanding of primitives and their timings in biological movements. Inference of the shape and the timing of primitives can be done using a factorial HMM based model, allowing the handwriting to be represented in primitive timing space. This representation provides a distribution of spikes corresponding to the primitive activations, which can also be modelled using HMM architectures. We show how the coupling of the low level primitive model, and the higher level timing model during inference can produce good reconstructions of handwriting, with shared primitives for all characters modelled. This coupled model also captures the variance profile of the dataset which is accounted for by spike timing jitter. The timing code provides a compact representation of the movement while generating a movement without an explicit timing model produces a scribbling style of output. 1
4 0.64897144 8 nips-2007-A New View of Automatic Relevance Determination
Author: David P. Wipf, Srikantan S. Nagarajan
Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted 1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1
5 0.63730192 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis
Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.
6 0.5373615 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
7 0.52782661 96 nips-2007-Heterogeneous Component Analysis
8 0.50512671 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
9 0.48025274 195 nips-2007-The Generalized FITC Approximation
10 0.46570259 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
11 0.46493882 182 nips-2007-Sparse deep belief net model for visual area V2
12 0.41176093 3 nips-2007-A Bayesian Model of Conditioned Perception
13 0.40752375 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
14 0.39462543 179 nips-2007-SpAM: Sparse Additive Models
15 0.39062521 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
16 0.38972259 19 nips-2007-Active Preference Learning with Discrete Choice Data
17 0.38952827 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
18 0.38100991 196 nips-2007-The Infinite Gamma-Poisson Feature Model
19 0.3725026 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
20 0.36916608 170 nips-2007-Robust Regression with Twinned Gaussian Processes
topicId topicWeight
[(5, 0.07), (13, 0.013), (16, 0.027), (18, 0.016), (21, 0.058), (35, 0.026), (47, 0.531), (83, 0.096), (85, 0.017), (87, 0.01), (90, 0.074)]
simIndex simValue paperId paperTitle
1 0.98053312 114 nips-2007-Learning and using relational theories
Author: Charles Kemp, Noah Goodman, Joshua B. Tenenbaum
Abstract: Much of human knowledge is organized into sophisticated systems that are often called intuitive theories. We propose that intuitive theories are mentally represented in a logical language, and that the subjective complexity of a theory is determined by the length of its representation in this language. This complexity measure helps to explain how theories are learned from relational data, and how they support inductive inferences about unobserved relations. We describe two experiments that test our approach, and show that it provides a better account of human learning and reasoning than an approach developed by Goodman [1]. What is a theory, and what makes one theory better than another? Questions like these are of obvious interest to philosophers of science but are also discussed by psychologists, who have argued that everyday knowledge is organized into rich and complex systems that are similar in many respects to scientific theories. Even young children, for instance, have systematic beliefs about domains including folk physics, folk biology, and folk psychology [2]. Intuitive theories like these play many of the same roles as scientific theories: in particular, both kinds of theories are used to explain and encode observations of the world, and to predict future observations. This paper explores the nature, use and acquisition of simple theories. Consider, for instance, an anthropologist who has just begun to study the social structure of a remote tribe, and observes that certain words are used to indicate relationships between selected pairs of individuals. Suppose that term T1(·, ·) can be glossed as ancestor(·, ·), and that T2(·, ·) can be glossed as friend(·, ·). The anthropologist might discover that the first term is transitive, and that the second term is symmetric with a few exceptions. Suppose that term T3(·, ·) can be glossed as defers to(·, ·), and that the tribe divides into two castes such that members of the second caste defer to members of the first caste. In this case the anthropologist might discover two latent concepts (caste 1(·) and caste 2(·)) along with the relationship between these concepts. As these examples suggest, a theory can be defined as a system of laws and concepts that specify the relationships between the elements in some domain [2]. We will consider how these theories are learned, how they are used to encode relational data, and how they support predictions about unobserved relations. Our approach to all three problems relies on the notion of subjective complexity. We propose that theory learners prefer simple theories, that people remember relational data in terms of the simplest underlying theory, and that people extend a partially observed data set according to the simplest theory that is consistent with their observations. There is no guarantee that a single measure of subjective complexity can do all of the work that we require [3]. This paper, however, explores the strong hypothesis that a single measure will suffice. Our formal treatment of subjective complexity begins with the question of how theories are mentally represented. We suggest that theories are represented in some logical language, and propose a specific first-order language that serves as a hypothesis about the “language of thought.” We then pursue the idea that the subjective complexity of a theory corresponds to the length of its representation in this language. Our approach therefore builds on the work of Feldman [4], and is related to other psychological applications of the notion of Kolmogorov complexity [5]. The complexity measure we describe can be used to define a probability distribution over a space of theories, and we develop a model of theory acquisition by using this distribution as the prior for a Bayesian learner. We also 1 (a) Star 11 (b) Bipartite (c) Exception 22 33 44 55 66 77 88 16 26 36 46 56 21 31 41 51 61 71 81 11 17 27 37 47 57 18 28 38 48 58 R(X, X). T(6). T(7). T(8). R(X, Y) ← ¯(X), T(Y). T R(X, 1). (d) Symmetric (e) Transitive 11 22 33 44 55 66 77 13 31 12 21 24 42 56 65 26 36 46 56 17 27 37 47 57 18 28 38 48 58 T(6). T(7). T(8). R(X, Y) ← ¯(X), T(Y). T ¯ R(1, 1). R(1, 6). (f) Random 12 21 13 23 14 24 34 13 32 14 24 34 15 25 35 45 16 26 36 46 56 51 52 35 54 61 26 63 46 56 R(1, 2). R(1, 3). R(2, 4). R(5, 6). R(1, 2). R(2, 3). R(3, 4). R(5, X). R(X, 4). R(X, Y) ← R(Y, X). R(X, X). R(4, 5). R(5, 6). R(X, Z) ← R(X, Y), R(Y, Z). R(2, 1). R(1, 3). R(6, 1). R(3, 2). R(2, 6). R(3, 5). R(6, 3). R(4, 6). ¯ ¯ ¯ R(X, X). R(6, 4). R(5, 3). Figure 1: Six possible extensions for a binary predicate R(·, ·). In each case, the objects in the domain are represented as digits, and a pair such as 16 indicates that R(1, 6) is true. Below each set of pairs, the simplest theory according to our complexity measure is shown. show how the same Bayesian approach helps to explain how theories support inductive generalization: given a set of observations, future observations (e.g. whether one individual defers to another) can be predicted using the posterior distribution over the space of theories. We test our approach by developing two experiments where people learn and make predictions about binary and ternary relations. As far as we know, the approach of Goodman [1] is the only other measure of theory complexity that has previously been tested as a psychological model [6]. We show that our experiments support our approach and raise challenges for this alternative model. 1 Theory complexity: a representation length approach Intuitive theories correspond to mental representations of some sort, and our first task is to characterize the elements used to build these representations. We explore the idea that a theory is a system of statements in a logical language, and six examples are shown in Fig. 1. The theory in Fig. 1b is related to the defers to(·, ·) example already described. Here we are interested in a domain including 9 elements, and a two place predicate R(·, ·) that is true of all and only the 15 pairs shown. R is defined using a unary predicate T which is true of only three elements: 6, 7, and 8. The theory includes a clause which states that R(X, Y) is true for all pairs XY such that T(X) is false and T(Y) is true. The theory in Fig. 1c is very similar, but includes an additional clause which specifies that R(1, 1) is true, and an exception which specifies that R(1, 6) is false. Formally, each theory we consider is a collection of function-free definite clauses. All variables are universally quantified: for instance, the clause R(X, Z) ← R(X, Y), R(Y, Z) is equivalent to the logical formula ∀x ∀y ∀z (R(x, z) ← R(x, y) ∧ R(y, z)). For readability, the theories in Fig. 1 include parentheses and arrows, but note that these symbols are unnecessary and can be removed. Our proposed language includes only predicate symbols, variable symbols, constant symbols, and a period that indicates when one clause finishes and another begins. Each theory in Fig. 1 specifies the extension of one or more predicates. The extension of predicate P is defined in terms of predicate P+ (which captures the basic rules that lead to membership in P) and predicate P− (which captures exceptions to these rules). The resulting extension of P is defined 2 as P+ \ P− , or the set difference of P+ and P− .1 Once P has been defined, later clauses in the theory may refer to P or its negation ¯. To ensure that our semantics is well-defined, the predicates P in any valid theory must permit an ordering so that the definition of any predicate does not refer to predicates that follow it in the order. Formally, the definition of each predicate P+ or P− can refer only to itself (recursive definitions are allowed) and to any predicate M or ¯ where M < P. M Once we have committed to a specific language, the subjective complexity of a theory is assumed to correspond to the number of symbols in its representation. We have chosen a language where there is one symbol for each position in a theory where a predicate, variable or constant appears, and one symbol to indicate when each clause ends. Given this language, the subjective complexity c(T ) of theory T is equal to the sum of the number of clauses in the theory and the number of positions in the theory where a predicate, variable or constant appears: c(T ) = #clauses(T ) + #pred slots(T ) + #var slots(T ) + #const slots(T ). (1) For instance, the clause R(X, Z) ← R(X, Y), R(Y, Z). contributes ten symbols towards the complexity of a theory (three predicate symbols, six variable symbols, and one period). Other languages might be considered: for instance, we could use a language which uses five symbols (e.g. five bits) to represent each predicate, variable and constant, and one symbol (e.g. one bit) to indicate the end of a clause. Our approach to subjective complexity depends critically on the representation language, but once a language has been chosen the complexity measure is uniquely specified. Although our approach is closely related to the notion of Kolmogorov complexity and to Minimum Message Length (MML) and Minimum Description Length (MDL) approaches, we refer to it as a Representation Length (RL) approach. A RL approach includes a commitment to a specific language that is proposed as a psychological hypothesis, but these other approaches aspire towards results that do not depend on the language chosen.2 It is sometimes suggested that the notion of Kolmogorov complexity provides a more suitable framework for psychological research than the RL approach, precisely because it allows for results that do not depend on a specific description language [8]. We subscribe to the opposite view. Mental representations presumably rely on some particular language, and identifying this language is a central challenge for psychological research. The language we described should be considered as a tentative approximation of the language of thought. Other languages can and should be explored, but our language has several appealing properties. Feldman [4] has argued that definite clauses are psychologically natural, and working with these representations allows our approach to account for several classic results from the concept learning literature. For instance, our language leads to the prediction that conjunctive concepts are easier to learn than disjunctive concepts [9].3 Working with definite clauses also ensures that each of our theories has a unique minimal model, which means that the extension of a theory can be defined in a particularly simple way. Finally, human learners deal gracefully with noise and exceptions, and our language provides a simple way to handle exceptions. Any concrete proposal about the language of thought should make predictions about memory, learning and reasoning. Suppose that data set D lists the extensions of one or more predicates, and that a theory is a “candidate theory” for D if it correctly defines the extensions of all predicates in D. Note that a candidate theory may well include latent predicates—predicates that do not appear in D, but are useful for defining the predicates that have been observed. We will assume that humans encode D in terms of the simplest candidate theory for D, and that the difficulty of memorizing D is determined by the subjective complexity of this theory. Our approach can and should be tested against classic results from the memory literature. Unlike some other approaches to complexity [10], for instance, our model predicts that a sequence of k items is about equally easy to remember regardless of whether the items are drawn from a set of size 2, a set of size 10, or a set of size 1000 [11]. 1 The extension of P+ is the smallest set that satisfies all of the clauses that define P+ , and the extension of P is defined similarly. To simplify our notation, Fig. 1 uses P to refer to both P and P+ , and ¯ to refer to ¯ and P P P− . Any instance of P that appears in a clause defining P is really an instance of P+ , and any instance of ¯ that P appears in a clause defining ¯ is really an instance of P− . P 2 MDL approaches also commit to a specific language, but this language is often intended to be as general as possible. See, for instance, the discussion of universal codes in Gr¨ nwald et al. [7]. u 3 A conjunctive concept C(·) can be defined using a single clause: C(X) ← A(X), B(X). The shortest definition of a disjunctive concept requires two clauses: D(X) ← A(X). D(X) ← B(X). − 3 To develop a model of inductive learning and reasoning, we take a Bayesian approach, and use our complexity measure to define a prior distribution over a hypothesis space of theories: P (T ) ∝ 2−c(T ) .4 Given this prior distribution, we can use Bayesian inference to make predictions about unobserved relations and to discover the theory T that best accounts for the observations in data set D [12, 13]. Suppose that we have a likelihood function P (D|T ) which specifies how the examples in D were generated from some underlying theory T . The best explanation for the data D is the theory that maximizes the posterior distribution P (T |D) ∝ P (D|T )P (T ). If we need to predict whether ground term g is likely to be true, 5 we can sum over the space of theories: P (g|D) = P (g|T )P (T |D) = T 1 P (D) P (D|T )P (T ) (2) T :g∈T where the final sum is over all theories T that make ground term g true. 1.1 Related work The theories we consider are closely related to logic programs, and methods for Inductive Logic Programming (ILP) explore how these programs can be learned from examples [14]. ILP algorithms are often inspired by the idea of searching for the shortest theory that accounts for the available data, and ILP is occasionally cast as the problem of minimizing an explicit MDL criterion [10]. Although ILP algorithms are rarely considered as cognitive models, the RL approach has a long psychological history, and is proposed by Chomsky [15] and Leeuwenberg [16] among others. Formal measures of complexity have been developed in many fields [17], and there is at least one other psychological account of theory complexity. Goodman [1] developed a complexity measure that was originally a philosophical proposal about scientific theories, but was later tested as a model of subjective complexity [6]. A detailed description of this measure is not possible here, but we attempt to give a flavor of the approach. Suppose that a basis is a set of predicates. The starting point for Goodman’s model is the intuition that basis B1 is at least as complex as basis B2 if B1 can be used to define B2. Goodman argues that this intuition is flawed, but his model is founded on a refinement of this intuition. For instance, since the binary predicate in Fig. 1b can be defined in terms of two unary predicates, Goodman’s approach requires that the complexity of the binary predicate is no more than the sum of the complexities of the two unary predicates. We will use Goodman’s model as a baseline for evaluating our own approach, and a comparison between these two models should be informed by both theoretical and empirical considerations. On the theoretical side, our approach relies on a simple principle for deciding which structural properties are relevant to the measurement of complexity: the relevant properties are those with short logical representations. Goodman’s approach incorporates no such principle, and he proposes somewhat arbitrarily that reflexivity and symmetry are among the relevant structural properties but that transitivity is not. A second reason for preferring our model is that it makes contact with a general principle—the idea that simplicity is related to representation length—that has found many applications across psychology, machine learning, and philosophy. 2 Experimental results We designed two experiments to explore settings where people learn, remember, and make inductive inferences about relational data. Although theories often consist of systems of many interlocking relations, we keep our experiments simple by asking subjects to learn and reason about a single relation at a time. Despite this restriction, our experiments still make contact with several issues raised by systems of relations. As the defers to(·, ·) example suggests, a single relation may be best explained as the observable tip of a system involving several latent predicates (e.g. caste 1(·) and caste 2(·)). 4 To ensure that this distribution can be normalized, we assume that there is some upper bound on the number of predicate symbols, variable symbols, and constants, and on the length of the theories we will consider. There will therefore be a finite number of possible theories, and our prior will be a valid probability distribution. 5 A ground term is a term such as R(8, 9) that does not include any variables. 4 Learning time Complexity (RL) Complexity (Human) 6 300 Complexity (Goodman) 4 20 0 0 0 star bprt excp sym trans rand 2 0 star bprt excp sym trans rand 2 star bprt excp sym trans rand 100 200 star bprt excp sym trans rand 4 40 Figure 2: (a) Average time in seconds to learn the six sets in Fig. 1. (b) Average ratings of set complexity. (c) Complexity scores according to our representation length (RL) model. (d) Complexity scores according to Goodman’s model. 2.1 Experiment 1: memory and induction In our first experiment, we studied the subjective complexity of six binary relations that display a range of structural properties, including reflexivity, symmetry, and transitivity. Materials and Methods. 18 adults participated in this experiment. Subjects were required to learn the 6 sets shown in Fig. 1, and to make inductive inferences about each set. Although Fig. 1 shows pairs of digits, the experiment used letter pairs, and the letters for each condition and the order in which these conditions were presented were randomized across subjects. The pairs for each condition were initially laid out randomly on screen, and subjects could drag them around and organize them to help them understand the structure of the set. At any stage, subjects could enter a test phase where they were asked to list the 15 pairs belonging to the current set. Subjects who made an error on the test were returned to the learning phase. After 9 minutes had elapsed, subjects were allowed to pass the test regardless of how many errors they made. After passing the test, subjects were asked to rate the complexity of the set compared to other sets with 15 pairs. Ratings were provided on a 7 point scale. Subjects were then asked to imagine that a new letter (e.g. letter 9) had belonged to the current alphabet, and were given two inductive tasks. First they were asked to enter between 1 and 10 novel pairs that they might have expected to see (each novel pair was required to include the new letter). Next they were told about a novel pair that belonged to the set (e.g. pair 91), and were again asked to enter up to 10 additional pairs that they might have expected to see. Results. The average time needed to learn each set is shown in Fig. 2a, and ratings of set complexity are shown in Fig. 2b. It is encouraging that these measures yield converging results, but they may be confounded since subjects rated the complexity of a set immediately after learning it. The complexities plotted in Fig. 2c are the complexities of the theories shown in Fig. 1, which we believe to be the simplest theories according to our complexity measure. The final plot in Fig. 2 shows complexities according to Goodman’s model, which assigns each binary relation an integer between 0 and 4. There are several differences between these models: for instance, Goodman’s account incorrectly predicts that the exception case is the hardest of the six, but our model acknowledges that a simple theory remains simple if a handful of exceptions are added. Goodman’s account also predicts that transitivity is not an important structural regularity, but our model correctly predicts that the transitive set is simpler than the same set with some of the pairs reversed (the random set). Results for the inductive task are shown in Fig. 3. The first two columns show the number of subjects who listed each novel pair. The remaining two columns show the probability of set membership predicted by our model. To generate these predictions, we applied Equation 2 and summed over a set of theories created by systematically extending the theories shown in Fig. 1. Each extended theory includes up to one additional clause for each predicate in the base theory, and each additional clause includes at most two predicate slots. For instance, each extended theory for the bipartite case is created by choosing whether or not to add the clause T(9), and adding up to one clause for predicate R.6 For the first inductive task, the likelihood term P (D|T ) (see Equation 2) is set to 0 for all theories that are not consistent with the pairs observed during training, and to a constant for all remaining theories. For the second task we assumed in addition that the novel pair observed is 6 R(9, X), ¯(2, 9), and R(X, 9) ← R(X, 2) are three possible additions. R 5 18 9 9 0.5 trans symm excep bipart 0 91 random r=0.99 1 star 18 99 19 0 91 89 99 19 89 0.5 0 91 18 18 1 9 9 99 19 r=0.96 89 0.5 0 91 99 19 0 91 89 99 19 89 18 9 1 99 19 89 r=0.98 1 9 0.5 0 91 99 19 0 91 89 99 19 89 18 9 99 19 89 0.5 0 81 88 18 0 78 81 88 18 78 0 18 18 9 0 0 71 77 17 67 71 77 17 67 18 18 81 9 88 18 r=0.62 78 71 77 17 67 Human (no examples) 0 71 77 17 67 Human (1 example) 0 0 91 99 19 89 r=0.99 0 81 88 18 78 71 77 17 67 1 71 77 17 67 r=0.38 0.5 0 89 r=0.93 0.5 1 9 99 19 r=0.99 1 0.5 0 89 r=0.99 0.5 1 9 0 91 1 r=0.88 1 9 99 19 0.5 0 91 18 0 91 0.5 0 91 18 r=0.99 1 0 r=0.74 1 0.5 71 77 17 67 RL (no examples) 0 71 77 17 67 RL (one example) Figure 3: Data and model predictions for the induction task in Experiment 1. Columns 1 and 3 show predictions before any pairs involving the new letter are observed. Columns 2 and 4 show predictions after a single novel pair (marked with a gray bar) is observed to belong to the set. The model plots for each condition include correlations with the human data. sampled at random from all pairs involving the new letter.7 All model predictions were computed using Mace4 [18] to generate the extension of each theory considered. The supporting material includes predictions for a model based on the Goodman complexity measure and an exemplar model which assumes that the new letter will be just like one of the old letters.8 The exemplar model outperforms our model in the random condition, and makes accurate predictions about three other conditions. Overall, however, our model performs better than the two baselines. Here we focus on two important predictions that are not well handled by the exemplar model. In the symmetry condition, almost all subjects predict that 78 belongs to the set after learning that 87 belongs to the set, suggesting that they have learned an abstract rule. In the transitive condition, most subjects predict that pairs 72 through 76 belong to the set after learning that 71 belongs to the set. Our model accounts for this result, but the exemplar model has no basis for making predictions about letter 7, since this letter is now known to be unlike any of the others. 2.2 Experiment 2: learning from positive examples During the learning phase of our first experiment, subjects learned a theory based on positive examples (the theory included all pairs they had seen) and negative examples (the theory ruled out all pairs they had not seen). Often, however, humans learn theories based on positive examples alone. Suppose, for instance, that our anthropologist has spent only a few hours with a new tribe. She may have observed several pairs who are obviously friends, but should realize that many other pairs of friends have not yet interacted in her presence. 7 For the second task, P (D|T ) is set to 0 for theories that are inconsistent with the training pairs and theories 1 which do not include the observed novel pair. For all remaining theories, P (D|T ) is set to n , where n is the total number of novel pairs that are consistent with T . 8 Supporting material is available at www.charleskemp.com 6 1 221 331 441 551 c) 7 1 R(X, X, Y). 221 443 552 663 d) 7 1 R(X, Y, Z). 231 456 615 344 e) 7 1 −10 −5 −0.1 −10 −20 −20 −10 −0.2 −20 777 771 778 789 237 777 771 778 789 237 −10 231 234 235 236 0 777 771 778 789 237 0 777 771 778 789 237 0 777 771 778 789 237 0 RL 0 R(2, 3, X). 777 771 778 789 237 7 R(X, X, 1). 777 771 778 789 237 b) 777 771 778 789 237 1 111 222 333 444 777 771 778 789 237 7 R(X, X, X). 777 771 778 789 237 Human a) Figure 4: Data and model predictions for Experiment 2. The four triples observed for each set are shown at the top of the figure. The first row of plots shows average ratings on a scale from 1 (very unlikely to belong to the set) to 7 (very likely). Model predictions are plotted as log probabilities. Our framework can handle cases like these if we assume that the data D in Equation 2 are sampled from the ground terms that are true according to the underlying theory. We follow [10] and [13] and use a distribution P (D|T ) which assumes that the examples in D are randomly sampled with replacement from the ground terms that are true. This sampling assumption encourages our model to identify the theory with the smallest extension that is compatible with all of the training examples. We tested this approach by designing an experiment where learners were given sets of examples that were compatible with several underlying theories. Materials and Methods. 15 adults participated in this experiment immediately after taking Experiment 1. In each of five conditions, subjects were told about a set of triples built from an alphabet of 9 letters. They were shown four triples that belonged to the set (Fig. 4), and told that the set might include triples that they had not seen. Subjects then gave ratings on a seven point scale to indicate whether five additional triples (see Fig. 4) were likely to belong to the set. Results. Average ratings and model predictions are shown in Fig. 4. Model predictions for each condition were computed using Equation 2 and summing over a space of theories that included the five theories shown at the top of Fig. 4, variants of these five theories which stated that certain pairs of slots could not be occupied by the same constant,9 and theories that included no variables but merely enumerated up to 5 triples.10 Although there are general theories like R(X, Y, Z) that are compatible with the triples observed in all five conditions, Fig. 4 shows that people were sensitive to different regularities in each case.11 We focus on one condition (Fig. 4b) that exposes the strengths and weaknesses of our model. According to our model, the two most probable theories given the triples for this condition are R(X, X, 1) and the closely related variant that rules out R(1, 1, 1). The next most probable theory is R(X, X, Y). These predictions are consistent with people’s judgments that 771 is very likely to belong to the set, and that 778 is the next most likely option. Unlike our model, however, people consider 777 to be substantially less likely than 778 to belong to the set. This result may suggest that the variant of R(X, X, Y) that rules out R(X, X, X) deserves a higher prior probability than our model recognizes. To better account for cases like this, it may be worth considering languages where any two variables that belong to the same clause but have different names must refer to different entities. 3 Discussion and Conclusion There are many psychological models of concept learning [4, 12, 13], but few that use representations rich enough to capture the content of intuitive theories. We suggested that intuitive theories are mentally represented in a first-order logical language, and proposed a specific hypothesis about 9 ¯ One such theory includes two clauses: R(X, X, Y). R(X, X, X). One such theory is the following list of clauses: R(2, 2, 1). R(3, 3, 1). R(4, 4, 1). R(5, 5, 1). R(7, 7, 7). 11 Similar results have been found with 9-month old infants. Cases like Figs. 4b and 4c have been tested in an infant language-learning study where the stimuli were three-syllable strings [19]. 9-month old infants exposed to strings like the four in Fig. 4c generalized to other strings consistent with the theory R(X, X, Y), but infants in the condition corresponding to Fig. 4b generalized only to strings consistent with the theory R(X, X, 1). 10 7 this “language of thought.” We assumed that the subjective complexity of a theory depends on the length of its representation in this language, and described experiments which suggest that the resulting complexity measure helps to explain how theories are learned and used for inductive inference. Our experiments deliberately used stimuli that minimize the influence of prior knowledge. Theories, however, are cumulative, and the theory that seems simplest to a learner will often depend on her background knowledge. Our approach provides a natural place for background knowledge to be inserted. A learner can be supplied with a stock of background predicates, and the shortest representation for a data set will depend on which background predicates are available. Since different sets of predicates will lead to different predictions about subjective complexity, empirical results can help to determine the background knowledge that people bring to a given class of problems. Future work should aim to refine the representation language and complexity measure we proposed. We expect that something like our approach will be suitable for modeling a broad class of intuitive theories, but the specific framework presented here can almost certainly be improved. Future work should also consider different strategies for searching the space of theories. Some of the strategies developed in the ILP literature should be relevant [14], but a detailed investigation of search algorithms seems premature until our approach has held up to additional empirical tests. It is comparatively easy to establish whether the theories that are simple according to our approach are also considered simple by people, and our experiments have made a start in this direction. It is much harder to establish that our approach captures most of the theories that are subjectively simple, and more exhaustive experiments are needed before this conclusion can be drawn. Boolean concept learning has been studied for more than fifty years [4, 9], and many psychologists have made empirical and theoretical contributions to this field. An even greater effort will be needed to crack the problem of theory learning, since the space of intuitive theories is much richer than the space of Boolean concepts. The difficulty of this problem should not be underestimated, but computational approaches can contribute part of the solution. Acknowledgments Supported by the William Asbjornsen Albert memorial fellowship (CK), the James S. McDonnell Foundation Causal Learning Collaborative Initiative (NDG, JBT) and the Paul E. Newton chair (JBT). References [1] N. Goodman. The structure of appearance. 2nd edition, 1961. [2] S. Carey. Conceptual change in childhood. MIT Press, Cambridge, MA, 1985. [3] H. A. Simon. Complexity and the representation of patterned sequences of symbols. Psychological Review, 79:369–382, 1972. [4] J. Feldman. An algebra of human concept learning. JMP, 50:339–368, 2006. [5] N. Chater and P. Vitanyi. Simplicity: a unifying principle in cognitive science. TICS, 7:19–22, 2003. [6] J. T. Krueger. A theory of structural simplicity and its relevance to aspects of memory, perception, and conceptual naturalness. PhD thesis, University of Pennsylvania, 1979. [7] P. Gr¨ nwald, I. J. Myung, and M. Pitt, editors. Advances in Minimum Description Length: Theory and u Applications. 2005. [8] N. Chater. Reconciling simplicity and likelihood principles in perceptual organization. Psychological Review, 103:566–581, 1996. [9] J. A. Bruner, J. S. Goodnow, and G. J. Austin. A study of thinking. Wiley, 1956. [10] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994. [11] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(1):81–97, 1956. [12] N. D. Goodman, T. L. Griffiths, J. Feldman, and J. B. Tenenbaum. A rational analysis of rule-based concept learning. In CogSci, 2007. [13] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. BBS, 24:629–641, 2001. [14] S. Muggleton and L. De Raedt. Inductive logic programming: theory and methods. Journal of Logic Programming, 19-20:629–679, 1994. [15] N. Chomsky. The logical structure of linguistic theory. University of Chicago Press, Chicago, 1975. [16] E. L. J. Leeuwenberg. A perceptual coding language for visual and auditory patterns. American Journal of Psychology, 84(3):307–349, 1971. [17] B. Edmonds. Syntactic measures of complexity. PhD thesis, University of Manchester, 1999. [18] W. McCune. Mace4 reference manual and guide. Technical Report ANL/MCS-TM-264, Argonne National Laboratory, 2003. [19] L. Gerken. Decisions, decisions: infant language learning when multiple generalizations are possible. Cognition, 98(3):67–74, 2006. 8
2 0.97665143 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
Author: Gerald Tesauro, Rajarshi Das, Hoi Chan, Jeffrey Kephart, David Levine, Freeman Rawson, Charles Lefurgy
Abstract: Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO2 emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers’ firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies as well as obvious “cookbook” RL implementations. 1
3 0.97210681 99 nips-2007-Hierarchical Penalization
Author: Marie Szafranski, Yves Grandvalet, Pierre Morizet-mahoudeaux
Abstract: Hierarchical penalization is a generic framework for incorporating prior information in the fitting of statistical models, when the explicative variables are organized in a hierarchical structure. The penalizer is a convex functional that performs soft selection at the group level, and shrinks variables within each group. This favors solutions with few leading terms in the final combination. The framework, originally derived for taking prior knowledge into account, is shown to be useful in linear regression, when several parameters are used to model the influence of one feature, or in kernel regression, for learning multiple kernels. Keywords – Optimization: constrained and convex optimization. Supervised learning: regression, kernel methods, sparsity and feature selection. 1
4 0.96804446 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
Author: Gonzalo Carvajal, Waldo Valenzuela, Miguel Figueroa
Abstract: We describe an analog-VLSI neural network for face recognition based on subspace methods. The system uses a dimensionality-reduction network whose coefficients can be either programmed or learned on-chip to perform PCA, or programmed to perform LDA. A second network with userprogrammed coefficients performs classification with Manhattan distances. The system uses on-chip compensation techniques to reduce the effects of device mismatch. Using the ORL database with 12x12-pixel images, our circuit achieves up to 85% classification performance (98% of an equivalent software implementation). 1
same-paper 5 0.95973325 145 nips-2007-On Sparsity and Overcompleteness in Image Models
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
6 0.76557225 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
7 0.74626446 100 nips-2007-Hippocampal Contributions to Control: The Third Way
8 0.71128857 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
9 0.69863212 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
10 0.69740033 169 nips-2007-Retrieved context and the discovery of semantic structure
11 0.69564152 86 nips-2007-Exponential Family Predictive Representations of State
12 0.69472426 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
13 0.69350308 158 nips-2007-Probabilistic Matrix Factorization
14 0.69251454 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
15 0.69095075 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
16 0.68478268 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
17 0.68297118 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
18 0.6811645 164 nips-2007-Receptive Fields without Spike-Triggering
19 0.67972273 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
20 0.67850018 43 nips-2007-Catching Change-points with Lasso