nips nips2003 nips2003-4 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Maneesh Sahani
Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. [sent-6, score-0.396]
2 By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. [sent-7, score-0.693]
3 This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. [sent-8, score-0.596]
4 This learning has measurable physiological correlates in terms of changes in the stimulusresponse properties of individual neurons in the sensory systems of the brain (as well as in many other areas). [sent-10, score-0.52]
5 While passive exposure to sensory stimuli can have profound effects on the developing sensory cortex, significant plasticity in mature animals tends to be observed only in situations where sensory stimuli are associated with either behavioural or electrical reinforcement. [sent-11, score-1.506]
6 To be complete, understanding of sensory plasticity must come at two different levels. [sent-13, score-0.472]
7 At a mechanistic level, it is important to understand how synapses are modified, and how synaptic modifications can lead to observed changes in the response properties of cells. [sent-14, score-0.124]
8 Numerous experiments and models have addressed these questions of how sensory plastic- ity occurs. [sent-15, score-0.417]
9 Measured changes in sensory representation must underlie an adaptive change in neural information processing. [sent-17, score-0.503]
10 If we can understand the processing goals of sensory systems, and therefore understand how changes in representation advance these goals in the face of changing experience, we will have shed light on the question of why sensory plasticity occurs. [sent-18, score-1.003]
11 We therefore develop and validate (through simulation) an alternative optimisation approach based on the statistical technique of importance sampling. [sent-22, score-0.096]
12 2 Model The standard algorithms of reinforcement learning (RL) deal with an agent that receives rewards or penalties as it interacts with a world of known structure and, generally Markovian, dynamics [1]. [sent-23, score-0.319]
13 The agent passes through a series of “states”, choosing in each one an action which results (perhaps stochastically) in a payoff and in a transition to another state. [sent-24, score-0.103]
14 Associated with each state (or state-action pair) and a given policy of action is a value, which represents the expected payoff that would be received if the policy were to be followed starting from that initial state (and initial action). [sent-25, score-0.102]
15 Often the state that the agent occupies at each point in time is assumed to be directly observable. [sent-27, score-0.1]
16 In other cases, the agent receives only partial information about the state it occupies, although in almost all studies the basic structure of the world is assumed to be known. [sent-28, score-0.122]
17 In these partially observable models, then, the state information (which might be thought of as a form of sensory input) is used to estimate which one of a known group of states is currently occupied, and so a natural representation emerges in terms of a belief-distribution over states. [sent-29, score-0.503]
18 Instead, the agent must simultaneously discover a representation of the sensory inputs suitable for predicting the reinforcement value, and learn the action-contingent value function itself. [sent-31, score-0.719]
19 In probabilistic terms, solving it exactly would require coping with a complicated joint distribution over representational structures and value functions. [sent-33, score-0.126]
20 However, using an analogy to the variational inference methods of unsupervised learning [2], we might modularise our approach by factoring this joint into independent distributions over the sensory representation on the one hand and the value function on the other. [sent-34, score-0.595]
21 In this framework approximate estimation might proceed iteratively, using the current value function to tune the sensory representation, and then re¨ stimating the value function for the revised sensory encoding. [sent-35, score-0.932]
22 e The present work, being concerned with the way in which reinforcement guides sensory representational learning, focuses exclusively on the first of these two steps. [sent-36, score-0.713]
23 Thus, we take the value associated with the current sensory input to be given. [sent-37, score-0.5]
24 In many of the reinforcement schedules used in physiological experiments, however, the value is easily determined. [sent-39, score-0.24]
25 For example, in a classical conditioning paradigm the value is independent of action, and is given by the sum of the current reinforcement and the discounted average reinforcement received. [sent-40, score-0.415]
26 Our problem, then, is to develop a biologically plausible algorithm which is able to find a representation of the sensory input which facili- tates prediction of the value. [sent-41, score-0.563]
27 Although our eventual goal clearly fits well in the framework of RL, we find it useful to start from a standard theoretical account of unsupervised representational learning. [sent-42, score-0.185]
28 The view we adopt fits well with a Helmholtzian account of perceptual processing, in which the sensory cortex interprets the activities of receptors so as to infer the state of the external world that must have given rise to the observed pattern of activation. [sent-43, score-0.574]
29 Perception, by this account, may be thought of as a form of probabilistic inference in a generative model. [sent-44, score-0.057]
30 The general structure of such a model involves a set of latent variables or causes whose values directly reflect the underlying state of the world, along with a parameterisation of effects of these causes on immediate sensory experience. [sent-45, score-0.711]
31 Taken together, these variables would provide a causal account for observations that correspond to photoreceptor activation. [sent-47, score-0.076]
32 To apply such a framework as a model for cortical processing, then, we take the sensory cortical activity to represent the inferred values of the latent variables. [sent-48, score-0.742]
33 Thus, perceptual inference in this framework involves estimating the values of the causal variables that gave rise to the sensory input, while developmental (unsupervised) learning involves discovering the correct causal structure from sensory experience. [sent-49, score-1.079]
34 Such a treatment has been used to account for the structure of simple-cell receptive fields in visual cortex [3, 4], and has been extended to further visual cortical response properties in subsequent studies. [sent-50, score-0.237]
35 Thus, in addition to the latent causes Li that generate a sensory event Si , we consider an associated (possibly action-contingent) value Vi . [sent-52, score-0.641]
36 In particular, this means that whatever information Si carries about Vi is expressed (if the model is well-fit) in the cortical representation Li , making this structure appropriate for value prediction. [sent-55, score-0.175]
37 The causal variables Li have taken on the rˆ le of the o “state” in standard RL. [sent-56, score-0.076]
38 3 Objective function The natural objective in reinforcement learning is to maximise some form of accumulated reward. [sent-57, score-0.343]
39 That is, the parameters modelled (those determining the responses in the sensory cortex, rather than in associative or motor areas) do not directly control actions or policies of action. [sent-59, score-0.478]
40 Instead, these descriptive parameters only influence the animal’s accumulated reinforcement through the accuracy of the description they generate. [sent-60, score-0.255]
41 As a result, even though the ultimate objective may be to maximise total reward, we need to use objective functions that are closer in spirit to the likelihoods common in probabilistic unsupervised learning. [sent-61, score-0.211]
42 In particular, we consider functions of the form L(θ) = α(Vi ) log Pθ (Si ) + β(Vi ) log Pθ (Vi | Si ) i (2) In this expression, the two log probabilities reflect the accuracy of stimulus representation, and of value prediction, respectively. [sent-62, score-0.243]
43 These two terms would appear alone in a straightforward representational model of the joint distribution over sensory stimuli and values. [sent-63, score-0.552]
44 4 Learning While the objective function (2) does not depend explicitly on the cortical representation variables, it does depend on their distributions, through the marginal likelihoods Pθ (Si ) = dLi Pθ (Si , Li ) and Pθ (Vi | Si ) = dLi Pθ (Vi , Li | Si ). [sent-65, score-0.188]
45 We introduce 2N unknown probability distributions over the cortical representation, Qα (Li ) and Qβ (Li ). [sent-68, score-0.098]
46 It will be useful to rewrite (3c) as θ ← θ + η α(Vi ) θ log Pθ (Si , Li ) Qα (Li ) + β(Vi ) +β(Vi ) θ θ log Pθ (Li | Si ) log Pθ (Vi | Li ) Qβ (Li ) Qβ (Li ) (3c ) where the conditioning on Si in the final term in not needed due to the Markovian structure of the model. [sent-73, score-0.19]
47 5 Biologically Plausible Learning Could something like the updates of (3) underlie the task- or neuromodulator-driven changes that are seen in sensory cortex? [sent-74, score-0.491]
48 In (3a), the distribution Pθ (Li | Si ) represents the animal’s beliefs about the latent causes that led to the current sensory experience, and as such is the usual product of perceptual inference. [sent-76, score-0.63]
49 In (3c ), the various log probabilities involved are similarly natural products of perceptual or predictive computations. [sent-77, score-0.112]
50 First, the sensory input, Si , and the information needed to assess its associated value, Vi , often arrive at quite different times. [sent-80, score-0.454]
51 However, construction of the posterior distribution in its full detail requires simultaneous knowledge of both Si and Vi , and would therefore only be possible if rich information about the sensory stimulus were to be preserved until the associated value could be determined. [sent-81, score-0.526]
52 The feasibility of such detailed persistence of sensory information is unclear. [sent-82, score-0.417]
53 The connections from receptor epithelium to sensory areas of cortex are extensive, easily capable of conveying the information needed to estimate P(L | S). [sent-84, score-0.534]
54 By contrast, the brain structures that seem to be associated with the evaluation of reinforcement, such as the ventral tegmental area or nucleus basalis, make only sparse projections to early sensory cortex; and these projections are frequently modulatory in character, rather than synaptic. [sent-85, score-0.511]
55 It might seem at first that the former of these two problems would also apply to the weight α(Vi ) (in the first term of (3c )), in that execution of this portion of the update would also need to be delayed until this value-dependent weight could be calculated. [sent-87, score-0.089]
56 One possible way to do this is to actually carry out an update of the weights when just the sensory stimulus is known, but then consolidate this learning (or not) as indicated by the value-related weight. [sent-91, score-0.579]
57 Such a consolidation signal might easily be carried by a neuromodulatory projection from subcortical nuclei involved in the evaluation of reinforcement. [sent-92, score-0.17]
58 We propose to solve the problem posed by P(L | S, V ) in essentially the same way, that is by using information about reinforcement-value to guide modulatory reweighting or consolidation of synaptic changes that are initially based on the sensory stimulus alone. [sent-93, score-0.588]
59 Of course, sampling from this distribution is no more compatible with the foregoing biological constraints than integrating over it. [sent-96, score-0.057]
60 This approach to learning, which exploits the standard statistical technique of importance sampling [6], resolves both of the difficulties discussed above. [sent-99, score-0.093]
61 It implies that reinforcement-related processing and learning in the sensory systems of the brain proceeds in these stages: 1. [sent-100, score-0.473]
62 The sensory input is processed to infer beliefs about the latent causes Pθ (Li | Si ). [sent-101, score-0.62]
63 Synaptic weights are updated to follow the gradients θ log Pθ (Si , Li ) ˜ and θ log Pθ (Li | Si ) (corresponding to the first two terms of (3c ). [sent-104, score-0.168]
64 The associated value is predicted, both on the basis of the full posterior, giving ˜ Pθ (Vi | Si ), and on the basis of the sample(s), giving Pθ (Vi | Li ). [sent-106, score-0.062]
65 The actual value is observed or estimated, facilitating calculation of the weights ˜ α(Vi ), β(Vi ), and w(Li ). [sent-108, score-0.079]
66 These weights are conveyed to sensory cortex and used to consolidate (or not) the synaptic changes of step 2. [sent-110, score-0.645]
67 Such updates could be undertaken once the associated value became apparent; however, the parameters that represent the explicit dependence of value on the latent variables are unlikely to lie in the sensory cortex itself (instead determining computations in subsequent processing). [sent-112, score-0.763]
68 1 Distributional Sampling A commonly encountered difficulty with importance sampling has to do with the distribution of importance weights wi . [sent-114, score-0.204]
69 If the range of weights is too extensive, the optimisation will be driven primarily by few large weights, leading to slow and noisy learning. [sent-115, score-0.093]
70 Fortunately, it is possible to formulate an alternative, in which distributions over the cortical representational variables, rather than samples of the variables themselves, are randomly generated and weighted appropriately. [sent-116, score-0.258]
71 1 Let Pi (L) be a distribution over the latent causes L, drawn randomly from a functional distribution P(Pi | Si ), such that Pi (L) = P(Li | Si ). [sent-117, score-0.181]
72 Then, by analogy with P(Pi |Si ) the result above, it can be shown that given importance weights dL P(Vi | L)Pi (L) w(Pi ) = , P(Vi | Si ) (4) we have θ ˜ log Pθ (Li | Si ) Pi (L) w(Pi ) = Pi ∼P(Pi |Si ) θ log Pθ (Li | Si ) ˜ Li ∼P(Li |Si ,Vi ) . [sent-118, score-0.225]
73 These distributional samples can thus be used in almost exactly the same manner as the single-valued samples described above. [sent-119, score-0.095]
74 6 Simulation A paradigmatic generative model structure is that underlying factor analysis (FA) [7], in which both latent and observed variables are normally distributed: Pθ (Li ) = N (0, I) ; Pθ (Si | Li ) = N (ΛS Li , ΨS ) ; Pθ (Vi | Li ) = N (ΛV Li , ΨV ) . [sent-120, score-0.223]
75 (5) 1 This sampling scheme can also be formalised as standard importance sampling carried out with a cortical representation re-expressed in terms of the parameters determining the distribution Pi (L). [sent-121, score-0.308]
76 5 0 generative weights relative amplitude b 1 0. [sent-123, score-0.14]
77 5 0 unweighted learning relative amplitude c 1 0. [sent-124, score-0.055]
78 5 0 weighted learning sensory input dimension Figure 1: Generative and learned sensory weights. [sent-125, score-0.881]
79 This model is similar in its linear generative structure to the independent components analysis models that have previously been employed in accounts of unsupervised development of visual cortical properties [3, 4]; the only difference is in the assumed distribution of the latent variables. [sent-128, score-0.373]
80 We used a PFA-based simulation to verify that the distributional importance-weighted sampling procedure described here is indeed able to learn the correct model given sensory and reinforcement-value data. [sent-131, score-0.504]
81 Random vectors representing sensory inputs and associated values were generated according to (5); these were then used as inputs to a learning system. [sent-132, score-0.48]
82 The objective function optimised had both value-dependent weights α(Vi ) and β(Vi ) set to unity; thus the learning system simply attempted to model the joint distribution of sensory and reinforcement data. [sent-133, score-0.756]
83 The generative model comprised 11 latent variables, 40 observed sensory variables (which were arranged linearly so as to represent 40 discrete values along a single sensory axis), and a single reinforcement variable. [sent-134, score-1.233]
84 Ten of the latent variables only affected the sensory observations. [sent-135, score-0.587]
85 These features of the generative weight matrix were essential for PFA to be able to recover the generative model uniquely. [sent-139, score-0.144]
86 The final latent variable affected both reinforcement value and the sensory input at a single point (indicated by the dashed line in figure 1a). [sent-140, score-0.791]
87 Since the output noise matrix in PFA can associate arbitrary variance with each sensory variable, a model fit to only the sensory data would treat the influence of this latent cause as noise. [sent-141, score-0.944]
88 Only when the joint distribution over both sensory input and reinforcement is modelled will this aspect of the sensory data be captured in the model parameters. [sent-142, score-1.068]
89 In one case learning proceeded according to the sampled distributions Pi , with no importance weighting. [sent-150, score-0.083]
90 In the other, learning was modulated by the importance weights given by (4). [sent-151, score-0.137]
91 In particular, in both cases the reinforcement predictive weights ΛV were estimated, and in both cases the orthogonality constraint of PFA was applied to the combined estimated weight matrix [ΛS , ΛV ]. [sent-153, score-0.327]
92 Figure 1b and c shows the sensory weights ΛS learnt by each of these procedures (again the curves have been rescaled to show relative weights). [sent-154, score-0.493]
93 Both algorithms recovered the basic tuning properties; however, only the importance sampling algorithm was able to model the additional data feature that was linked to the prediction of reinforcement value. [sent-155, score-0.288]
94 The fact that in all other regards the two learning simulations were identical demonstrates that the importance weighting procedure (rather than, say, the orthogonality constraint) was responsible for this difference. [sent-156, score-0.181]
95 7 Summary This paper has presented a framework within which the experimentally observed impact of behavioural reinforcement on sensory plasticity might be understood. [sent-157, score-0.76]
96 This framework rests on a similar foundation to the recent work that has related unsupervised learning to sensory response properties. [sent-158, score-0.552]
97 It extends this foundation to consider prediction of the reinforcement value associated with sensory stimuli. [sent-159, score-0.699]
98 Direct learning by expectation-maximisation within this framework poses difficulties regarding biological plausibility. [sent-160, score-0.092]
99 However, these were resolved by the introduction of an importance sampled approach, along with its extension to distributional sampling. [sent-161, score-0.128]
100 Information about reinforcement is thus carried by a weighting signal that might be identified with the neuromodulatory signals in the brain. [sent-162, score-0.333]
wordName wordTfidf (topN-words)
[('li', 0.578), ('sensory', 0.417), ('vi', 0.353), ('si', 0.352), ('reinforcement', 0.195), ('latent', 0.11), ('representational', 0.101), ('cortical', 0.098), ('dli', 0.087), ('pi', 0.085), ('cortex', 0.072), ('pfa', 0.069), ('unsupervised', 0.065), ('log', 0.057), ('biologically', 0.057), ('importance', 0.057), ('generative', 0.057), ('plasticity', 0.055), ('weights', 0.054), ('causes', 0.052), ('neuromodulatory', 0.052), ('distributional', 0.051), ('maximise', 0.051), ('agent', 0.049), ('fa', 0.048), ('stimulus', 0.047), ('behavioural', 0.045), ('synaptic', 0.04), ('causal', 0.039), ('optimisation', 0.039), ('objective', 0.038), ('associated', 0.037), ('variables', 0.037), ('sampling', 0.036), ('plausible', 0.035), ('consolidate', 0.035), ('developmental', 0.035), ('stimuli', 0.034), ('accumulated', 0.033), ('representation', 0.033), ('rl', 0.032), ('action', 0.031), ('perceptual', 0.031), ('consolidation', 0.03), ('maximisations', 0.03), ('mechanistic', 0.03), ('nuclei', 0.03), ('weight', 0.03), ('world', 0.03), ('brain', 0.03), ('amplitude', 0.029), ('might', 0.029), ('carried', 0.029), ('weighting', 0.028), ('descriptive', 0.027), ('maneesh', 0.027), ('mature', 0.027), ('modulatory', 0.027), ('occupies', 0.027), ('regards', 0.027), ('changes', 0.027), ('understand', 0.027), ('learning', 0.026), ('markovian', 0.026), ('optimised', 0.026), ('poses', 0.026), ('receptor', 0.026), ('underlie', 0.026), ('foundation', 0.025), ('dif', 0.025), ('value', 0.025), ('state', 0.024), ('visual', 0.024), ('orthogonality', 0.024), ('policies', 0.024), ('predictive', 0.024), ('animals', 0.023), ('payoff', 0.023), ('affected', 0.023), ('experience', 0.022), ('rescaled', 0.022), ('samples', 0.022), ('biological', 0.021), ('updates', 0.021), ('input', 0.021), ('maxima', 0.02), ('physiological', 0.02), ('beliefs', 0.02), ('resolved', 0.02), ('em', 0.02), ('framework', 0.019), ('areas', 0.019), ('animal', 0.019), ('likelihoods', 0.019), ('structure', 0.019), ('simulations', 0.019), ('determining', 0.019), ('drawn', 0.019), ('modelled', 0.018), ('culty', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
Author: Maneesh Sahani
Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1
2 0.19497621 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
3 0.1754826 128 nips-2003-Minimax Embeddings
Author: Matthew Brand
Abstract: Spectral methods for nonlinear dimensionality reduction (NLDR) impose a neighborhood graph on point data and compute eigenfunctions of a quadratic form generated from the graph. We introduce a more general and more robust formulation of NLDR based on the singular value decomposition (SVD). In this framework, most spectral NLDR principles can be recovered by taking a subset of the constraints in a quadratic form built from local nullspaces on the manifold. The minimax formulation also opens up an interesting class of methods in which the graph is “decorated” with information at the vertices, offering discrete or continuous maps, reduced computational complexity, and immunity to some solution instabilities of eigenfunction approaches. Apropos, we show almost all NLDR methods based on eigenvalue decompositions (EVD) have a solution instability that increases faster than problem size. This pathology can be observed (and corrected via the minimax formulation) in problems as small as N < 100 points. 1 Nonlinear dimensionality reduction (NLDR) . Spectral NLDR methods are graph embedding problems where a set of N points X = [x1 , · · · , xN ] ∈ RD×N sampled from a low-dimensional manifold in a ambient space RD is reparameterized by imposing a neighborhood graph G on X and embedding the graph with minimal distortion in a “parameterization” space Rd , d < D. Typically the graph is sparse and local, with edges connecting points to their immediate neighbors. The embedding must keep these edges short or preserve their length (for isometry) or angles (for conformality). The graph-embedding problem was first introduced as a least-squares problem by Tutte [1], and as an eigenvalue problem by Fiedler [2]. The use of sparse graphs to generate metrics for least-squares problems has been studied intensely in the following three decades (see [3]). Modern NLDR methods use graph constraints to generate a metric in a space of embeddings RN . Eigenvalue decomposition (EVD) gives the directions of least or greatest variance under this metric. Typically a subset of d extremal eigenvectors gives the embedding of N points in Rd parameterization space. This includes the IsoMap family [4], the locally linear embedding (LLE) family [5,6], and Laplacian methods [7,8]. Using similar methods, the Automatic Alignment [6] and Charting [9] algorithms embed local subspaces instead of points, and by combining subspace projections thus obtain continuous maps between RD and Rd . This paper introduces a general algebraic framework for computing optimal embeddings directly from graph constraints. The aforementioned methods can can be recovered as special cases. The framework also suggests some new methods with very attractive properties, including continuous maps, reduced computational complexity, and control over the degree of conformality/isometry in the desired map. It also eliminates a solution instability that is intrinsic to EVD-based approaches. A perturbational analysis quantifies the instability. 2 Minimax theorem for graph embeddings We begin with neighborhood graph specified by a nondiagonal weighted adjacency matrix M ∈ RN×N that has the data-reproducing property XM = X (this can be relaxed to XM ≈ X in practice). The graph-embedding and NLDR literatures offer various constructions of M, each appropriate to different sets of assumptions about the original embedding and its sampling X (e.g., isometry, local linearity, noiseless samples, regular sampling, etc.). Typically Mi j = 0 if points i, j are nearby on the intrinsic manifold and |Mi j | is small or zero otherwise. Each point is taken to be a linear or convex combination of its neighbors, and thus M specifies manifold connectivity in the sense that any nondegenerate embedding Y that satisfies YM ≈ Y with small residual YM − Y F will preserve this connectivity and the structure of local neighborhoods. For example, in barycentric embeddings, each point is the average of its neighbors and thus Mi j = 1/k if vertex i is connected to vertex j (of degree k). We will also consider three optional constraints on the embedding : 1. A null-space restriction, where the solution must be outside to the column-space of C ∈ RN×M , M < N. For example, it is common to stipulate that the solution Y be centered, i.e., YC = 0 for C = 1, the constant vector. 2. A basis restriction, where the solution must be a linear combination of the rows of basis Z ∈ RK×N , K ≤ N. This can be thought of as information placed at the vertices of the graph that serves as example inputs for a target NLDR function. We will use this to construct dimension-reducing radial basis function networks. 3. A metric Σ ∈ RN×N that determines how error is distributed over the points. For example, it might be important that boundary points have less error. We assume that Σ is symmetric positive definite and has factorization Σ = AA (e.g., A could be a Cholesky factor of Σ). In most settings, the optional matrices will default to the identity matrix. In this context, we define the per-dimension embedding error of row-vector yi ∈ rows(Y) to be . EM (yi ) = max yi ∈range(Z),, K∈RM×N (yi (M + CD) − yi )A yi A (1) where D is a matrix constructed by an adversary to maximize the error. The optimizing yi is a vector inside the subspace spanned by the rows of Z and outside the subspace spanned by the columns of C, for which the reconstruction residual yi M−yi has smallest norm w.r.t. the metric Σ. The following theorem identifies the optimal embedding Y for any choice of M, Z, C, Σ: Minimax solution: Let Q ∈ SK×P be a column-orthonormal basis of the null-space of the rows of ZC, with P = K − rank(C). Let B ∈ RP×P be a square factor satisfying B B = Q ZΣZ Q, e.g., a Cholesky factor (or the “R” factor in QR-decomposition of (Q ZA) ). Compute the left singular vectors U ∈ SN×N of Udiag(s)V = B− Q Z(I − M)A, with . singular values s = [s1 , · · · , sP ] ordered s1 ≤ s2 ≤ · · · ≤ s p . Using the leading columns U1:d of U, set Y = U1:d B− Q Z. Theorem 1. Y is the optimal (minimax) embedding in Rd with error [s1 , · · · , sd ] 2 : . Y = U1:d B− Q Z = arg min ∑ EM (yi )2 with EM (yi ) = si . Y∈Rd×N y ∈rows(Y) i (2) Appendix A develops the proof and other error measures that are minimized. Local NLDR techniques are easily expressed in this framework. When Z = A = I, C = [], and M reproduces X through linear combinations with M 1 = 1, we recover LLE [5]. When Z = I, C = [], I − M is the normalized graph Laplacian, and A is a diagonal matrix of vertex degrees, we recover Laplacian eigenmaps [7]. When further Z = X we recover locally preserving projections [8]. 3 Analysis and generalization of charting The minimax construction of charting [9] takes some development, but offers an interesting insight into the above-mentioned methods. Recall that charting first solves for a set of local affine subspace axes S1 ∈ RD×d , S2 , · · · at offsets µ1 ∈ RD , µ2 , · · · that best cover the data and vary smoothly over the manifold. Each subspace offers a chart—a local parameterization of the data by projection onto the local axes. Charting then constructs a weighted mixture of affine projections that merges the charts into a global parameterization. If the data manifold is curved, each projection will assign a point a slightly different embedding, so the error is measured as the variance of these proposed embeddings about their mean. This maximizes consistency and tends to produce isometric embeddings; [9] discusses ways to explicitly optimize the isometry of the embedding. Under the assumption of isometry, the charting error is equivalent to the sumsquared displacements of an embedded point relative to its immediate neighbors (summed over all neighborhoods). To construct the same error criteria in the minimax setting, let xi−k , · · · , xi , · · · , xi+k denote points in the ith neighborhood and let the columns of Vi ∈ R(2k+1)×d be an orthonormal basis of rows of the local parameterization Si [xi−k , · · · , xi , · · · , xi+k ]. Then a nonzero reparameterization will satisfy [yi−k , · · · , yi , · · · , yi+k ]Vi Vi = [yi−k , · · · , yi , · · · , yi+k ] if and only if it preserves the relative position of the points in the local parameterization. Conversely, any relative displacements of the points are isolated by the formula [yi−k , · · · , yi , · · · , yi+k ](I − Vi Vi ). Minimizing the Frobenius norm of this expression is thus equivalent to minimizing the local error in charting. We sum these constraints over all neighborhoods to obtain the constraint matrix M = I − ∑i Fi (I − Vi Vi )Fi , where (Fi )k j = 1 iff the jth point of the ith neighborhood is the kth point of the dataset. Because Vi Vi and (I − Vi Vi ) are complementary, it follows that the error criterion of any local NLDR method (e.g., LLE, Laplacian eigenmaps, etc.) must measure the projection of the embedding onto some subspace of (I − Vi Vi ). To construct a continuous map, charting uses an overcomplete radial basis function (RBF) representation Z = [z(x1 ), z(x2 ), · · · z(xN )], where z(x) is a vector that stacks z1 (x), z2 (x), etc., and pm (x) . Km (x − µm ) , zm (x) = 1 ∑m pm (x) −1 . pm (x) = N (x|µm , Σm ) ∝ e−(x−µm ) Σm (x−µm )/2 (3) (4) and Km is any local linear dimensionality reducer, typically Sm itself. Each column of Z contains many “views” of the same point that are combined to give its low-dimensional embedding. Finally, we set C = 1, which forces the embedding of the full data to be centered. Applying the minimax solution to these constraints yields the RBF network mixing ma. trix, f (x) = U1:d B− Q z(x). Theorem 1 guarantees that the resulting embedding is leastsquares optimal w.r.t. Z, M, C, A at the datapoints f (xi ), and because f (·) is an affine transform of z(·) it smoothly interpolates the embedding between points. There are some interesting variants: Kernel embeddings of the twisted swiss roll generalized EVD minimax SVD UR corner detail LL corner detail Fig. 1. Minimax and generalized EVD solution for kernel eigenmap of a non-developable swiss roll. Points are connected into a grid which ideally should be regular. The EVD solution shows substantial degradation. Insets detail corners where the EVD solution crosses itself repeatedly. The border compression is characteristic of Laplacian constraints. One-shot charting: If we set the local dimensionality reducers to the identity matrix (all Km = I), then the minimax method jointly optimizes the local dimensionality reduction to charts and the global coordination of the charts (under any choice of M). This requires that rows(Z) ≤ N for a fully determined solution. Discrete isometric charting: If Z = I then we directly obtain a discrete isometric embedding of the data, rather than a continuous map, making this a local equivalent of IsoMap. Reduced basis charting: Let Z be constructed using just a small number of kernels randomly placed on the data manifold, such that rows(Z) N. Then the size of the SVD problem is substantially reduced. 4 Numerical advantage of minimax method Note that the minimax method projects the constraint matrix M into a subspace derived from C and Z and decomposes it there. This suppresses unwanted degrees of freedom (DOFs) admitted by the problem constraints, for example the trivial R0 embedding where all points are mapped to a single point yi = N −1/2 . The R0 embedding serves as a translational DOF in the solution. LLE- and eigenmap-based methods construct M to have a constant null-space so that the translational DOF will be isolated in the EVD as null eigenvalue paired to a constant eigenvector, which is then discarded. However, section 4.1 shows that this construction makes the EVD increasingly unstable as problem size grows and/or the data becomes increasing amenable to low-residual embeddings, ultimately causing solution collapse. As the next paragraph demonstrates, the problem is exacerbated when embedding w.r.t. a basis Z (via the equivalent generalized eigenproblem), partly because the eigenvector associated with the unwanted DOF can have arbitrary structure. In all cases the problem can be averted by using the minimax formulation with C = 1 to suppress the DOF. A 2D plane was embedded in 3D with a curl, a twist, and 2.5% Gaussian noise, then regularly sampled at 900 points. We computed a kernelized Laplacian eigenmap using 70 random points as RBF centers, i.e., a continous map using M derived from the graph Laplacian and Z constructed as above. The map was computed both via the minimax (SVD) method and via the equivalent generalized eigenproblem, where the translational degree of freedom must be removed by discarding an eigenvector from the solution. The two solutions are algebraically equivalent in every other regard. A variety of eigensolvers were tried; we took −5 excess energy x 10 Eigen spectrum compared to minimax spectrum 15 10 5 0 −5 Eigen spectrum compared to minimax spectrum 2 15 deviation excess energy x 10 10 5 100 200 eigenvalue Error in null embedding −5 x 10 0 −2 −4 −6 −8 0 100 −5 eigenvalue Error in null embedding 200 100 200 300 400 500 point 600 700 800 900 Fig. 2. Excess energy in the eigenspectrum indicates that the translational DOF has contam2 inated many eigenvectors. If the EVD had successfully isolated the unwanted DOF, then its 0 remaining eigenvalues should be identical to those derived from the minimax solution. The −2 −4 graph at left shows the difference in the eigenspectra. The graph at right shows the EVD −6 solution’s deviation from the translational vector y0 = 1 · N −1/2 ≈ .03333. If the numer−8 ics were100 200 the line would be flat, but in practice the deviation is significant enough perfect 300 400 500 600 700 800 900 point (roughly 1% of the diameter of the embedding) to noticably perturb points in figure 1. deviation x 10 the best result. Figure 1 shows that the EVD solution exhibits many defects, particularly a folding-over of the manifold at the top and bottom edges and at the corners. Figure 2 shows that the noisiness of the EVD solution is due largely to mutual contamination of numerically unstable eigenvectors. 4.1 Numerical instability of eigen-methods The following theorem uses tools of matrix perturbation theory to show that as the problem size increases, the desired and unwanted eigenvectors become increasingly wobbly and gradually contaminate each other, leading to degraded solutions. More precisely, the low-order eigenvalues are ill-conditioned and exhibit multiplicities that may be true (due to noiseless samples from low-curvature manifolds) or false (due to numerical noise). Although in many cases some post-hoc algebra can “filter” the unwanted components out of the contaminated eigensolution, it is not hard to construct cases where the eigenvectors cannot be cleanly separated. The minimax formulation is immune to this problem because it explicitly suppresses the gratuitous component(s) before matrix decomposition. Theorem 2. For any finite numerical precision, as the number of points N increases, the Frobenius norm of numerical noise in the null eigenvector v0 can grow as O(N 3/2 ), and the eigenvalue problem can approach a false multiplicity at a rate as fast as O(N 3/2 ), at which point the eigenvectors of interest—embedding and translational—are mutually contaminated and/or have an indeterminate eigenvalue ordering. Please see appendix B for the proof. This theorem essentially lower-bounds an upperbound on error; examples can be constructed in which the problem is worse. For example, it can be shown analytically that when embedding points drawn from the simple curve xi = [a, cos πa] , a ∈ [0, 1] with K = 2 neighbors, instabilities cannot be bounded better than O(N 5/2 ); empirically we see eigenvector mixing with N < 100 points and we see it grow at the rate ≈ O(N 4 )—in many different eigensolvers. At very large scales, more pernicious instabilities set in. E.g., by N = 20000 points, the solution begins to fold over. Although algebraic multiplicity and instability of the eigenproblem is conceptually a minor oversight in the algorithmic realizations of eigenfunction embeddings, as theorem 2 shows, the consequences are eventually fatal. 5 Summary One of the most appealing aspects of the spectral NLDR literature is that algorithms are usually motivated from analyses of linear operators on smooth differentiable manifolds, e.g., [7]. Understandably, these analysis rely on assumptions (e.g., smoothness or isometry or noiseless sampling) that make it difficult to predict what algorithmic realizations will do when real, noisy data violates these assumptions. The minimax embedding theorem provides a complete algebraic characterization of this discrete NLDR problem, and provides a solution that recovers numerically robustified versions of almost all known algorithms. It offers a principled way of constructing new algorithms with clear optimality properties and good numerical conditioning—notably the construction of a continuous NLDR map (an RBF network) in a one-shot optimization ( SVD ). We have also shown how to cast several local NLDR principles in this framework, and upgrade these methods to give continuous maps. Working in the opposite direction, we sketched the minimax formulation of isometric charting and showed that its constraint matrix contains a superset of all the algebraic constraints used in local NLDR techniques. References 1. W.T. Tutte. How to draw a graph. Proc. London Mathematical Society, 13:743–768, 1963. 2. Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czech. Math. Journal, 25:619–633, 1975. 3. Fan R.K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997. 4. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22 2000. 5. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000. 6. Yee Whye Teh and Sam T. Roweis. Automatic alignment of hidden representations. In Proc. NIPS-15, 2003. 7. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. volume 14 of Advances in Neural Information Processing Systems, 2002. 8. Xiafei He and Partha Niyogi. Locality preserving projections. Technical Report TR-2002-09, University of Chicago Computer Science, October 2002. 9. Matthew Brand. Charting a manifold. volume 15 of Advances in Neural Information Processing Systems, 2003. 10. G.W. Stewart and Ji-Guang Sun. Matrix perturbation theory. Academic Press, 1990. A Proof of minimax embedding theorem (1) The burden of this proof is carried by supporting lemmas, below. To emphasize the proof strategy, we give the proof first; supporting lemmas follow. Proof. Setting yi = li Z, we will solve for li ∈ columns(L). Writing the error in terms of li , EM (li ) = max K∈RM×N li Z(I − M − CK)A li Z(I − M)A − li ZCKA = max . M×N li ZA li ZA K∈R (5) The term li ZCKA produces infinite error unless li ZC = 0, so we accept this as a constraint and seek li Z(I − M)A min . (6) li ZA li ZC=0 By lemma 1, that orthogonality is satisfied by solving the problem in the space orthogonal . to ZC; the basis for this space is given by columns of Q = null((ZC) ). By lemma 2, the denominator of the error specifies the metric in solution space to be ZAA Z ; when the problem is projected into the space orthogonal to ZC it becomes Q (ZAA Z )Q. Nesting the “orthogonally-constrained-SVD” construction of lemma 1 inside the “SVD-under-a-metric” lemma 2, we obtain a solution that uses the correct metric in the orthogonal space: B B = Q ZAA Z Q − Udiag(s)V = B (7) {Q(Z(I − M)A)} (8) L = QB−1 U (9) where braces indicate the nesting of lemmas. By the “best-projection” lemma (#3), if we order the singular values by ascending magnitude, L1:d = arg min J∈RN×d ∑ji ∈cols(J) ( j Z(I − M)A / j )2 ZΣZ The proof is completed by making the substitutions L Z → Y and x A → x Σ = AA ), and leaving off the final square root operation to obtain (Y )1:d = arg min ∑ji ∈cols(J) j (I − M) Σ / j J∈RN×d (10) Σ (for 2 Σ . (11) Lemma 1. Orthogonally constrained SVD: The left singular vectors L of matrix M under . SVD the constraint U C = 0 are calculated as Q = null(C ), Udiag(s)V ← Q M, L = QU. Proof. First observe that L is orthogonal to C: By definition, the null-space basis satisfies Q C = 0, thus L C = U Q C = 0. Let J be an orthonormal basis for C, with J J = I and Q J = 0. Then Ldiag(s)V = QQ M = (I − JJ )M, the orthogonal projector of C applied to M, proving that the SVD captures the component of M that is orthogonal to C. Lemma 2. SVD with respect to a metric: The vectors li ∈ L, vi ∈ V that diagonalize matrix M with respect to positive definite column-space metric Σ are calculated as B B ← Σ, SVD . Udiag(s)V ← B− M, L = B−1 U satisfy li M / li Σ = si and extremize this form for the extremal singular values smin , smax . Proof. By construction, L and V diagonalize M: L MV = (B−1 U) MV = U (B− M)V = diag(s) (12) B− and diag(s)V = M. Forming the gram matrices of both sides of the last line, we obtain the identity Vdiag(s)2 V = M B−1 B− M = M Σ−1 M, which demonstrates that si ∈ s are the singular values of M w.r.t. column-space metric Σ. Finally, L is orthonormal w.r.t. the metric Σ, because L 2 = L ΣL = U B− B BB−1 U = I. Consequently, Σ l M / l Σ = l M /1 = si vi and by the Courant-Hilbert theorem, smax = max l M / l Σ ; = si . smin = min l M / l Σ . l l (13) (14) Lemma 3. Best projection: Taking L and s from lemma 2, let the columns of L and elements of s be sorted so that s1 ≥ s2 ≥ · · · ≥ sN . Then for any dimensionality 1 ≤ d ≤ N, . L1:d = [l1 , · · · , ld ] = arg max J M (J ΣJ)−1 (15) J∈RN×d = arg max F (16) ∑ji ∈cols(J) ( j M / j Σ )2 (17) J∈RN×d |J ΣJ=I = arg max J∈RN×d J M with the optimum value of all right hand sides being (∑d s2 )1/2 . If the sort order is rei=1 i versed, the minimum of this form is obtained. Proof. By the Eckart-Young-Mirsky theorem, if U MV = diag(s) with singular values . sorted in descending order, then U1:d = [u1 , · · · , ud ] = arg maxU∈SN×d U M F . We first extend this to a non-orthonogonal basis J under a Mahalonobis norm: maxJ∈RN×d J M (J J)−1 = maxU∈SN×d U M F (18) because J M 2 (J J)−1 = trace(M J(J J)−1 J M) = trace(M JJ+ (JJ+ ) M) = (JJ+ )M 2 = UU M 2 = U M 2 since JJ+ is a (symmetric) orthogonal proF F F jector having binary eigenvalues λ ∈ {0, 1} and therefore it is the gram of an thin orthogonal matrix. We then impose a metric Σ on the column-space of J to obtain the first criterion (equation 15), which asks what maximizes variance in J M while minimizing the norm of J w.r.t. metric Σ. Here it suffices to substitute in the leading (resp., trailing) columns of L and verify that the norm is maximized (resp., minimized). Expanding, L1:d M 2 ΣL )−1 = trace((L1:d M) (L1:d ΣL1:d )−1 (L1:d M)) = (L 1:d 1:d trace((L1:d M) I(L1:d M)) = trace((diag(s1:d )V1:d ) (diag(s1:d )V1:d )) = s1:d 2 . Again, by the Eckart-Young-Mirsky theorem, these are the maximal variance-preserving projections, so the first criterion is indeed maximized by setting J to the columns in L corresponding to the largest values in s. Criterion #2 restates the first criterion with the set of candidates for J restricted to (the hyperelliptical manifold of) matrices that reduce the metric on the norm to the identity matrix (thereby recovering the Frobenius norm). Criterion #3 criterion merely expands the above trace by individual singular values. Note that the numerator and denominator can have different metrics because they are norms in different spaces, possibly of different dimension. Finally, that the trailing d eigenvectors minimize these criteria follows directly from the fact that leading N − d singular values account for the maximal part of the variance. B Proof of instability theorem (2) Proof. When generated from a sparse graph with average degree K, weighted connectivity matrix W is sparse and has O(NK) entries. Since the graph vertices represent samples from a smooth manifold, increasing the sampling density N does not change the distribution of magnitudes in W. Consider a perturbation of the nonzero values in W, e.g., W → W + E due to numerical noise E created by finite machine precision. By the weak law of large √ numbers, the Frobenius norm of the sparse perturbation grows as E F ∼ O( N). However the t th -smallest nonzero eigenvalue λt (W) grows as λt (W) = vt Wvt ∼ O(N −1 ), because elements of corresponding eigenvector vt grow as O(N −1/2 ) and only K of those elements are multiplied by nonzero values to form each element of Wvt . In sum, the perturbation E F grows while the eigenvalue λt (W) shrinks. In linear embedding algorithms, . the eigengap of interest is λgap = λ1 − λ0 . The tail eigenvalue λ0 = 0 by construction but it is possible that λ0 > 0 with numerical error, thus λgap ≤ λ1 . Combining these facts, the ratio between the perturbation and the eigengap grows as E F /λgap ∼ O(N 3/2 ) or faster. Now consider the shifted eigenproblem I − W with leading (maximal) eigenvalues 1 − λ0 ≥ 1 − λ1 ≥ · · · and unchanged eigenvectors. From matrix perturbation the. ory [10, thm. V.2.8], when W is perturbed to W = W + E, the change in the lead√ ing eigenvalue from 1 − λ0 to 1 − λ0 is bounded as |λ0 − λ0 | ≤ 2 E F and similarly √ √ 1 − λ1 ≤ 1 − λ1 + 2 E F . Thus λgap ≥ λgap − 2 E F . Since E F /λgap ∼ O(N 3/2 ), the right hand side of the gap bound goes negative at a supralinear rate, implying that the eigenvalue ordering eventually becomes unstable with the possibility of the first and second eigenvalue/vector pairs being swapped. Mutual contamination of the eigenvectors happens well before: Under general (dense) conditions, the change in the eigenvector v0 is bounded E F as v0 − v0 ≤ |λ −λ4 |−√2 E [10, thm. V.2.8]. (This bound is often tight enough to serve F 0 1 as a good approximation.) Specializing this to the sparse embedding matrix, we find that √ √ O( N) O( N) the bound weakens to v0 − 1 · N −1/2 ∼ O(N −1 )−O(√N) > O(N −1 ) = O(N 3/2 ).
4 0.16770193 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
Author: D. Philipona, J.k. O'regan, J.-p. Nadal, Olivier Coenen
Abstract: Is there a way for an algorithm linked to an unknown body to infer by itself information about this body and the world it is in? Taking the case of space for example, is there a way for this algorithm to realize that its body is in a three dimensional world? Is it possible for this algorithm to discover how to move in a straight line? And more basically: do these questions make any sense at all given that the algorithm only has access to the very high-dimensional data consisting of its sensory inputs and motor outputs? We demonstrate in this article how these questions can be given a positive answer. We show that it is possible to make an algorithm that, by analyzing the law that links its motor outputs to its sensory inputs, discovers information about the structure of the world regardless of the devices constituting the body it is linked to. We present results from simulations demonstrating a way to issue motor orders resulting in “fundamental” movements of the body as regards the structure of the physical world. 1
5 0.15941122 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
Author: Alexander T. Ihler, Erik B. Sudderth, William T. Freeman, Alan S. Willsky
Abstract: The problem of approximating the product of several Gaussian mixture distributions arises in a number of contexts, including the nonparametric belief propagation (NBP) inference algorithm and the training of product of experts models. This paper develops two multiscale algorithms for sampling from a product of Gaussian mixtures, and compares their performance to existing methods. The first is a multiscale variant of previously proposed Monte Carlo techniques, with comparable theoretical guarantees but improved empirical convergence rates. The second makes use of approximate kernel density evaluation methods to construct a fast approximate sampler, which is guaranteed to sample points to within a tunable parameter of their true probability. We compare both multiscale samplers on a set of computational examples motivated by NBP, demonstrating significant improvements over existing methods. 1
6 0.11892293 130 nips-2003-Model Uncertainty in Classical Conditioning
7 0.10650364 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
8 0.092405766 68 nips-2003-Eye Movements for Reward Maximization
9 0.090661354 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
10 0.083407931 100 nips-2003-Laplace Propagation
11 0.082020976 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
12 0.081754953 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
13 0.078701854 99 nips-2003-Kernels for Structured Natural Language Data
14 0.076394193 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
15 0.070245616 129 nips-2003-Minimising Contrastive Divergence in Noisy, Mixed-mode VLSI Neurons
16 0.06883692 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
17 0.068723574 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
18 0.065939143 78 nips-2003-Gaussian Processes in Reinforcement Learning
19 0.062681362 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
20 0.061754692 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
topicId topicWeight
[(0, -0.202), (1, 0.09), (2, 0.048), (3, 0.064), (4, 0.049), (5, 0.072), (6, 0.129), (7, -0.104), (8, -0.062), (9, -0.207), (10, 0.012), (11, -0.173), (12, 0.052), (13, -0.132), (14, 0.16), (15, 0.119), (16, -0.103), (17, 0.057), (18, 0.019), (19, 0.08), (20, -0.013), (21, -0.132), (22, 0.156), (23, 0.192), (24, 0.064), (25, -0.055), (26, 0.122), (27, -0.045), (28, 0.001), (29, -0.048), (30, 0.024), (31, 0.099), (32, 0.045), (33, -0.303), (34, -0.086), (35, -0.017), (36, -0.095), (37, -0.004), (38, -0.055), (39, -0.049), (40, -0.001), (41, 0.162), (42, -0.062), (43, -0.0), (44, 0.077), (45, 0.138), (46, -0.077), (47, 0.104), (48, 0.155), (49, 0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.98835337 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
Author: Maneesh Sahani
Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1
2 0.51416689 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
Author: D. Philipona, J.k. O'regan, J.-p. Nadal, Olivier Coenen
Abstract: Is there a way for an algorithm linked to an unknown body to infer by itself information about this body and the world it is in? Taking the case of space for example, is there a way for this algorithm to realize that its body is in a three dimensional world? Is it possible for this algorithm to discover how to move in a straight line? And more basically: do these questions make any sense at all given that the algorithm only has access to the very high-dimensional data consisting of its sensory inputs and motor outputs? We demonstrate in this article how these questions can be given a positive answer. We show that it is possible to make an algorithm that, by analyzing the law that links its motor outputs to its sensory inputs, discovers information about the structure of the world regardless of the devices constituting the body it is linked to. We present results from simulations demonstrating a way to issue motor orders resulting in “fundamental” movements of the body as regards the structure of the physical world. 1
3 0.48606667 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
Author: Alexander T. Ihler, Erik B. Sudderth, William T. Freeman, Alan S. Willsky
Abstract: The problem of approximating the product of several Gaussian mixture distributions arises in a number of contexts, including the nonparametric belief propagation (NBP) inference algorithm and the training of product of experts models. This paper develops two multiscale algorithms for sampling from a product of Gaussian mixtures, and compares their performance to existing methods. The first is a multiscale variant of previously proposed Monte Carlo techniques, with comparable theoretical guarantees but improved empirical convergence rates. The second makes use of approximate kernel density evaluation methods to construct a fast approximate sampler, which is guaranteed to sample points to within a tunable parameter of their true probability. We compare both multiscale samplers on a set of computational examples motivated by NBP, demonstrating significant improvements over existing methods. 1
4 0.48518622 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
5 0.44691116 130 nips-2003-Model Uncertainty in Classical Conditioning
Author: Aaron C. Courville, Geoffrey J. Gordon, David S. Touretzky, Nathaniel D. Daw
Abstract: We develop a framework based on Bayesian model averaging to explain how animals cope with uncertainty about contingencies in classical conditioning experiments. Traditional accounts of conditioning fit parameters within a fixed generative model of reinforcer delivery; uncertainty over the model structure is not considered. We apply the theory to explain the puzzling relationship between second-order conditioning and conditioned inhibition, two similar conditioning regimes that nonetheless result in strongly divergent behavioral outcomes. According to the theory, second-order conditioning results when limited experience leads animals to prefer a simpler world model that produces spurious correlations; conditioned inhibition results when a more complex model is justified by additional experience. 1
6 0.43440771 128 nips-2003-Minimax Embeddings
7 0.37198633 129 nips-2003-Minimising Contrastive Divergence in Noisy, Mixed-mode VLSI Neurons
8 0.36682376 68 nips-2003-Eye Movements for Reward Maximization
9 0.35900265 175 nips-2003-Sensory Modality Segregation
10 0.35524714 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
11 0.34071556 99 nips-2003-Kernels for Structured Natural Language Data
12 0.2997936 100 nips-2003-Laplace Propagation
13 0.27307028 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
14 0.26220185 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
15 0.25120792 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis
16 0.24083012 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
17 0.2361729 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
18 0.22858788 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
19 0.22803842 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
20 0.22522911 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
topicId topicWeight
[(0, 0.037), (11, 0.017), (29, 0.046), (30, 0.025), (35, 0.044), (53, 0.107), (55, 0.161), (69, 0.018), (71, 0.058), (76, 0.053), (82, 0.01), (85, 0.092), (91, 0.206), (99, 0.025)]
simIndex simValue paperId paperTitle
1 0.94711244 8 nips-2003-A Holistic Approach to Compositional Semantics: a connectionist model and robot experiments
Author: Yuuya Sugita, Jun Tani
Abstract: We present a novel connectionist model for acquiring the semantics of a simple language through the behavioral experiences of a real robot. We focus on the “compositionality” of semantics, a fundamental characteristic of human language, which is the ability to understand the meaning of a sentence as a combination of the meanings of words. We also pay much attention to the “embodiment” of a robot, which means that the robot should acquire semantics which matches its body, or sensory-motor system. The essential claim is that an embodied compositional semantic representation can be self-organized from generalized correspondences between sentences and behavioral patterns. This claim is examined and confirmed through simple experiments in which a robot generates corresponding behaviors from unlearned sentences by analogy with the correspondences between learned sentences and behaviors. 1
same-paper 2 0.92987901 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
Author: Maneesh Sahani
Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1
3 0.92814803 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
Author: Eiji Mizutani, James Demmel
Abstract: The online incremental gradient (or backpropagation) algorithm is widely considered to be the fastest method for solving large-scale neural-network (NN) learning problems. In contrast, we show that an appropriately implemented iterative batch-mode (or block-mode) learning method can be much faster. For example, it is three times faster in the UCI letter classification problem (26 outputs, 16,000 data items, 6,066 parameters with a two-hidden-layer multilayer perceptron) and 353 times faster in a nonlinear regression problem arising in color recipe prediction (10 outputs, 1,000 data items, 2,210 parameters with a neuro-fuzzy modular network). The three principal innovative ingredients in our algorithm are the following: First, we use scaled trust-region regularization with inner-outer iteration to solve the associated “overdetermined” nonlinear least squares problem, where the inner iteration performs a truncated (or inexact) Newton method. Second, we employ Pearlmutter’s implicit sparse Hessian matrix-vector multiply algorithm to construct the Krylov subspaces used to solve for the truncated Newton update. Third, we exploit sparsity (for preconditioning) in the matrices resulting from the NNs having many outputs. 1
4 0.82120347 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
Author: Gerald Tesauro
Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1
5 0.81573343 46 nips-2003-Clustering with the Connectivity Kernel
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1
6 0.81083721 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
8 0.80608046 68 nips-2003-Eye Movements for Reward Maximization
9 0.79904807 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
10 0.79755306 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
11 0.79453057 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time
12 0.79384077 107 nips-2003-Learning Spectral Clustering
13 0.79107279 78 nips-2003-Gaussian Processes in Reinforcement Learning
14 0.78739274 73 nips-2003-Feature Selection in Clustering Problems
15 0.7872501 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
16 0.78673208 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
17 0.78656018 87 nips-2003-Identifying Structure across Pre-partitioned Data
18 0.78626472 55 nips-2003-Distributed Optimization in Adaptive Networks
19 0.78611684 30 nips-2003-Approximability of Probability Distributions
20 0.78269136 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering