nips nips2002 nips2002-179 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: J. L. Shapiro
Abstract: Population-based Incremental Learning is shown to require very sensitive scaling of its learning rate. The learning rate must scale with the system size in a problem-dependent way. This is shown in two problems: the needle-in-a-haystack, in which the learning rate must vanish exponentially in the system size, and in a smooth function in which the learning rate must vanish like the inverse square root of the system size. Two methods are proposed for removing this sensitivity. A learning dynamics which obeys detailed balance is shown to give consistent performance over the entire range of learning rates. An analog of mutation is shown to require a learning rate which scales as the inverse system size, but is problem independent. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The learning rate must scale with the system size in a problem-dependent way. [sent-9, score-0.114]
2 This is shown in two problems: the needle-in-a-haystack, in which the learning rate must vanish exponentially in the system size, and in a smooth function in which the learning rate must vanish like the inverse square root of the system size. [sent-10, score-0.47]
3 A learning dynamics which obeys detailed balance is shown to give consistent performance over the entire range of learning rates. [sent-12, score-0.42]
4 An analog of mutation is shown to require a learning rate which scales as the inverse system size, but is problem independent. [sent-13, score-0.258]
5 Usually, the probability model is a parameterized graphical model , and updating the model involves changing the parameters and possibly the structure of the model. [sent-17, score-0.093]
6 EDAs are related to genetic algorithms; instead of evolving a population, a generative model which produces the population at each generation is evolved. [sent-25, score-0.211]
7 A motivation for using EDAs instead of GAs is that in EDAs the structure of the graphical model corresponds to the form of the crossover operator in GAs (in the sense that a given graph will produce data whose probability will not change much under a particular crossover operator). [sent-26, score-0.268]
8 If the EDA can learn the structure of the graph, it removes the need to set the crossover operator by hand (but see [2] for evidence against this). [sent-27, score-0.144]
9 The learning rate must vanish with the system size in a problem dependent way, and for some problems it has to vanish exponentially fast. [sent-30, score-0.302]
10 Two corrective measures are considered: a new learning rule which obeys detailed balance in the space of parameters, and an operator analogous to mutation which has been proposed previously. [sent-31, score-0.72]
11 The probability model is defined by the L-component vector of parameters γ(t), where γ_i(t) denotes the probability that x_i = 1 at time t. [sent-35, score-0.118]
12 The algorithm works as follows: • Initialize γ_i(0) = 1/2 for all i; • Repeat - Generate a population of N strings by sampling from the binomial distribution defined by γ(t). [sent-36, score-0.343]
13 • until (stopping criterion met) The algorithm has only two parameters, the size of the population N and the learning parameter α. [sent-39, score-0.151]
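To make the procedure above concrete, here is a minimal Python sketch of PBIL as described. The update rule γ ← (1 − α)γ + α·x_best is the standard PBIL rule (the extract omits the update step itself), and the fitness function, population size, learning rate and convergence threshold below are illustrative placeholders rather than the paper's settings.

```python
import numpy as np

def pbil(fitness, L, N=20, alpha=0.05, max_iters=10_000, tol=0.05, rng=None):
    """Minimal PBIL sketch; gamma holds the per-bit probabilities of a 1.

    The update gamma <- (1 - alpha) * gamma + alpha * best is the standard
    PBIL rule; fitness, N, alpha, tol and max_iters are illustrative choices.
    """
    rng = np.random.default_rng(rng)
    gamma = np.full(L, 0.5)                               # initialize gamma_i(0) = 1/2
    best = None
    for _ in range(max_iters):
        pop = (rng.random((N, L)) < gamma).astype(int)    # sample N strings from gamma(t)
        best = pop[np.argmax([fitness(x) for x in pop])]  # best population member
        gamma = (1 - alpha) * gamma + alpha * best        # move gamma towards it
        if np.all((gamma < tol) | (gamma > 1 - tol)):     # stop when all parameters are near 0 or 1
            break
    return gamma, best

# Example: needle-in-a-haystack fitness, with the needle at the all-ones string.
needle = lambda x: float(np.all(x == 1))
```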
14 1 The sensitivity of PBIL to the learning rate. PBIL on a flat landscape. The source of sensitivity of PBIL to the learning rate lies in its behavior on a flat landscape. [sent-41, score-0.528]
15 As the parameters deviate from 1/2 they will move towards a corner of the hypercube. [sent-44, score-0.262]
16 Then the population generated will be biased towards that corner, which will move the parameters closer yet to that corner, etc. [sent-45, score-0.328]
17 All of the corners of the hypercube are attractors which, although never reached, are increasingly attractive with increasing proximity. [sent-46, score-0.132]
18 (In population genetics, the term drift refers to the loss of genetic diversity due to finite population sampling.) [sent-48, score-0.445]
19 The rate of search on any other search space will have to compete with this drift. [sent-51, score-0.155]
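The drift just described is easy to reproduce; the short sketch below (assuming the same standard PBIL update as above, with a constant fitness so every string is equally good) shows the parameters wandering from 1/2 towards a random hypercube corner.

```python
import numpy as np

def drift_demo(L=32, N=20, alpha=0.05, steps=2000, seed=0):
    """Run PBIL on a flat landscape (constant fitness) and report the mean
    distance of the parameters from 1/2, i.e. the drift quantity D(t)."""
    rng = np.random.default_rng(seed)
    gamma = np.full(L, 0.5)
    for _ in range(steps):
        pop = (rng.random((N, L)) < gamma).astype(int)
        best = pop[0]                                   # every string is equally fit
        gamma = (1 - alpha) * gamma + alpha * best
    return np.mean(np.abs(gamma - 0.5))

print(drift_demo())   # close to 0.5: the model has drifted to a random corner
```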
20 2 PBIL and the needle-in-the-haystack problem. As a simple example of the interplay between drift and directed search, consider the so-called needle-in-a-haystack problem. [sent-53, score-0.155]
21 Here the fitness of all strings is 0 except for one special string (the "needle") which has a fitness of 1. [sent-54, score-0.584]
22 Assume it is the string of all 1's. It is shown here that PBIL will only find the needle if α is exponentially small, and is inefficient at finding the needle when compared to random search. [sent-55, score-0.819]
23 Consider the probability of finding the needle at time t, denoted Ω(t) = ∏_i γ_i(t). Consider times shorter than T, where T is long enough that the needle may be found multiple times, but α²T → 0 as L → ∞. [sent-57, score-0.659]
24 It will be shown for small α that when the needle is not found (during drift), Ω decreases by an amount α²LΩ/2, whereas when the needle is found, Ω increases by the amount αLΩ. [sent-58, score-0.634]
25 Since initially the former happens at a rate 2^L times greater than the latter, α must be less than 2^{-(L-1)} for the system to move towards the hypercube corner near the optimum, rather than towards a random corner. [sent-59, score-0.58]
26 When the needle is not found, the mean of Ω(t) is invariant, ⟨Ω(t + 1)⟩ = Ω(t). [sent-60, score-0.29]
27 However, this is misleading, because Ω is not a self-averaging quantity; its mean is affected by exponentially unlikely events which have an exponentially big effect. [sent-61, score-0.092]
28 The recursion for Ω expanded to second order in α obeys [Ω(t + 1)] = [Ω(t)][1 - α²L/2] when the needle is not found, and [Ω(t + 1)] = [Ω(t)][1 + αL + α²L(L - 1)/2] when the needle is found. [sent-65, score-0.111]
30 Since the needle will be found with probability Ω(t) and not found with probability 1 - Ω(t), the recursion averages to [Ω(t + 1)] = [Ω(t)](1 - α²L/2) + [Ω(t)]²[αL - α²L(L + 1)/2]. [sent-68, score-0.421]
31 Equation (5) has a stable fixed point at 0 and an unstable fixed point at α/2 + O(α²L). [sent-70, score-0.212]
32 If the initial value Ω(0) is less than the unstable fixed point, Ω will decay to zero. [sent-71, score-0.125]
33 If Ω(0) is greater than the unstable fixed point, Ω will grow. [sent-72, score-0.162]
34 The initial value is Ω(0) = 2^{-L}, so the condition for the likelihood of finding the needle to increase rather than decrease is α < 2^{-(L-1)}. [sent-73, score-0.342]
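As a rough numerical illustration of how restrictive this condition is, the snippet below evaluates the threshold 2^{-(L-1)} together with the 4^L convergence scaling quoted a few sentences below, for a few string lengths (plain arithmetic on the stated formulas, nothing more).

```python
# Threshold alpha < 2**-(L-1) for the needle-in-a-haystack problem, and the
# quoted ~4**L scaling of the convergence time (illustrative arithmetic only).
for L in (8, 16, 32, 64):
    print(f"L = {L:3d}: alpha must be below {2.0 ** -(L - 1):.2e}; "
          f"convergence time scales like {4.0 ** L:.2e}")
```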
35 The algorithm is run until no parameters are between 0. [sent-76, score-0.096]
36 Left: Fitness of best population member at convergence versus α. [sent-79, score-0.204]
37 The non-robustness of the algorithm is clear; as L increases, α must be very finely set to a very small value to find the optimum. [sent-80, score-0.152]
38 The data approximately collapses, which shows that as L increases, α must decrease like 2^{-L} to get the same performance. [sent-82, score-0.101]
39 These confirm the predictions made above: the optimum is found only if α is smaller than a constant times 2^{-L}. [sent-84, score-0.326]
40 The algorithm is inefficient because it requires such a small α; convergence to the optimum scales like 4^L. [sent-85, score-0.385]
41 This is because the rate of convergence to the optimum goes like Ωα, both of which are O(2^{-L}). [sent-86, score-0.374]
42 3 PBIL and functions of unitation. One might think that the needle-in-the-haystack problem is hard in a special way, and that results on this problem are not relevant to other problems. [sent-88, score-0.121]
43 Assume the optimum occurs when all components are 1. [sent-91, score-0.261]
44 The parameters γ can be decomposed into components parallel and perpendicular to the optimum. [sent-92, score-0.091]
45 Movement along the perpendicular direction is neutral; only movement towards or away from the optimum changes the fitness. [sent-93, score-0.514]
46 The random strings generated at the start of the algorithm are almost entirely perpendicular to the global optimum, projecting only an amount of order 1/√L. [sent-94, score-0.166]
47 The perpendicular direction is flat, so there is convergence towards an arbitrary hypercube corner with a drift rate η. [sent-98, score-0.478]
48 Movement towards the global optimum occurs at a rate τ_∥ ~ α/√L (equation (7)). Thus, α must be small compared to 1/√L for movement towards the global optimum to win. [sent-102, score-0.96]
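For a sense of scale, the following lines contrast the two requirements derived so far, α small compared to 1/√L for the unitation function versus α < 2^{-(L-1)} for the needle (again just arithmetic on the stated bounds).

```python
import math

# The two requirements side by side (arithmetic on the stated bounds only).
for L in (16, 64, 256):
    print(f"L = {L:3d}: unitation needs alpha << {1 / math.sqrt(L):.3f}; "
          f"needle needs alpha < {2.0 ** -(L - 1):.1e}")
```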
49 A rough argument can be used to show how the fitness in the final population depends on α. [sent-103, score-0.4]
50 Assuming that the convergence of the variance is primarily due to the convergence on the flat subspace, this can be solved as ⟨u(∞)⟩ ≈ √(log N) / (1/2 + α√(2L)) (equation (9)). [sent-105, score-0.273]
51 Equation (9) must break down when the fitness approaches one, which is where the Gaussian approximation to the binomial breaks down. [sent-106, score-0.405]
52 Figure 2: Simulations of PBIL on the unitation function for L = 16, 32, 64, 128, 256 (respectively □, ○, +, *, △). [sent-130, score-0.121]
53 The algorithm is run until all parameters are closer to 1 or 0 than 0. [sent-131, score-0.096]
54 Left: Fitness of best population member at convergence versus α. [sent-133, score-0.204]
55 The fitness is scaled so that the global optimum has fitness 1 and the expected fitness of a random string is 0. [sent-134, score-1.117]
56 As L increases, α must be set to a decreasing value to find the optimum. [sent-135, score-0.128]
57 The data approximately collapses, which shows that as L increases, α must decrease like 1/√L to get the same performance. [sent-137, score-0.101]
58 Simulations of PBIL on the unitation function confirm these predictions. [sent-139, score-0.163]
59 PBIL fails to converge to the global optimum unless α is small compared to 1/√L. [sent-140, score-0.301]
60 Figure 2 shows the scaling of fitness at convergence with α√L, and compares simulations with equation (9). [sent-141, score-0.427]
61 4 Corrective 1 - Detailed Balance PBIL. One view is that the problem arises because the learning dynamics does not obey detailed balance. [sent-142, score-0.214]
62 Even on a flat space, the rate of movement of the parameters γ_i away from 1/2 is greater than the movement back. [sent-143, score-0.465]
63 It is well known that a Markov process on variables x will converge to a desired equilibrium distribution π(x) if the transition probabilities obey the detailed balance conditions w(x'|x)π(x) = w(x|x')π(x') (equation (10)), where w(x'|x) is the probability of generating x' from x. [sent-144, score-0.369]
64 Thus, any search algorithm searching on a flat space should have dynamics which obeys w(x'|x) = w(x|x') (equation (11)), and PBIL does not obey this. [sent-145, score-0.339]
65 There is a difficulty in modifying the dynamics of PBIL to satisfy detailed balance, however. [sent-147, score-0.174]
66 This can be fixed by constraining the parameters to lie on a lattice. [sent-149, score-0.119]
67 Then the dynamics can be altered to enforce detailed balance. [sent-150, score-0.174]
68 Learning dynamics now consists of incrementing and decrementing the n_i's by 1; when x_i = 1 (0), n_i is incremented (decremented). [sent-155, score-0.17]
69 1 Detailed balance by rejection sampling. One of the easiest methods for sampling from a distribution is to use the rejection method. [sent-159, score-0.303]
70 The detailed balance condition becomes g(x'|x)A(x'|x)π(x) = g(x|x')A(x|x')π(x'). [sent-162, score-0.162]
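One concrete way to realize an acceptance function A of this kind is a Metropolis-Hastings rule with a uniform target on the lattice of parameter values; the sketch below is an illustrative reading of the construction, not necessarily the paper's exact choice of g and A. Parameters are stored as integer counts n_i out of M lattice levels (γ_i = n_i/M), a sampled bit proposes an increment or decrement, and the move is accepted with probability min(1, g(reverse)/g(forward)), which equalizes forward and reverse rates on a flat landscape.

```python
import numpy as np

def db_update(n, M, x, rng):
    """One Detailed Balance PBIL step on lattice parameters (illustrative
    sketch, not necessarily the paper's exact rule).

    gamma_i = n_i / M.  A sampled bit x_i = 1 proposes n_i -> n_i + 1 and
    x_i = 0 proposes n_i -> n_i - 1.  An increment is proposed with probability
    gamma_i and a decrement with probability 1 - gamma_i, so accepting with the
    Metropolis-Hastings ratio for a uniform (flat) target makes forward and
    reverse rates equal: w(x'|x) = w(x|x') on a flat landscape.
    """
    gamma = n / M
    for i, xi in enumerate(x):
        if xi == 1 and 0 < n[i] < M:
            accept = min(1.0, (1 - (n[i] + 1) / M) / gamma[i])    # g(down from n+1) / g(up from n)
        elif xi == 0 and 0 < n[i] < M:
            accept = min(1.0, ((n[i] - 1) / M) / (1 - gamma[i]))  # g(up from n-1) / g(down from n)
        else:
            continue                                              # boundary: leave n_i unchanged
        if rng.random() < accept:
            n[i] += 1 if xi == 1 else -1
    return n
```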
71 4 equally spaced for L = 8, 9, 10, 11, 12; 1000 trials of each, population size 20, with the same convergence criterion as before: simulation halts when all γ_i's are less than 0. [sent-171, score-0.179]
72 In none of those simulations did the algorithm fail to contain the global optimum in the final population. [sent-174, score-0.402]
73 For the function of unitation, Detailed Balance PBIL appears to always find the optimum if run long enough. [sent-175, score-0.376]
74 95), the algorithm did not always find the global optimum. [sent-178, score-0.139]
75 It produced an average fitness within 1% of the optimum for α between 0. [sent-179, score-0.51]
76 1 and L = 256 the average fitness fell as low as 4% below optimum. [sent-182, score-0.297]
77 However, this is much improved over standard PBIL (see figure 2) where the average fitness fell to 60% below the optimum in that range. [sent-183, score-0.583]
78 5 Corrective 2 - Probabilistic mutation. Another approach to controlling drift is to add an operator analogous to mutation in GAs. [sent-184, score-0.615]
79 Muhlenbein [5] has proposed that the analogous operator in EDAs estimates frequencies biased towards a random guess. [sent-186, score-0.249]
80 Then, the appropriate estimate of the probability of a 1 at site i is γ_i = (γ̂_i + m)/(1 + 2m) (equation (18)), where γ̂_i is the empirical frequency of a 1 at site i and m is a mutation-like parameter. [sent-188, score-0.102]
81 This will be recognized as the maximum posterior estimate of the binomial distribution, using as the prior a Beta distribution with both parameters equal to mN + 1; the prior biases the estimate towards 1/2. [sent-189, score-0.233]
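The MAP reading of equation (18) can be checked directly; in the short derivation below, k_i denotes the number of 1's observed at site i in the sample of size N and γ̂_i = k_i/N the empirical frequency (these symbols are introduced here only for the derivation).

```latex
% MAP estimate of a Bernoulli parameter gamma_i from k_i ones in N samples,
% under a Beta(mN+1, mN+1) prior (both parameters equal to mN+1):
%   posterior(gamma_i) \propto gamma_i^{k_i + mN} (1 - gamma_i)^{N - k_i + mN},
% which is maximized at
\gamma_i \;=\; \frac{k_i + mN}{N + 2mN}
         \;=\; \frac{k_i/N + m}{1 + 2m}
         \;=\; \frac{\hat{\gamma}_i + m}{1 + 2m},
% i.e. equation (18) with \hat{\gamma}_i = k_i / N; the prior pulls the estimate towards 1/2.
```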
82 Equation (19) gives the resulting learning rule; with m = 0 it gives the usual PBIL rule, and when repeatedly applied on a flat space it converges to 1/2. [sent-191, score-0.219]
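Equation (19) itself is not reproduced in this extract; one plausible form, assuming the biased estimate of equation (18) is substituted into the standard PBIL update, is sketched below. With m = 0 it reduces to the usual rule, and on a flat landscape its parameters relax towards 1/2 in expectation, matching the behaviour described.

```python
import numpy as np

def pbil_mutation_update(gamma, best, alpha, m):
    """Hypothetical PBIL-with-mutation update (an assumed form, not the paper's
    verbatim equation (19)): the target frequencies are biased towards 1/2 by m,
    as in equation (18), before the usual exponential-average update is applied.

    gamma, best: numpy arrays (best is the 0/1 best population member)."""
    target = (best + m) / (1 + 2 * m)             # equation (18)-style biased estimate
    return (1 - alpha) * gamma + alpha * target   # with m = 0 this is the usual PBIL rule
```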
83 First, mutation must be large enough to counteract the effects of drift towards random corners of the hypercube. [sent-194, score-0.512]
84 Thus, the fixed point of the average distance to 1/2, ⟨D(t + 1)⟩, defined in equation (2), must be sufficiently close to zero. [sent-195, score-0.232]
85 Second, mutation must be small enough that it does not interfere with movement towards the parameters near the optimum when the optimum is found. [sent-196, score-1.022]
86 Thus, the fixed point of equation (19) must be sufficiently close to 0 or 1. [sent-197, score-0.204]
87 Finally, a sample of size N sampled from the fixed point distribution near the hypercube corner containing the optimum should contain the optimum with a reasonable probability (say greater than 1 - e^{-1}). [sent-198, score-0.871]
88 1 Results. To satisfy the conditions in equation (20), the mutation rate was set to m ∝ α², and α was constrained to be smaller than log(N)/L. [sent-201, score-0.294]
89 It never failed to find the optimum for the needle-in-a-haystack problems for the sizes given previously. [sent-203, score-0.36]
90 For the functions of unitation, no improvement over standard PBIL is expected, since the scaling using mutation is worse, requiring α < 1/L rather than α < 1/√L. [sent-204, score-0.234]
91 However, with tuning of the mutation rate, the range of α's for which the optimum was always found could be increased over standard PBIL. [sent-207, score-0.481]
92 6 Conclusions. The learning rate of PBIL has to be very small for the algorithm to work, and unpredictably so, since it depends upon the problem size in a problem-dependent way. [sent-208, score-0.085]
93 Detailed balance fixed the problem dramatically in the two cases studied. [sent-210, score-0.249]
94 Using detailed balance, the algorithm consistently finds the optimum over the entire range of learning rates. [sent-211, score-0.4]
95 Mutation also fixed the problem when the parameters were chosen to satisfy a problem-independent set of inequalities. [sent-212, score-0.119]
96 The phenomenon studied here could hold in any EDA, because for any type of model, the probability is high of generating a population which reinforces the move just made. [sent-213, score-0.213]
97 Of the proposed correctives, detailed balance will be more difficult to generalize to models in which the structure is learned. [sent-216, score-0.318]
98 It requires an understanding of the algorithm's dynamics on a flat space, which may be very difficult to find in those cases. [sent-217, score-0.344]
99 The mutation-type operator will be easier to generalize, because it only requires a bias towards a random distribution. [sent-218, score-0.192]
100 Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. [sent-222, score-0.167]
wordName wordTfidf (topN-words)
[('pbil', 0.604), ('needle', 0.29), ('optimum', 0.261), ('fitness', 0.249), ('mutation', 0.197), ('flat', 0.169), ('balance', 0.162), ('population', 0.127), ('unitation', 0.121), ('detailed', 0.115), ('towards', 0.111), ('drift', 0.107), ('edas', 0.097), ('eda', 0.096), ('fixed', 0.087), ('corner', 0.085), ('genetic', 0.084), ('ni', 0.084), ('obeys', 0.084), ('xlx', 0.084), ('movement', 0.083), ('operator', 0.081), ('find', 0.075), ('peil', 0.072), ('vanish', 0.071), ('binomial', 0.067), ('hypercube', 0.064), ('crossover', 0.063), ('rate', 0.061), ('perpendicular', 0.059), ('dynamics', 0.059), ('simulations', 0.053), ('must', 0.053), ('convergence', 0.052), ('ii', 0.049), ('correctives', 0.048), ('fiat', 0.048), ('gas', 0.048), ('haystack', 0.048), ('inefficient', 0.048), ('manchester', 0.048), ('shapiro', 0.048), ('fell', 0.048), ('search', 0.047), ('exponentially', 0.046), ('stopping', 0.044), ('corners', 0.044), ('string', 0.043), ('strings', 0.043), ('confirm', 0.042), ('difficult', 0.041), ('obey', 0.04), ('global', 0.04), ('run', 0.04), ('collapses', 0.038), ('vl', 0.038), ('corrective', 0.038), ('unstable', 0.038), ('scaling', 0.037), ('greater', 0.037), ('illinois', 0.036), ('incremental', 0.036), ('equation', 0.036), ('move', 0.034), ('sensitivity', 0.034), ('analogous', 0.033), ('parameters', 0.032), ('graphical', 0.032), ('sampling', 0.031), ('increases', 0.031), ('root', 0.03), ('probability', 0.029), ('sufficiently', 0.028), ('rejection', 0.028), ('sensitive', 0.028), ('defined', 0.028), ('recursion', 0.027), ('met', 0.027), ('xi', 0.027), ('finding', 0.027), ('evolutionary', 0.027), ('scaled', 0.026), ('decrease', 0.025), ('solutions', 0.025), ('improved', 0.025), ('repeatedly', 0.025), ('member', 0.025), ('converges', 0.025), ('near', 0.024), ('never', 0.024), ('removing', 0.024), ('biased', 0.024), ('algorithm', 0.024), ('smooth', 0.024), ('final', 0.024), ('site', 0.024), ('distribution', 0.023), ('approximately', 0.023), ('phenomenon', 0.023), ('found', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
Author: J. L. Shapiro
Abstract: Population-based Incremental Learning is shown to require very sensitive scaling of its learning rate. The learning rate must scale with the system size in a problem-dependent way. This is shown in two problems: the needle-in-a-haystack, in which the learning rate must vanish exponentially in the system size, and in a smooth function in which the learning rate must vanish like the inverse square root of the system size. Two methods are proposed for removing this sensitivity. A learning dynamics which obeys detailed balance is shown to give consistent performance over the entire range of learning rates. An analog of mutation is shown to require a learning rate which scales as the inverse system size, but is problem independent. 1
2 0.055686399 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
Author: Lavi Shpigelman, Yoram Singer, Rony Paz, Eilon Vaadia
Abstract: Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this paper is the construction of biologically-motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach using the Spikernel and various standard kernels for the task of predicting hand movement velocities from cortical recordings. In all of our experiments all the kernels we tested outperform the standard scalar product used in regression with the Spikernel consistently achieving the best performance. 1
3 0.054477438 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus
Author: Willem H. Zuidema
Abstract: Language acquisition is a special kind of learning problem because the outcome of learning of one generation is the input for the next. That makes it possible for languages to adapt to the particularities of the learner. In this paper, I show that this type of language change has important consequences for models of the evolution and acquisition of syntax. 1 The Language Acquisition Problem For both artificial systems and non-human animals, learning the syntax of natural languages is a notoriously hard problem. All healthy human infants, in contrast, learn any of the approximately 6000 human languages rapidly, accurately and spontaneously. Any explanation of how they accomplish this difficult task must specify the (innate) inductive bias that human infants bring to bear, and the input data that is available to them. Traditionally, the inductive bias is termed - somewhat unfortunately -
4 0.050004765 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
Author: Yves Grandvalet, Stéphane Canu
Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.
5 0.048307892 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
Author: Patrik O. Hoyer, Aapo Hyvärinen
Abstract: The responses of cortical sensory neurons are notoriously variable, with the number of spikes evoked by identical stimuli varying significantly from trial to trial. This variability is most often interpreted as ‘noise’, purely detrimental to the sensory system. In this paper, we propose an alternative view in which the variability is related to the uncertainty, about world parameters, which is inherent in the sensory stimulus. Specifically, the responses of a population of neurons are interpreted as stochastic samples from the posterior distribution in a latent variable model. In addition to giving theoretical arguments supporting such a representational scheme, we provide simulations suggesting how some aspects of response variability might be understood in this framework.
6 0.04506864 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes
7 0.043317772 103 nips-2002-How Linear are Auditory Cortical Responses?
8 0.043107949 111 nips-2002-Independent Components Analysis through Product Density Estimation
9 0.042798433 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
10 0.041604586 119 nips-2002-Kernel Dependency Estimation
11 0.041392341 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
12 0.040611394 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
13 0.04008089 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
14 0.039921749 26 nips-2002-An Estimation-Theoretic Framework for the Presentation of Multiple Stimuli
15 0.03910492 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
16 0.038473893 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
17 0.036550753 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
18 0.036097508 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
19 0.036097173 128 nips-2002-Learning a Forward Model of a Reflex
20 0.036026258 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
topicId topicWeight
[(0, -0.127), (1, 0.005), (2, -0.033), (3, -0.019), (4, -0.015), (5, 0.011), (6, -0.024), (7, 0.041), (8, -0.019), (9, 0.009), (10, 0.007), (11, -0.029), (12, 0.061), (13, 0.007), (14, -0.037), (15, -0.013), (16, -0.001), (17, 0.016), (18, -0.039), (19, -0.045), (20, -0.01), (21, 0.047), (22, -0.008), (23, -0.025), (24, 0.066), (25, -0.034), (26, 0.043), (27, -0.08), (28, 0.039), (29, -0.016), (30, -0.039), (31, 0.052), (32, -0.088), (33, -0.048), (34, -0.026), (35, 0.13), (36, 0.031), (37, -0.031), (38, 0.094), (39, 0.033), (40, 0.072), (41, 0.105), (42, 0.161), (43, -0.028), (44, 0.019), (45, -0.003), (46, -0.066), (47, -0.017), (48, -0.086), (49, 0.085)]
simIndex simValue paperId paperTitle
same-paper 1 0.89731699 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
Author: J. L. Shapiro
Abstract: Population-based Incremental Learning is shown to require very sensitive scaling of its learning rate. The learning rate must scale with the system size in a problem-dependent way. This is shown in two problems: the needle-in-a-haystack, in which the learning rate must vanish exponentially in the system size, and in a smooth function in which the learning rate must vanish like the inverse square root of the system size. Two methods are proposed for removing this sensitivity. A learning dynamics which obeys detailed balance is shown to give consistent performance over the entire range of learning rates. An analog of mutation is shown to require a learning rate which scales as the inverse system size, but is problem independent. 1
2 0.53989899 111 nips-2002-Independent Components Analysis through Product Density Estimation
Author: Trevor Hastie, Rob Tibshirani
Abstract: We present a simple direct approach for solving the ICA problem, using density estimation and maximum likelihood. Given a candidate orthogonal frame, we model each of the coordinates using a semi-parametric density estimate based on cubic splines. Since our estimates have two continuous derivatives , we can easily run a second order search for the frame parameters. Our method performs very favorably when compared to state-of-the-art techniques. 1
3 0.45750919 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus
Author: Willem H. Zuidema
Abstract: Language acquisition is a special kind of learning problem because the outcome of learning of one generation is the input for the next. That makes it possible for languages to adapt to the particularities of the learner. In this paper, I show that this type of language change has important consequences for models of the evolution and acquisition of syntax. 1 The Language Acquisition Problem For both artificial systems and non-human animals, learning the syntax of natural languages is a notoriously hard problem. All healthy human infants, in contrast, learn any of the approximately 6000 human languages rapidly, accurately and spontaneously. Any explanation of how they accomplish this difficult task must specify the (innate) inductive bias that human infants bring to bear, and the input data that is available to them. Traditionally, the inductive bias is termed - somewhat unfortunately -
4 0.45080361 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
Author: Emanuel Todorov, Michael I. Jordan
Abstract: Behavioral goals are achieved reliably and repeatedly with movements rarely reproducible in their detail. Here we offer an explanation: we show that not only are variability and goal achievement compatible, but indeed that allowing variability in redundant dimensions is the optimal control strategy in the face of uncertainty. The optimal feedback control laws for typical motor tasks obey a “minimal intervention” principle: deviations from the average trajectory are only corrected when they interfere with the task goals. The resulting behavior exhibits task-constrained variability, as well as synergetic coupling among actuators—which is another unexplained empirical phenomenon.
5 0.43207181 47 nips-2002-Branching Law for Axons
Author: Dmitri B. Chklovskii, Armen Stepanyants
Abstract: What determines the caliber of axonal branches? We pursue the hypothesis that the axonal caliber has evolved to minimize signal propagation delays, while keeping arbor volume to a minimum. We show that for a general cost function the optimal diameters of mother (do) and daughter (d], d 2 ) branches at a bifurcation obey v v 路 d
6 0.40632582 178 nips-2002-Robust Novelty Detection with Single-Class MPM
7 0.37499776 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
8 0.35591406 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers
10 0.35354647 42 nips-2002-Bias-Optimal Incremental Problem Solving
11 0.34895319 85 nips-2002-Fast Kernels for String and Tree Matching
12 0.34226364 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
13 0.33827806 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes
14 0.33592248 96 nips-2002-Generalized² Linear² Models
15 0.33414483 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
16 0.33214435 153 nips-2002-Neural Decoding of Cursor Motion Using a Kalman Filter
17 0.3255589 185 nips-2002-Speeding up the Parti-Game Algorithm
18 0.31886238 194 nips-2002-The Decision List Machine
19 0.31856412 167 nips-2002-Rational Kernels
20 0.31139028 190 nips-2002-Stochastic Neighbor Embedding
topicId topicWeight
[(11, 0.017), (14, 0.012), (23, 0.018), (42, 0.07), (54, 0.121), (55, 0.042), (57, 0.023), (61, 0.33), (67, 0.029), (68, 0.027), (74, 0.077), (92, 0.035), (98, 0.101)]
simIndex simValue paperId paperTitle
1 0.85140544 182 nips-2002-Shape Recipes: Scene Representations that Refer to the Image
Author: William T. Freeman, Antonio Torralba
Abstract: The goal of low-level vision is to estimate an underlying scene, given an observed image. Real-world scenes (eg, albedos or shapes) can be very complex, conventionally requiring high dimensional representations which are hard to estimate and store. We propose a low-dimensional representation, called a scene recipe, that relies on the image itself to describe the complex scene configurations. Shape recipes are an example: these are the regression coefficients that predict the bandpassed shape from image data. We describe the benefits of this representation, and show two uses illustrating their properties: (1) we improve stereo shape estimates by learning shape recipes at low resolution and applying them at full resolution; (2) Shape recipes implicitly contain information about lighting and materials and we use them for material segmentation.
same-paper 2 0.73361158 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
Author: J. L. Shapiro
Abstract: Population-based Incremental Learning is shown to require very sensitive scaling of its learning rate. The learning rate must scale with the system size in a problem-dependent way. This is shown in two problems: the needle-in-a-haystack, in which the learning rate must vanish exponentially in the system size, and in a smooth function in which the learning rate must vanish like the inverse square root of the system size. Two methods are proposed for removing this sensitivity. A learning dynamics which obeys detailed balance is shown to give consistent performance over the entire range of learning rates. An analog of mutation is shown to require a learning rate which scales as the inverse system size, but is problem independent. 1
3 0.51104724 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
Author: Max Welling, Simon Osindero, Geoffrey E. Hinton
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.
4 0.5099026 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
5 0.50890923 10 nips-2002-A Model for Learning Variance Components of Natural Images
Author: Yan Karklin, Michael S. Lewicki
Abstract: We present a hierarchical Bayesian model for learning efficient codes of higher-order structure in natural images. The model, a non-linear generalization of independent component analysis, replaces the standard assumption of independence for the joint distribution of coefficients with a distribution that is adapted to the variance structure of the coefficients of an efficient image basis. This offers a novel description of higherorder image structure and provides a way to learn coarse-coded, sparsedistributed representations of abstract image properties such as object location, scale, and texture.
6 0.5075863 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
7 0.50697726 3 nips-2002-A Convergent Form of Approximate Policy Iteration
8 0.50672066 53 nips-2002-Clustering with the Fisher Score
9 0.50667655 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
10 0.50529552 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
11 0.50454992 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
12 0.50440174 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
13 0.50392985 46 nips-2002-Boosting Density Estimation
14 0.50386715 124 nips-2002-Learning Graphical Models with Mercer Kernels
15 0.50322133 27 nips-2002-An Impossibility Theorem for Clustering
16 0.50269616 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
17 0.50249958 169 nips-2002-Real-Time Particle Filters
18 0.50220495 2 nips-2002-A Bilinear Model for Sparse Coding
19 0.50074309 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
20 0.50069124 135 nips-2002-Learning with Multiple Labels