nips nips2011 knowledge-graph by maker-knowledge-mining

nips 2011 knowledge graph

papers list:

1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classiﬁers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by deﬁning a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

Author: Julie Dethier, Paul Nuyujukian, Chris Eliasmith, Terrence C. Stewart, Shauki A. Elasaad, Krishna V. Shenoy, Kwabena A. Boahen

Abstract: Motor prostheses aim to restore function to disabled patients. Despite compelling proof of concept systems, barriers to clinical translation remain. One challenge is to develop a low-power, fully-implantable system that dissipates only minimal power so as not to damage tissue. To this end, we implemented a Kalman-ﬁlter based decoder via a spiking neural network (SNN) and tested it in brain-machine interface (BMI) experiments with a rhesus monkey. The Kalman ﬁlter was trained to predict the arm’s velocity and mapped on to the SNN using the Neural Engineering Framework (NEF). A 2,000-neuron embedded Matlab SNN implementation runs in real-time and its closed-loop performance is quite comparable to that of the standard Kalman ﬁlter. The success of this closed-loop decoder holds promise for hardware SNN implementations of statistical signal processing algorithms on neuromorphic chips, which may offer power savings necessary to overcome a major obstacle to the successful clinical translation of neural motor prostheses. ∗ Present: Research Fellow F.R.S.-FNRS, Systmod Unit, University of Liege, Belgium. 1 1 Cortically-controlled motor prostheses: the challenge Motor prostheses aim to restore function for severely disabled patients by translating neural signals from the brain into useful control signals for prosthetic limbs or computer cursors. Several proof of concept demonstrations have shown encouraging results, but barriers to clinical translation still remain. One example is the development of a fully-implantable system that meets power dissipation constraints, but is still powerful enough to perform complex operations. A recently reported closedloop cortically-controlled motor prosthesis is capable of producing quick, accurate, and robust computer cursor movements by decoding neural signals (threshold-crossings) from a 96-electrode array in rhesus macaque premotor/motor cortex [1]-[4]. This, and previous designs (e.g., [5]), employ versions of the Kalman ﬁlter, ubiquitous in statistical signal processing. Such a ﬁlter and its variants are the state-of-the-art decoder for brain-machine interfaces (BMIs) in humans [5] and monkeys [2]. While these recent advances are encouraging, clinical translation of such BMIs requires fullyimplanted systems, which in turn impose severe power dissipation constraints. Even though it is an open, actively-debated question as to how much of the neural prosthetic system must be implanted, we note that there are no reports to date demonstrating a fully implantable 100-channel wireless transmission system, motivating performing decoding within the implanted chip. This computation is constrained by a stringent power budget: A 6 × 6mm2 implant must dissipate less than 10mW to avoid heating the brain by more than 1◦ C [6], which is believed to be important for long term cell health. With this power budget, current approaches can not scale to higher electrode densities or to substantially more computer-intensive decode/control algorithms. The feasibility of mapping a Kalman-ﬁlter based decoder algorithm [1]-[4] on to a spiking neural network (SNN) has been explored off-line (open-loop). In these off-line tests, the SNN’s performance virtually matched that of the standard implementation [7]. These simulations provide conﬁdence that this algorithm—and others similar to it—could be implemented using an ultra-low-power approach potentially capable of meeting the severe power constraints set by clinical translation. This neuromorphic approach uses very-large-scale integrated systems containing microelectronic analog circuits to morph neural systems into silicon chips [8, 9]. These neuromorphic circuits may yield tremendous power savings—50nW per silicon neuron [10]—over digital circuits because they use physical operations to perform mathematical computations (analog approach). When implemented on a chip designed using the neuromorphic approach, a 2,000-neuron SNN network can consume as little as 100µW. Demonstrating this approach’s feasibility in a closed-loop system running in real-time is a key, non-incremental step in the development of a fully implantable decoding chip, and is necessary before proceeding with fabricating and implanting the chip. As noise, delay, and over-ﬁtting play a more important role in the closed-loop setting, it is not obvious that the SNN’s stellar open-loop performance will hold up. In addition, performance criteria are different in the closed-loop and openloop settings (e.g., time per target vs. root mean squared error). Therefore, a SNN of a different size may be required to meet the desired speciﬁcations. Here we present results and assess the performance and viability of the SNN Kalman-ﬁlter based decoder in real-time, closed-loop tests, with the monkey performing a center-out-and-back target acquisition task. To achieve closed-loop operation, we developed an embedded Matlab implementation that ran a 2,000-neuron version of the SNN in real-time on a PC. We achieved almost a 50-fold speed-up by performing part of the computation in a lower-dimensional space deﬁned by the formal method we used to map the Kalman ﬁlter on to the SNN. This shortcut allowed us to run a larger SNN in real-time than would otherwise be possible. 2 Spiking neural network mapping of control theory algorithms As reported in [11], a formal methodology, called the Neural Engineering Framework (NEF), has been developed to map control-theory algorithms onto a computational fabric consisting of a highly heterogeneous population of spiking neurons simply by programming the strengths of their connections. These artiﬁcial neurons are characterized by a nonlinear multi-dimensional-vector-to-spikerate function—ai (x(t)) for the ith neuron—with parameters (preferred direction, maximum ﬁring rate, and spiking-threshold) drawn randomly from a wide distribution (standard deviation ≈ mean). 2 Spike rate (spikes/s) Representation ˆ x → ai (x) → x = ∑i ai (x)φix ˜ ai (x) = G(αi φix · x + Jibias ) 400 Transformation y = Ax → b j (Aˆ ) x Aˆ = ∑i ai (x)Aφix x x(t) B' y(t) A' 200 0 −1 Dynamics ˙ x = Ax → x = h ∗ A x A = τA + I 0 Stimulus x 1 bk(t) y(t) B' h(t) x(t) A' aj(t) Figure 1: NEF’s three principles. Representation. 1D tuning curves of a population of 50 leaky integrate-and-ﬁre neurons. The neurons’ tuning curves map control variables (x) to spike rates (ai (x)); this nonlinear transformation is inverted by linear weighted decoding. G() is the neurons’ nonlinear current-to-spike-rate function. Transformation. SNN with populations bk (t) and a j (t) representing y(t) and x(t). Feedforward and recurrent weights are determined by B and A , as described next. Dynamics. The system’s dynamics is captured in a neurally plausible fashion by replacing integration with the synapses’ spike response, h(t), and replacing the matrices with A = τA + I and B = τB to compensate. The neural engineering approach to conﬁguring SNNs to perform arbitrary computations is underlined by three principles (Figure 1) [11]-[14]: Representation is deﬁned by nonlinear encoding of x(t) as a spike rate, ai (x(t))—represented by the neuron tuning curve—combined with optimal weighted linear decoding of ai (x(t)) to recover ˆ an estimate of x(t), x(t) = ∑i ai (x(t))φix , where φix are the decoding weights. Transformation is performed by using alternate decoding weights in the decoding operation to map transformations of x(t) directly into transformations of ai (x(t)). For example, y(t) = Ax(t) is represented by the spike rates b j (Aˆ (t)), where unit j’s input is computed directly from unit i’s x output using Aˆ (t) = ∑i ai (x(t))Aφix , an alternative linear weighting. x Dynamics brings the ﬁrst two principles together and adds the time dimension to the circuit. This principle aims at reuniting the control-theory and neural levels by modifying the matrices to render the system neurally plausible, thereby permitting the synapses’ spike response, h(t), (i.e., impulse ˙ response) to capture the system’s dynamics. For example, for h(t) = τ −1 e−t/τ , x = Ax(t) is realized by replacing A with A = τA + I. This so-called neurally plausible matrix yields an equivalent dynamical system: x(t) = h(t) ∗ A x(t), where convolution replaces integration. The nonlinear encoding process—from a multi-dimensional stimulus, x(t), to a one-dimensional soma current, Ji (x(t)), to a ﬁring rate, ai (x(t))—is speciﬁed as: ai (x(t)) = G(Ji (x(t))). (1) Here G is the neurons’ nonlinear current-to-spike-rate function, which is given by G(Ji (x)) = τ ref − τ RC ln (1 − Jth /Ji (x)) −1 , (2) for the leaky integrate-and-ﬁre model (LIF). The LIF neuron has two behavioral regimes: subthreshold and super-threshold. The sub-threshold regime is described by an RC circuit with time constant τ RC . When the sub-threshold soma voltage reaches the threshold, Vth , the neuron emits a spike δ (t −tn ). After this spike, the neuron is reset and rests for τ ref seconds (absolute refractory period) before it resumes integrating. Jth = Vth /R is the minimum input current that produces spiking. Ignoring the soma’s RC time-constant when specifying the SNN’s dynamics are reasonable because the neurons cross threshold at a rate that is proportional to their input current, which thus sets the spike rate instantaneously, without any ﬁltering [11]. The conversion from a multi-dimensional stimulus, x(t), to a one-dimensional soma current, Ji , is ˜ performed by assigning to the neuron a preferred direction, φix , in the stimulus space and taking the dot-product: ˜ Ji (x(t)) = αi φix · x(t) + Jibias , (3) 3 where αi is a gain or conversion factor, and Jibias is a bias current that accounts for background ˜ activity. For a 1D space, φix is either +1 or −1 (drawn randomly), for ON and OFF neurons, respectively. The resulting tuning curves are illustrated in Figure 1, left. The linear decoding process is characterized by the synapses’ spike response, h(t) (i.e., post-synaptic currents), and the decoding weights, φix , which are obtained by minimizing the mean square error. A single noise term, η, takes into account all sources of noise, which have the effect of introducing uncertainty into the decoding process. Hence, the transmitted ﬁring rate can be written as ai (x(t)) + ηi , where ai (x(t)) represents the noiseless set of tuning curves and ηi is a random variable picked from a zero-mean Gaussian distribution with variance σ 2 . Consequently, the mean square error can be written as [11]: E = 1 ˆ [x(t) − x(t)]2 2 x,η,t = 2 1 2 x(t) − ∑ (ai (x(t)) + ηi ) φix i (4) x,η,t where · x,η denotes integration over the range of x and η, the expected noise. We assume that the noise is independent and has the same variance for each neuron [11], which yields: E= where σ2 1 2 2 x(t) − ∑ ai (x(t))φix i x,t 1 + σ 2 ∑(φix )2 , 2 i (5) is the noise variance ηi η j . This expression is minimized by: N φix = ∑ Γ−1 ϒ j , ij (6) j with Γi j = ai (x)a j (x) x + σ 2 δi j , where δ is the Kronecker delta function matrix, and ϒ j = xa j (x) x [11]. One consequence of modeling noise in the neural representation is that the matrix Γ is invertible despite the use of a highly overcomplete representation. In a noiseless representation, Γ is generally singular because, due to the large number of neurons, there is a high probability of having two neurons with similar tuning curves leading to two similar rows in Γ. 3 Kalman-ﬁlter based cortical decoder In the 1960’s, Kalman described a method that uses linear ﬁltering to track the state of a dynamical system throughout time using a model of the dynamics of the system as well as noisy measurements [15]. The model dynamics gives an estimate of the state of the system at the next time step. This estimate is then corrected using the observations (i.e., measurements) at this time step. The relative weights for these two pieces of information are given by the Kalman gain, K [15, 16]. Whereas the Kalman gain is updated at each iteration, the state and observation matrices (deﬁned below)—and corresponding noise matrices—are supposed constant. In the case of prosthetic applications, the system’s state vector is the cursor’s kinematics, xt = y [veltx , velt , 1], where the constant 1 allows for a ﬁxed offset compensation. The measurement vector, yt , is the neural spike rate (spike counts in each time step) of 192 channels of neural threshold crossings. The system’s dynamics is modeled by: xt yt = Axt−1 + wt , = Cxt + qt , (7) (8) where A is the state matrix, C is the observation matrix, and wt and qt are additive, Gaussian noise sources with wt ∼ N (0, W) and qt ∼ N (0, Q). The model parameters (A, C, W and Q) are ﬁt with training data by correlating the observed hand kinematics with the simultaneously measured neural signals (Figure 2). For an efﬁcient decoding, we derived the steady-state update equation by replacing the adaptive Kalman gain by its steady-state formulation: K = (I + WCQ−1 C)−1 W CT Q−1 . This yields the following estimate of the system’s state: xt = (I − KC)Axt−1 + Kyt = MDT xt−1 + MDT yt , x y 4 (9) a Velocity (cm/s) Neuron 10 c 150 5 100 b 50 20 0 −20 0 0 x−velocity y−velocity 2000 4000 6000 8000 Time (ms) 10000 12000 1cm 14000 Trials: 0034-0049 Figure 2: Neural and kinematic measurements (monkey J, 2011-04-16, 16 continuous trials) used to ﬁt the standard Kalman ﬁlter model. a. The 192 cortical recordings fed as input to ﬁt the Kalman ﬁlter’s matrices (color code refers to the number of threshold crossings observed in each 50ms bin). b. Hand x- and y-velocity measurements correlated with the neural data to obtain the Kalman ﬁlter’s matrices. c. Cursor kinematics of 16 continuous trials under direct hand control. where MDT = (I − KC)A and MDT = K are the discrete time (DT) Kalman matrices. The steadyx y state formulation improves efﬁciency with little loss in accuracy because the optimal Kalman gain rapidly converges (typically less than 100 iterations). Indeed, in neural applications under both open-loop and closed-loop conditions, the difference between the full Kalman ﬁlter and its steadystate implementation falls to within 1% in a few seconds [17]. This simplifying assumption reduces the execution time for decoding a typical neuronal ﬁring rate signal approximately seven-fold [17], a critical speed-up for real-time applications. 4 Kalman ﬁlter with a spiking neural network To implement the Kalman ﬁlter with a SNN by applying the NEF, we ﬁrst convert Equation 9 from DT to continuous time (CT), and then replace the CT matrices with neurally plausible ones, which yields: x(t) = h(t) ∗ A x(t) + B y(t) , (10) where A = τMCT + I, B = τMCT , with MCT = MDT − I /∆t and MCT = MDT /∆t, the CT x y x x y y Kalman matrices, and ∆t = 50ms, the discrete time step; τ is the synaptic time-constant. The jth neuron’s input current (see Equation 3) is computed from the system’s current state, x(t), which is computed from estimates of the system’s previous state (ˆ (t) = ∑i ai (t)φix ) and current x y input (ˆ (t) = ∑k bk (t)φk ) using Equation 10. This yields: y ˜x J j (x(t)) = α j φ j · x(t) + J bias j ˜x ˆ ˆ = α j φ j · h(t) ∗ A x(t) + B y(t) ˜x = α j φ j · h(t) ∗ A + J bias j ∑ ai (t)φix + B ∑ bk (t)φky i + J bias j (11) k This last equation can be written in a neural network form: J j (x(t)) = h(t) ∗ ∑ ω ji ai (t) + ∑ ω jk bk (t) i + J bias j (12) k y ˜x ˜x where ω ji = α j φ j A φix and ω jk = α j φ j B φk are the recurrent and feedforward weights, respectively. 5 Efﬁcient implementation of the SNN In this section, we describe the two distinct steps carried out when implementing the SNN: creating and running the network. The ﬁrst step has no computational constraints whereas the second must be very efﬁcient in order to be successfully deployed in the closed-loop experimental setting. 5 x ( 1000 x ( = 1000 1000 = 1000 x 1000 b 1000 x 1000 1000 a Figure 3: Computing a 1000-neuron pool’s recurrent connections. a. Using connection weights requires multiplying a 1000×1000 matrix by a 1000 ×1 vector. b. Operating in the lower-dimensional state space requires multiplying a 1 × 1000 vector by a 1000 × 1 vector to get the decoded state, multiplying this state by a component of the A matrix to update it, and multiplying the updated state by a 1000 × 1 vector to re-encode it as ﬁring rates, which are then used to update the soma current for every neuron. Network creation: This step generates, for a speciﬁed number of neurons composing the network, x ˜x the gain α j , bias current J bias , preferred direction φ j , and decoding weight φ j for each neuron. The j ˜x preferred directions φ j are drawn randomly from a uniform distribution over the unit sphere. The maximum ﬁring rate, max G(J j (x)), and the normalized x-axis intercept, G(J j (x)) = 0, are drawn randomly from a uniform distribution on [200, 400] Hz and [-1, 1], respectively. From these two speciﬁcations, α j and J bias are computed using Equation 2 and Equation 3. The decoding weights j x φ j are computed by minimizing the mean square error (Equation 6). For efﬁcient implementation, we used two 1D integrators (i.e., two recurrent neuron pools, with each pool representing a scalar) rather than a single 3D integrator (i.e., one recurrent neuron pool, with the pool representing a 3D vector by itself) [13]. The constant 1 is fed to the 1D integrators as an input, rather than continuously integrated as part of the state vector. We also replaced the bk (t) units’ spike rates (Figure 1, middle) with the 192 neural measurements (spike counts in 50ms bins), y which is equivalent to choosing φk from a standard basis (i.e., a unit vector with 1 at the kth position and 0 everywhere else) [7]. Network simulation: This step runs the simulation to update the soma current for every neuron, based on input spikes. The soma voltage is then updated following RC circuit dynamics. Gaussian noise is normally added at this step, the rest of the simulation being noiseless. Neurons with soma voltage above threshold generate a spike and enter their refractory period. The neuron ﬁring rates are decoded using the linear decoding weights to get the updated states values, x and y-velocity. These values are smoothed with a ﬁlter identical to h(t), but with τ set to 5ms instead of 20ms to avoid introducing signiﬁcant delay. Then the simulation step starts over again. In order to ensure rapid execution of the simulation step, neuron interactions are not updated dix rectly using the connection matrix (Equation 12), but rather indirectly with the decoding matrix φ j , ˜x dynamics matrix A , and preferred direction matrix φ j (Equation 11). To see why this is more efﬁcient, suppose we have 1000 neurons in the a population for each of the state vector’s two scalars. Computing the recurrent connections using connection weights requires multiplying a 1000 × 1000 matrix by a 1000-dimensional vector (Figure 3a). This requires 106 multiplications and about 106 sums. Decoding each scalar (i.e., ∑i ai (t)φix ), however, requires only 1000 multiplications and 1000 sums. The decoded state vector is then updated by multiplying it by the (diagonal) A matrix, another 2 products and 1 sum. The updated state vector is then encoded by multiplying it with the neurons’ preferred direction vectors, another 1000 multiplications per scalar (Figure 3b). The resulting total of about 3000 operations is nearly three orders of magnitude fewer than using the connection weights to compute the identical transformation. To measure the speedup, we simulated a 2,000-neuron network on a computer running Matlab 2011a (Intel Core i7, 2.7-GHz, Mac OS X Lion). Although the exact run-times depend on the computing hardware and software, the run-time reduction factor should remain approximately constant across platforms. For each reported result, we ran the simulation 10 times to obtain a reliable estimate of the execution time. The run-time for neuron interactions using the recurrent connection weights was 9.9ms and dropped to 2.7µs in the lower-dimensional space, approximately a 3,500-fold speedup. Only the recurrent interactions beneﬁt from the speedup, the execution time for the rest of the operations remaining constant. The run-time for a 50ms network simulation using the recurrent connec6 Table 1: Model parameters Symbol max G(J j (x)) G(J j (x)) = 0 J bias j αj ˜x φj Range 200-400 Hz −1 to 1 Satisﬁes ﬁrst two Satisﬁes ﬁrst two ˜x φj = 1 Description Maximum ﬁring rate Normalized x-axis intercept Bias current Gain factor Preferred-direction vector σ2 τ RC j τ ref j τ PSC j 0.1 20 ms 1 ms 20 ms Gaussian noise variance RC time constant Refractory period PSC time constant tion weights was 0.94s and dropped to 0.0198s in the lower-dimensional space, a 47-fold speedup. These results demonstrate the efﬁciency the lower-dimensional space offers, which made the closedloop application of SNNs possible. 6 Closed-loop implementation An adult male rhesus macaque (monkey J) was trained to perform a center-out-and-back reaching task for juice rewards to one of eight targets, with a 500ms hold time (Figure 4a) [1]. All animal protocols and procedures were approved by the Stanford Institutional Animal Care and Use Committee. Hand position was measured using a Polaris optical tracking system at 60Hz (Northern Digital Inc.). Neural data were recorded from two 96-electrode silicon arrays (Blackrock Microsystems) implanted in the dorsal pre-motor and motor cortex. These recordings (-4.5 RMS threshold crossing applied to each electrode’s signal) yielded tuned activity for the direction and speed of arm movements. As detailed in [1], a standard Kalman ﬁlter model was ﬁt by correlating the observed hand kinematics with the simultaneously measured neural signals, while the monkey moved his arm to acquire virtual targets (Figure 2). The resulting model was used in a closed-loop system to control an on-screen cursor in real-time (Figure 4a, Decoder block). A steady-state version of this model serves as the standard against which the SNN implementation’s performance is compared. We built a SNN using the NEF methodology based on derived Kalman ﬁlter parameters mentioned above. This SNN was then simulated on an xPC Target (Mathworks) x86 system (Dell T3400, Intel Core 2 Duo E8600, 3.33GHz). It ran in closed-loop, replacing the standard Kalman ﬁlter as the decoder block in Figure 4a. The parameter values listed in Table 1 were used for the SNN implementation. We ensured that the time constants τiRC ,τiref , and τiPSC were smaller than the implementation’s time step (50ms). Noise was not explicitly added. It arose naturally from the ﬂuctuations produced by representing a scalar with ﬁltered spike trains, which has been shown to have effects similar to Gaussian noise [11]. For the purpose of computing the linear decoding weights (i.e., Γ), we modeled the resulting noise as Gaussian with a variance of 0.1. A 2,000-neuron version of the SNN-based decoder was tested in a closed-loop system, the largest network our embedded MatLab implementation could run in real-time. There were 1206 trials total among which 301 (center-outs only) were performed with the SNN and 302 with the standard (steady-state) Kalman ﬁlter. The block structure was randomized and interleaved, so that there is no behavioral bias present in the ﬁndings. 100 trials under hand control are used as a baseline comparison. Success corresponds to a target acquisition under 1500ms, with 500ms hold time. Success rates were higher than 99% on all blocks for the SNN implementation and 100% for the standard Kalman ﬁlter. The average time to acquire the target was slightly slower for the SNN (Figure 5b)—711ms vs. 661ms, respectively—we believe this could be improved by using more neurons in the SNN.1 The average distance to target (Figure 5a) and the average velocity of the cursor (Figure 5c) are very similar. 1 Off-line, the SNN performed better as we increased the number of neurons [7]. 7 a Neural Spikes b c BMI: Kalman decoder BMI: SNN decoder Decoder Cursor Velocity 1cm 1cm Trials: 2056-2071 Trials: 1748-1763 5 0 0 400 Time after Target Onset (ms) 800 Target acquisition time histogram 40 Mean cursor velocity 50 Standard Kalman filter 40 20 Hand 30 30 Spiking Neural Network 20 10 0 c Cursor Velocity (cm/s) b Mean distance to target 10 Percent of Trials (%) a Distance to Target (cm) Figure 4: Experimental setup and results. a. Data are recorded from two 96-channel silicon electrode arrays implanted in dorsal pre-motor and motor cortex of an adult male monkey performing a centerout-and-back reach task for juice rewards to one of eight targets with a 500ms hold time. b. BMI position kinematics of 16 continuous trials for the standard Kalman ﬁlter implementation. c. BMI position kinematics of 16 continuous trials for the SNN implementation. 10 0 500 1000 Target Acquire Time (ms) 1500 0 0 200 400 600 800 Time after Target Onset (ms) 1000 Figure 5: SNN (red) performance compared to standard Kalman ﬁlter (blue) (hand trials are shown for reference (yellow)). The SNN achieves similar results—success rates are higher than 99% on all blocks—as the standard Kalman ﬁlter implementation. a. Plot of distance to target vs. time both after target onset for different control modalities. The thicker traces represent the average time when the cursor ﬁrst enters the acceptance window until successfully entering for the 500ms hold time. b. Histogram of target acquisition time. c. Plot of mean cursor velocity vs. time. 7 Conclusions and future work The SNN’s performance was quite comparable to that produced by a standard Kalman ﬁlter implementation. The 2,000-neuron network had success rates higher than 99% on all blocks, with mean distance to target, target acquisition time, and mean cursor velocity curves very similar to the ones obtained with the standard implementation. Future work will explore whether these results extend to additional animals. As the Kalman ﬁlter and its variants are the state-of-the-art in cortically-controlled motor prostheses [1]-[5], these simulations provide conﬁdence that similar levels of performance can be attained with a neuromorphic system, which can potentially overcome the power constraints set by clinical applications. Our ultimate goal is to develop an ultra-low-power neuromorphic chip for prosthetic applications on to which control theory algorithms can be mapped using the NEF. As our next step in this direction, we will begin exploring this mapping with Neurogrid, a hardware platform with sixteen programmable neuromorphic chips that can simulate up to a million spiking neurons in real-time [9]. However, bandwidth limitations prevent Neurogrid from realizing random connectivity patterns. It can only connect each neuron to thousands of others if neighboring neurons share common inputs — just as they do in the cortex. Such columnar organization may be possible with NEF-generated networks if preferred directions vectors are assigned topographically rather than randomly. Implementing this constraint effectively is a subject of ongoing research. Acknowledgment This work was supported in part by the Belgian American Education Foundation(J. Dethier), Stanford NIH Medical Scientist Training Program (MSTP) and Soros Fellowship (P. Nuyujukian), DARPA Revolutionizing Prosthetics program (N66001-06-C-8005, K. V. Shenoy), and two NIH Director’s Pioneer Awards (DP1-OD006409, K. V. Shenoy; DPI-OD000965, K. Boahen). 8 References [1] V. Gilja, Towards clinically viable neural prosthetic systems, Ph.D. Thesis, Department of Computer Science, Stanford University, 2010, pp 19–22 and pp 57–73. [2] V. Gilja, P. Nuyujukian, C.A. Chestek, J.P. Cunningham, J.M. Fan, B.M. Yu, S.I. Ryu, and K.V. Shenoy, A high-performance continuous cortically-controlled prosthesis enabled by feedback control design, 2010 Neuroscience Meeting Planner, San Diego, CA: Society for Neuroscience, 2010. [3] P. Nuyujukian, V. Gilja, C.A. Chestek, J.P. Cunningham, J.M. Fan, B.M. Yu, S.I. Ryu, and K.V. Shenoy, Generalization and robustness of a continuous cortically-controlled prosthesis enabled by feedback control design, 2010 Neuroscience Meeting Planner, San Diego, CA: Society for Neuroscience, 2010. [4] V. Gilja, C.A. Chestek, I. Diester, J.M. Henderson, K. Deisseroth, and K.V. Shenoy, Challenges and opportunities for next-generation intra-cortically based neural prostheses, IEEE Transactions on Biomedical Engineering, 2011, in press. [5] S.P. Kim, J.D. Simeral, L.R. Hochberg, J.P. Donoghue, and M.J. Black, Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia, Journal of Neural Engineering, vol. 5, 2008, pp 455–476. [6] S. Kim, P. Tathireddy, R.A. Normann, and F. Solzbacher, Thermal impact of an active 3-D microelectrode array implanted in the brain, IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 15, 2007, pp 493–501. [7] J. Dethier, V. Gilja, P. Nuyujukian, S.A. Elassaad, K.V. Shenoy, and K. Boahen, Spiking neural network decoder for brain-machine interfaces, IEEE Engineering in Medicine & Biology Society Conference on Neural Engineering, Cancun, Mexico, 2011, pp 396–399. [8] K. Boahen, Neuromorphic microchips, Scientiﬁc American, vol. 292(5), 2005, pp 56–63. [9] R. Silver, K. Boahen, S. Grillner, N. Kopell, and K.L. Olsen, Neurotech for neuroscience: unifying concepts, organizing principles, and emerging tools, Journal of Neuroscience, vol. 27(44), 2007, pp 11807– 11819. [10] J.V. Arthur and K. Boahen, Silicon neuron design: the dynamical systems approach, IEEE Transactions on Circuits and Systems, vol. 58(5), 2011, pp 1034-1043. [11] C. Eliasmith and C.H. Anderson, Neural engineering: computation, representation, and dynamics in neurobiological systems, MIT Press, Cambridge, MA; 2003. [12] C. Eliasmith, A uniﬁed approach to building and controlling spiking attractor networks, Neural Computation, vol. 17, 2005, pp 1276–1314. [13] R. Singh and C. Eliasmith, Higher-dimensional neurons explain the tuning and dynamics of working memory cells, The Journal of Neuroscience, vol. 26(14), 2006, pp 3667–3678. [14] C. Eliasmith, How to build a brain: from function to implementation, Synthese, vol. 159(3), 2007, pp 373–388. [15] R.E. Kalman, A new approach to linear ﬁltering and prediction problems, Transactions of the ASME– Journal of Basic Engineering, vol. 82(Series D), 1960, pp 35–45. [16] G. Welsh and G. Bishop, An introduction to the Kalman Filter, University of North Carolina at Chapel Hill Chapel Hill NC, vol. 95(TR 95-041), 1995, pp 1–16. [17] W.Q. Malik, W. Truccolo, E.N. Brown, and L.R. Hochberg, Efﬁcient decoding with steady-state Kalman ﬁlter in neural interface systems, IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 19(1), 2011, pp 25–34. 9

3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

Author: Jacob D. Abernethy, Rafael M. Frongillo

Abstract: Machine Learning competitions such as the Netﬂix Prize have proven reasonably successful as a method of “crowdsourcing” prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively “learn” a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will proﬁt an amount that scales according to how much her update improves performance on a released test set. 1

4 nips-2011-A Convergence Analysis of Log-Linear Training

Author: Simon Wiesler, Hermann Ney

Abstract: Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our ﬁndings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach. 1

5 nips-2011-A Denoising View of Matrix Completion

Author: Weiran Wang, Zhengdong Lu, Miguel Á. Carreira-Perpiñán

Abstract: In matrix completion, we are given a matrix where the values of only some of the entries are present, and we want to reconstruct the missing ones. Much work has focused on the assumption that the data matrix has low rank. We propose a more general assumption based on denoising, so that we expect that the value of a missing entry can be predicted from the values of neighboring points. We propose a nonparametric version of denoising based on local, iterated averaging with meanshift, possibly constrained to preserve local low-rank manifold structure. The few user parameters required (the denoising scale, number of neighbors and local dimensionality) and the number of iterations can be estimated by cross-validating the reconstruction error. Using our algorithms as a postprocessing step on an initial reconstruction (provided by e.g. a low-rank method), we show consistent improvements with synthetic, image and motion-capture data. Completing a matrix from a few given entries is a fundamental problem with many applications in machine learning, computer vision, network engineering, and data mining. Much interest in matrix completion has been caused by recent theoretical breakthroughs in compressed sensing [1, 2] as well as by the now celebrated Netﬂix challenge on practical prediction problems [3, 4]. Since completion of arbitrary matrices is not a well-posed problem, it is often assumed that the underlying matrix comes from a restricted class. Matrix completion models almost always assume a low-rank structure of the matrix, which is partially justiﬁed through factor models [4] and fast convex relaxation [2], and often works quite well when the observations are sparse and/or noisy. The low-rank structure of the matrix essentially asserts that all the column vectors (or the row vectors) live on a low-dimensional subspace. This assumption is arguably too restrictive for problems with richer structure, e.g. when each column of the matrix represents a snapshot of a seriously corrupted motion capture sequence (see section 3), for which a more ﬂexible model, namely a curved manifold, is more appropriate. In this paper, we present a novel view of matrix completion based on manifold denoising, which conceptually generalizes the low-rank assumption to curved manifolds. Traditional manifold denoising is performed on fully observed data [5, 6], aiming to send the data corrupted by noise back to the correct surface (deﬁned in some way). However, with a large proportion of missing entries, we may not have a good estimate of the manifold. Instead, we start with a poor estimate and improve it iteratively. Therefore the “noise” may be due not just to intrinsic noise, but mostly to inaccurately estimated missing entries. We show that our algorithm can be motivated from an objective purely based on denoising, and prove its convergence under some conditions. We then consider a more general case with a nonlinear low-dimensional manifold and use a stopping criterion that works successfully in practice. Our model reduces to a low-rank model when we require the manifold to be ﬂat, showing a relation with a recent thread of matrix completion models based on alternating projection [7]. In our experiments, we show that our denoising-based matrix completion model can make better use of the latent manifold structure on both artiﬁcial and real-world data sets, and yields superior recovery of the missing entries. The paper is organized as follows: section 1 reviews nonparametric denoising methods based on mean-shift updates, section 2 extends this to matrix completion by using denoising with constraints, section 3 gives experimental results, and section 4 discusses related work. 1 1 Denoising with (manifold) blurring mean-shift algorithms (GBMS/MBMS) In Gaussian blurring mean-shift (GBMS), denoising is performed in a nonparametric way by local averaging: each data point moves to the average of its neighbors (to a certain scale), and the process is repeated. We follow the derivation in [8]. Consider a dataset {xn }N ⊂ RD and deﬁne a n=1 Gaussian kernel density estimate p(x) = 1 N N Gσ (x, xn ) (1) n=1 1 with bandwidth σ > 0 and kernel Gσ (x, xn ) ∝ exp − 2 ( x − xn /σ)2 (other kernels may be used, such as the Epanechnikov kernel, which results in sparse afﬁnities). The (non-blurring) mean-shift algorithm rearranges the stationary point equation ∇p(x) = 0 into the iterative scheme x(τ +1) = f (x(τ ) ) with N x (τ +1) = f (x (τ ) p(n|x )= (τ ) )xn p(n|x (τ ) n=1 )= exp − 1 (x(τ ) − xn )/σ 2 N n′ =1 2 exp − 1 (x(τ ) − xn′ )/σ 2 2 . (2) This converges to a mode of p from almost every initial x ∈ RD , and can be seen as taking selfadapting step sizes along the gradient (since the mean shift f (x) − x is parallel to ∇p(x)). This iterative scheme was originally proposed by [9] and it or variations of it have found widespread application in clustering [8, 10–12] and denoising of 3D point sets (surface fairing; [13, 14]) and manifolds in general [5, 6]. The blurring mean-shift algorithm applies one step of the previous scheme, initialized from every point, in parallel for all points. That is, given the dataset X = {x1 , . . . , xN }, for each xn ∈ X ˜ we obtain a new point xn = f (xn ) by applying one step of the mean-shift algorithm, and then we ˜ replace X with the new dataset X, which is a blurred (shrunk) version of X. By iterating this process we obtain a sequence of datasets X(0) , X(1) , . . . (and a corresponding sequence of kernel density estimates p(0) (x), p(1) (x), . . .) where X(0) is the original dataset and X(τ ) is obtained by blurring X(τ −1) with one mean-shift step. We can see this process as maximizing the following objective function [10] by taking parallel steps of the form (2) for each point: N p(xn ) = E(X) = n=1 1 N N N 1 e− 2 Gσ (xn , xm ) ∝ xn −xm σ 2 . (3) n,m=1 n,m=1 This process eventually converges to a dataset X(∞) where all points are coincident: a completely denoised dataset where all structure has been erased. As shown by [8], this process can be stopped early to return clusters (= locally denoised subsets of points); the number of clusters obtained is controlled by the bandwidth σ. However, here we are interested in the denoising behavior of GBMS. ˜ The GBMS step can be formulated in a matrix form reminiscent of spectral clustering [8] as X = X P where X = (x1 , . . . , xN ) is a D×N matrix of data points; W is the N ×N matrix of Gaussian N afﬁnities wnm = Gσ (xn , xm ); D = diag ( n=1 wnm ) is the degree matrix; and P = WD−1 is N an N × N stochastic matrix: pnm = p(n|xm ) ∈ (0, 1) and n=1 pnm = 1. P (or rather its transpose) is the stochastic matrix of the random walk in a graph [15], which in GBMS represents the posterior probabilities of each point under the kernel density estimate (1). P is similar to the 1 1 matrix N = D− 2 WD− 2 derived from the normalized graph Laplacian commonly used in spectral clustering, e.g. in the normalized cut [16]. Since, by the Perron-Frobenius theorem [17, ch. 8], all left eigenvalues of P(X) have magnitude less than 1 except for one that equals 1 and is associated with ˜ an eigenvector of constant entries, iterating X = X P(X) converges to the stationary distribution of each P(X), where all points coincide. ˜ From this point of view, the product X = X P(X) can be seen as ﬁltering the dataset X with a datadependent low-pass ﬁlter P(X), which makes clear the denoising behavior. This also suggests using ˜ other ﬁlters [12] X = X φ(P(X)) as long as φ(1) = 1 and |φ(r)| < 1 for r ∈ [0, 1), such as explicit schemes φ(P) = (1 − η)I + ηP for η ∈ (0, 2], power schemes φ(P) = Pn for n = 1, 2, 3 . . . or implicit schemes φ(P) = ((1 + η)I − ηP)−1 for η > 0. One important problem with GBMS is that it denoises equally in all directions. When the data lies on a low-dimensional manifold, denoising orthogonally to it removes out-of-manifold noise, but 2 denoising tangentially to it perturbs intrinsic degrees of freedom of the data and causes shrinkage of the entire manifold (most strongly near its boundary). To prevent this, the manifold blurring meanshift algorithm (MBMS) [5] ﬁrst computes a predictor averaging step with GBMS, and then for each point xn a corrector projective step removes the step direction that lies in the local tangent space of xn (obtained from local PCA run on its k nearest neighbors). In practice, both GBMS and MBMS must be stopped early to prevent excessive denoising and manifold distortions. 2 Blurring mean-shift denoising algorithms for matrix completion We consider the natural extension of GBMS to the matrix completion case by adding the constraints given by the present values. We use the subindex notation XM and XP to indicate selection of the missing or present values of the matrix XD×N , where P ⊂ U , M = U \ P and U = {(d, n): d = 1, . . . , D, n = 1, . . . , N }. The indices P and values XP of the present matrix entries are the data of the problem. Then we have the following constrained optimization problem: N Gσ (xn , xm ) max E(X) = X s.t. XP = XP . (4) n,m=1 This is similar to low-rank formulations for matrix completion that have the same constraints but use as objective function the reconstruction error with a low-rank assumption, e.g. X − ABX 2 with AD×L , BL×D and L < D. We initialize XM to the output of some other method for matrix completion, such as singular value projection (SVP; [7]). For simple constraints such as ours, gradient projection algorithms are attractive. The gradient of E wrt X is a matrix of D × N whose nth column is: ∇xn E(X) = 2 σ2 N 1 e− 2 xn −xm σ 2 N (xm − xn ) ∝ m=1 2 p(m|xn )xm p(xn ) −xn + σ2 m=1 (5) and its projection on the constraint space is given by zeroing its entries having indices in P; call ΠP this projection operator. Then, we have the following step of length α ≥ 0 along the projected gradient: (τ +1) X(τ +1) = X(τ ) + αΠP (∇X E(X(τ ) )) ⇐⇒ XM (τ ) = XM + α ΠP (∇X E(X(τ ) )) M (6) which updates only the missing entries XM . Since our search direction is ascent and makes an angle with the gradient that is bounded away from π/2, and E is lower bounded, continuously differentiable and has bounded Hessian (thus a Lipschitz continuous gradient) in RN L , by carrying out a line search that satisﬁes the Wolfe conditions, we are guaranteed convergence to a local stationary point, typically a maximizer [18, th. 3.2]. However, as reasoned later, we do not perform a line search at all, instead we ﬁx the step size to the GBMS self-adapting step size, which results in a simple and faster algorithm consisting of carrying out a GBMS step on X (i.e., X(τ +1) = X(τ ) P(X(τ ) )) and then reﬁlling XP to the present values. While we describe the algorithm in this way for ease of explanation, in practice we do not actually compute the GBMS step for all xdn values, but only for the missing ones, which is all we need. Thus, our algorithm carries out GBMS denoising steps within the missing-data subspace. We can derive this result in a different way by starting from N the unconstrained optimization problem maxXP E(X) = n,m=1 Gσ (xn , xm ) (equivalent to (4)), computing its gradient wrt XP , equating it to zero and rearranging (in the same way the mean-shift algorithm is derived) to obtain a ﬁxed-point iteration identical to our update above. Fig. 1 shows the pseudocode for our denoising-based matrix completion algorithms (using three nonparametric denoising algorithms: GBMS, MBMS and LTP). Convergence and stopping criterion As noted above, we have guaranteed convergence by simply satisfying standard line search conditions, but a line search is costly. At present we do not have (τ +1) a proof that the GBMS step size satisﬁes such conditions, or indeed that the new iterate XM increases or leaves unchanged the objective, although we have never encountered a counterexample. In fact, it turns out that none of the work about GBMS that we know about proves that either: [10] proves that ∅(X(τ +1) ) ≤ ∅(X(τ ) ) for 0 < ρ < 1, where ∅(·) is the set diameter, while [8, 12] 3 notes that P(X) has a single eigenvalue of value 1 and all others of magnitued less than 1. While this shows that all points converge to the same location, which indeed is the global maximum of (3), it does not necessarily follow that each step decreases E. GBMS (k, σ) with full or k-nn graph: given XD×N , M repeat for n = 1, . . . , N Nn ← {1, . . . , N } (full graph) or k nearest neighbors of xn (k-nn graph) Gσ (xn ,xm ) mean-shift xm ∂xn ← −xn + m∈Nn step m′ ∈Nn Gσ (xn ,xm′ ) end XM ← XM + (∂X)M move points’ missing entries until validation error increases return X However, the question of convergence as τ → ∞ has no practical interest in a denoising setting, because achieving a total denoising almost never yields a good matrix completion. What we want is to achieve just enough denoising and stop the algorithm, as was the case with GBMS clustering, and as is the case in algorithms for image denoising. We propose to determine the optimal number of iterations, as well as the bandwidth σ and any other parameters, by cross-validation. Specifically, we select a held-out set by picking a random subset of the present entries and considering them as missing; this allows us to evaluate an error between our completion for them and the ground truth. We stop iterating when this error increases. MBMS (L, k, σ) with full or k-nn graph: given XD×N , M repeat for n = 1, . . . , N Nn ← {1, . . . , N } (full graph) or k nearest neighbors of xn (k-nn graph) Gσ (xn ,xm ) mean-shift xm ∂xn ← −xn + m∈Nn step m′ ∈Nn Gσ (xn ,xm′ ) Xn ← k nearest neighbors of xn (µn , Un ) ← PCA(Xn , L) estimate L-dim tangent space at xn subtract parallel motion ∂xn ← (I − Un UT )∂xn n end XM ← XM + (∂X)M move points’ missing entries until validation error increases return X This argument justiﬁes an algorithmic, as opposed to an opLTP (L, k) with k-nn graph: given XD×N , M timization, view of denoisingrepeat based matrix completion: apfor n = 1, . . . , N ply a denoising step, reﬁll the Xn ← k nearest neighbors of xn present values, iterate until the (µn , Un ) ← PCA(Xn , L) estimate L-dim tangent space at xn validation error increases. This project point onto tangent space allows very general deﬁnitions ∂xn ← (I − Un UT )(µn − xn ) n end of denoising, and indeed a lowXM ← XM + (∂X)M move points’ missing entries rank projection is a form of deuntil validation error increases noising where points are not alreturn X lowed outside the linear manifold. Our formulation using Figure 1: Our denoising matrix completion algorithms, based on the objective function (4) is still Manifold Blurring Mean Shift (MBMS) and its particular cases useful in that it connects our Local Tangent Projection (LTP, k-nn graph, σ = ∞) and Gauss- denoising assumption with the ian Blurring Mean Shift (GBMS, L = 0); see [5] for details. Nn more usual low-rank assumption contains all N points (full graph) or only xn ’s nearest neighbors that has been used in much ma(k-nn graph). The index M selects the components of its input trix completion work, and juscorresponding to missing values. Parameters: denoising scale σ, tiﬁes the reﬁlling step as renumber of neighbors k, local dimensionality L. sulting from the present-data constraints under a gradientprojection optimization. MBMS denoising for matrix completion Following our algorithmic-based approach to denois˜ ing, we could consider generalized GBMS steps of the form X = X φ(P(X)). For clustering, Carreira-Perpi˜ an [12] found an overrelaxed explicit step φ(P) = (1 − η)I + ηP with η ≈ 1.25 to n´ achieve similar clusterings but faster. Here, we focus instead on the MBMS variant of GBMS that allows only for orthogonal, not tangential, point motions (deﬁned wrt their local tangent space as estimated by local PCA), with the goal of preserving low-dimensional manifold structure. MBMS has 3 user parameters: the bandwidth σ (for denoising), and the latent dimensionality L and the 4 number of neighbors k (for the local tangent space and the neighborhood graph). A special case of MBMS called local tangent projection (LTP) results by using a neighborhood graph and setting σ = ∞ (so only two user parameters are needed: L and k). LTP can be seen as doing a low-rank matrix completion locally. LTP was found in [5] to have nearly as good performance as the best σ in several problems. MBMS also includes as particular cases GBMS (L = 0), PCA (k = N , σ = ∞), and no denoising (σ = 0 or L = D). Note that if we apply MBMS to a dataset that lies on a linear manifold of dimensionality d using L ≥ d then no denoising occurs whatsoever because the GBMS updates lie on the d-dimensional manifold and are removed by the corrector step. In practice, even if the data are assumed noiseless, the reconstruction from a low-rank method will lie close to but not exactly on the d-dimensional manifold. However, this suggests using largish ranks for the low-rank method used to reconstruct X and lower L values in the subsequent MBMS run. In summary, this yields a matrix completion algorithm where we apply an MBMS step, reﬁll the present values, and iterate until the validation error increases. Again, in an actual implementation we compute the MBMS step only for the missing entries of X. The shrinking problem of GBMS is less pronounced in our matrix completion setting, because we constrain some values not to change. Still, in agreement with [5], we ﬁnd MBMS to be generally superior to GBMS. Computational cost With a full graph, the cost per iteration of GBMS and MBMS is O(N 2 D) and O(N 2 D + N (D + k) min(D, k)2 ), respectively. In practice with high-dimensional data, best denoising results are obtained using a neighborhood graph [5], so that the sums over points in eqs. (3) or (4) extend only to the neighbors. With a k-nearest-neighbor graph and if we do not update the neighbors at each iteration (which affects the result little), the respective cost per iteration is O(N kD) and O(N kD + N (D + k) min(D, k)2 ), thus linear in N . The graph is constructed on the initial X we use, consisting of the present values and an imputation for the missing ones achieved with a standard matrix completion method, and has a one-off cost of O(N 2 D). The cost when we have a fraction µ = |M| ∈ [0, 1] of missing data is simply the above times µ. Hence the run time ND of our mean-shift-based matrix completion algorithms is faster the more present data we have, and thus faster than the usual GBMS or MBMS case, where all data are effectively missing. 3 Experimental results We compare with representative methods of several approaches: a low-rank matrix completion method, singular value projection (SVP [7], whose performance we found similar to that of alternating least squares, ALS [3, 4]); ﬁtting a D-dimensional Gaussian model with EM and imputing the missing values of each xn as the conditional mean E {xn,Mn |xn,Pn } (we use the implementation of [19]); and the nonlinear method of [20] (nlPCA). We initialize GBMS and MBMS from some or all of these algorithms. For methods with user parameters, we set them by cross-validation in the following way: we randomly select 10% of the present entries and pretend they are missing as well, we run the algorithm on the remaining 90% of the present values, and we evaluate the reconstruction at the 10% entries we kept earlier. We repeat this over different parameters’ values and pick the one with lowest reconstruction error. We then run the algorithm with these parameters values on the entire present data and report the (test) error with the ground truth for the missing values. 100D Swissroll We created a 3D swissroll data set with 3 000 points and lifted it to 100D with a random orthonormal mapping, and added a little noise (spherical Gaussian with stdev 0.1). We selected uniformly at random 6.76% of the entries to be present. We use the Gaussian model and SVP (ﬁxed rank = 3) as initialization for our algorithm. We typically ﬁnd that these initial X are very noisy (ﬁg. 3), with some reconstructed points lying between different branches of the manifold and causing a big reconstruction error. We ﬁxed L = 2 (the known dimensionality) for MBMS and cross-validated the other parameters: σ and k for MBMS and GBMS (both using k-nn graph), and the number of iterations τ to be used. Table 1 gives the performance of MBMS and GBMS for testing, along with their optimal parameters. Fig. 3 shows the results of different methods at a few iterations. MBMS initialized from the Gaussian model gives the most remarkable denoising effect. To show that there is a wide range of σ and number of iterations τ that give good performance with GBMS and MBMS, we ﬁx k = 50 and run the algorithm with varying σ values and plot the reconstruction error for missing entries over iterations in ﬁg. 2. Both GBMS can achieve good 5 Methods Gaussian + GBMS (∞, 10, 0, 1) + MBMS (1, 20, 2, 25) SVP + GBMS (3, 50, 0, 1) + MBMS (3, 50, 2, 2) RSSE 168.1 165.8 157.2 156.8 151.4 151.8 mean 2.63 2.57 2.36 1.94 1.89 1.87 stdev 1.59 1.61 1.63 2.10 2.02 2.05 Methods nlPCA SVP + GBMS (400,140,0,1) + MBMS (500,140,9,5) Table 1: Swissroll data set: reconstruction errors obtained by different algorithms along with their optimal parameters (σ, k, L, no. iterations τ ). The three columns show the root sum of squared errors on missing entries, the mean, and the standard deviation of the pointwise reconstruction error, resp. SVP + GBMS error (RSSE) 180 170 SVP + MBMS Gaussian + GBMS 180 180 170 170 170 160 160 ∞ 160 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ stdev 42.6 39.3 37.7 34.9 Gaussian + MBMS 180 8 10 15 25 mean 26.1 21.8 18.8 17.0 Table 2: MNIST-7 data set: errors of the different algorithms and their optimal parameters (σ, k, L, no. iterations τ ). The three columns show the root sum of squared errors on missing entries (×10−4 ), the mean, and the standard deviation of pixel errors, respectively. 160 0.3 0.5 1 2 3 5 RSSE 7.77 6.99 6.54 6.03 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ Figure 2: Reconstruction error of GBMS/MBMS over iterations (each curve is a different σ value). denoising (and reconstruction), but MBMS is more robust, with good results occurring for a wide range of iterations, indicating it is able to preserve the manifold structure better. Mocap data We use the running-motion sequence 09 01 from the CMU mocap database with 148 samples (≈ 1.7 cycles) with 150 sensor readings (3D positions of 50 joints on a human body). The motion is intrinsically 1D, tracing a loop in 150D. We compare nlPCA, SVP, the Gaussian model, and MBMS initialized from the ﬁrst three algorithms. For nlPCA, we do a grid search for the weight decay coefﬁcient while ﬁxing its structure to be 2 × 10 × 150 units, and use an early stopping criterion. For SVP, we do grid search on {1, 2, 3, 5, 7, 10} for the rank. For MBMS (L = 1) and GBMS (L = 0), we do grid search for σ and k. We report the reconstruction error as a function of the proportion of missing entries from 50% to 95%. For each missing-data proportion, we randomly select 5 different sets of present values and run all algorithms for them. Fig. 4 gives the mean errors of all algorithms. All methods perform well when missing-data proportion is small. nlPCA, being prone to local optima, is less stable than SVP and the Gaussian model, especially when the missing-data proportion is large. The Gaussian model gives the best and most stable initialization. At 95%, all methods fail to give an acceptable reconstruction, but up to 90% missing entries, MBMS and GBMS always beat the other algorithms. Fig. 4 shows selected reconstructions from all algorithms. MNIST digit ‘7’ The MNIST digit ‘7’ data set contains 6 265 greyscale (0–255) images of size 28 × 28. We create missing entries in a way reminiscent of run-length errors in transmission. We generate 16 to 26 rectangular boxes of an area approximately 25 pixels at random locations in each image and use them to black out pixels. In this way, we create a high dimensional data set (784 dimensions) with about 50% entries missing on average. Because of the loss of spatial correlations within the blocks, this missing data pattern is harder than random. The Gaussian model cannot handle such a big data set because it involves inverting large covariance matrices. nlPCA is also very slow and we cannot afford cross-validating its structure or the weight decay coefﬁcient, so we picked a reasonable structure (10 × 30 × 784 units), used the default weight decay parameter in the code (10−3 ), and allowed up to 500 iterations. We only use SVP as initialization for our algorithm. Since the intrinsic dimension of MNIST is suspected to be not very high, 6 SVP τ =0 SVP + GBMS τ =1 SVP + MBMS τ =2 Gaussian τ =0 Gaussian + GBMS τ =1 Gaussian + MBMS τ = 25 20 20 20 20 20 20 15 15 15 15 15 15 10 10 10 10 10 10 5 5 5 5 5 5 0 0 0 0 0 0 −5 −5 −5 −5 −5 −5 −10 −10 −15 −15 −10 −5 0 5 10 15 20 −10 −15 −15 −10 −5 0 5 10 15 20 −15 −15 −10 −10 −5 0 5 10 15 20 −15 −15 −10 −10 −5 0 5 10 15 20 −15 −15 −10 −10 −5 0 5 10 15 20 −15 −15 −10 −5 0 5 10 15 20 Figure 3: Denoising effect of the different algorithms. For visualization, we project the 100D data to 3D with the projection matrix used for creating the data. Present values are reﬁlled for all plots. 7000 6000 error 5000 4000 frame 2 (leg distance) frame 10 (foot pose) frame 147 (leg pose) nlPCA nlPCA + GBMS nlPCA + MBMS SVP SVP + GBMS SVP + MBMS Gaussian Gaussian + GBMS Gaussian + MBMS 3000 2000 1000 0 50 60 70 80 85 90 95 % of missing data Figure 4: Left: mean of errors (RSSE) of 5 runs obtained by different algorithms for varying percentage of missing values. Errorbars shown only for Gaussian + MBMS to avoid clutter. Right: sample reconstructions when 85% percent data is missing. Row 1: initialization. Row 2: init+GBMS. Row 3: init+MBMS. Color indicates different initialization: black, original data; red, nlPCA; blue, SVP; green, Gaussian. we used rank 10 for SVP and L = 9 for MBMS. We also use the same k = 140 as in [5]. So we only had to choose σ and the number of iterations via cross-validation. Table 2 shows the methods and their corresponding error. Fig. 5 shows some representative reconstructions from different algorithms, with present values reﬁlled. The mean-shift averaging among closeby neighbors (a soft form of majority voting) helps to eliminate noise, unusual strokes and other artifacts created by SVP, which by their nature tend to occur in different image locations over the neighborhood of images. 4 Related work Matrix completion is widely studied in theoretical compressed sensing [1, 2] as well as practical recommender systems [3, 4]. Most matrix completion models rely on a low-rank assumption, and cannot fully exploit a more complex structure of the problem, such as curved manifolds. Related work is on multi-task learning in a broad sense, which extracts the common structure shared by multiple related objects and achieves simultaneous learning on them. This includes applications such as alignment of noise-corrupted images [21], recovery of images with occlusion [22], and even learning of multiple related regressors or classiﬁers [23]. Again, all these works are essentially based on a subspace assumption, and do not generalize to more complex situations. A line of work based on a nonlinear low-rank assumption (with a latent variable z of dimensionN 2 ality L < D) involves setting up a least-squares error function minf ,Z n=1 xn − f (zn ) = N,D 2 n,d=1 (xdn − fd (zn )) where one ignores the terms for which xdn is missing, and estimates the function f and the low-dimensional data projections Z by alternating optimization. Linear functions f have been used in the homogeneity analysis literature [24], where this approach is called “missing data deleted”. Nonlinear functions f have been used recently (neural nets [20]; Gaussian processes for collaborative ﬁltering [25]). Better results are obtained if adding a projection term N 2 and optimizing over the missing data as well [26]. n=1 zn − F(xn ) 7 Orig Missing nlPCA SVP GBMS MBMS Orig Missing nlPCA SVP GBMS MBMS Figure 5: Selected reconstructions of MNIST block-occluded digits ‘7’ with different methods. Prior to our denoising-based work there have been efforts to extend the low-rank models to smooth manifolds, mostly in the context of compressed sensing. Baraniuk and Wakin [27] show that certain random measurements, e.g. random projection to a low-dimensional subspace, can preserve the metric of the manifold fairly well, if the intrinsic dimension and the curvature of the manifold are both small enough. However, these observations are not suitable for matrix completion and no algorithm is given for recovering the signal. Chen et al. [28] explicitly model a pre-determined manifold, and use this to regularize the signal when recovering the missing values. They estimate the manifold given complete data, while no complete data is assumed in our matrix completion setting. Another related work is [29], where the manifold modeled with Isomap is used in estimating the positions of satellite cameras in an iterative manner. Finally, our expectation that the value of a missing entry can be predicted from the values of neighboring points is similar to one category of collaborative ﬁltering methods that essentially use similar users/items to predict missing values [3, 4]. 5 Conclusion We have proposed a new paradigm for matrix completion, denoising, which generalizes the commonly used assumption of low rank. Assuming low-rank implies a restrictive form of denoising where the data is forced to have zero variance away from a linear manifold. More general definitions of denoising can potentially handle data that lives in a low-dimensional manifold that is nonlinear, or whose dimensionality varies (e.g. a set of manifolds), or that does not have low rank at all, and naturally they handle noise in the data. Denoising works because of the fundamental fact that a missing value can be predicted by averaging nearby present values. Although we motivate our framework from a constrained optimization point of view (denoise subject to respecting the present data), we argue for an algorithmic view of denoising-based matrix completion: apply a denoising step, reﬁll the present values, iterate until the validation error increases. In turn, this allows different forms of denoising, such as based on low-rank projection (earlier work) or local averaging with blurring mean-shift (this paper). Our nonparametric choice of mean-shift averaging further relaxes assumptions about the data and results in a simple algorithm with very few user parameters that afford user control (denoising scale, local dimensionality) but can be set automatically by cross-validation. Our algorithms are intended to be used as a postprocessing step over a user-provided initialization of the missing values, and we show they consistently improve upon existing algorithms. The MBMS-based algorithm bridges the gap between pure denoising (GBMS) and local low rank. Other deﬁnitions of denoising should be possible, for example using temporal as well as spatial neighborhoods, and even applicable to discrete data if we consider denoising as a majority voting among the neighbours of a vector (with suitable deﬁnitions of votes and neighborhood). Acknowledgments Work supported by NSF CAREER award IIS–0754089. 8 References [1] Emmanuel J. Cand` s and Benjamin Recht. Exact matrix completion via convex optimization. Foundations e of Computational Mathematics, 9(6):717–772, December 2009. [2] Emmanuel J. Cand` s and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. e IEEE Trans. Information Theory, 56(5):2053–2080, April 2010. [3] Yehuda Koren. Factorization meets the neighborhood: A multifaceted collaborative ﬁltering model. SIGKDD 2008, pages 426–434, Las Vegas, NV, August 24–27 2008. [4] Robert Bell and Yehuda Koren. Scalable collaborative ﬁltering with jointly derived neighborhood interpolation weights. ICDM 2007, pages 43–52, October 28–31 2007. ´ [5] Weiran Wang and Miguel A. Carreira-Perpi˜ an. Manifold blurring mean shift algorithms for manifold n´ denoising. CVPR 2010, pages 1759–1766, San Francisco, CA, June 13–18 2010. [6] Matthias Hein and Markus Maier. Manifold denoising. NIPS 2006, 19:561–568. MIT Press, 2007. [7] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Guaranteed rank minimization via singular value projection. NIPS 2010, 23:937–945. MIT Press, 2011. ´ [8] Miguel A. Carreira-Perpi˜ an. Fast nonparametric clustering with Gaussian blurring mean-shift. ICML n´ 2006, pages 153–160. Pittsburgh, PA, June 25–29 2006. [9] Keinosuke Fukunaga and Larry D. Hostetler. The estimation of the gradient of a density function, with application in pattern recognition. IEEE Trans. Information Theory, 21(1):32–40, January 1975. [10] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. PAMI, 17(8):790–799, 1995. [11] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. PAMI, 24(5):603–619, May 2002. ´ [12] Miguel A. Carreira-Perpi˜ an. Generalised blurring mean-shift algorithms for nonparametric clustering. n´ CVPR 2008, Anchorage, AK, June 23–28 2008. [13] Gabriel Taubin. A signal processing approach to fair surface design. SIGGRAPH 1995, pages 351–358. [14] Mathieu Desbrun, Mark Meyer, Peter Schr¨ der, and Alan H. Barr. Implicit fairing of irregular meshes o using diffusion and curvature ﬂow. SIGGRAPH 1999, pages 317–324. [15] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, RI, 1997. [16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888– 905, August 2000. [17] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1986. [18] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer-Verlag, New York, second edition, 2006. [19] Tapio Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5):853–871, March 2001. [20] Matthias Scholz, Fatma Kaplan, Charles L. Guy, Joachim Kopka, and Joachim Selbig. Non-linear PCA: A missing data approach. Bioinformatics, 21(20):3887–3895, October 15 2005. [21] Yigang Peng, Arvind Ganesh, John Wright, Wenli Xu, and Yi Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. CVPR 2010, pages 763–770, 2010. [22] A. M. Buchanan and A. W. Fitzgibbon. Damped Newton algorithms for matrix factorization with missing data. CVPR 2005, pages 316–322, San Diego, CA, June 20–25 2005. [23] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. NIPS 2006, 19:41–48. MIT Press, 2007. [24] Albert Giﬁ. Nonlinear Multivariate Analysis. John Wiley & Sons, 1990. [25] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. ICML 2009, Montreal, Canada, June 14–18 2009. ´ [26] Miguel A. Carreira-Perpi˜ an and Zhengdong Lu. Manifold learning and missing data recovery through n´ unsupervised regression. ICDM 2011, December 11–14 2011. [27] Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51–77, February 2009. [28] Minhua Chen, Jorge Silva, John Paisley, Chunping Wang, David Dunson, and Lawrence Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Trans. Signal Processing, 58(12):6140–6155, December 2010. [29] Michael B. Wakin. A manifold lifting algorithm for multi-view compressive imaging. In Proc. 27th Conference on Picture Coding Symposium (PCS’09), pages 381–384, 2009. 9

6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression

Author: Ali Tofigh, Erik Sj̦lund, Mattias H̦glund, Jens Lagergren

Abstract: Cancer has complex patterns of progression that include converging as well as diverging progressional pathways. Vogelstein’s path model of colon cancer was a pioneering contribution to cancer research. Since then, several attempts have been made at obtaining mathematical models of cancer progression, devising learning algorithms, and applying these to cross-sectional data. Beerenwinkel et al. provided, what they coined, EM-like algorithms for Oncogenetic Trees (OTs) and mixtures of such. Given the small size of current and future data sets, it is important to minimize the number of parameters of a model. For this reason, we too focus on tree-based models and introduce Hidden-variable Oncogenetic Trees (HOTs). In contrast to OTs, HOTs allow for errors in the data and thereby provide more realistic modeling. We also design global structural EM algorithms for learning HOTs and mixtures of HOTs (HOT-mixtures). The algorithms are global in the sense that, during the M-step, they ﬁnd a structure that yields a global maximum of the expected complete log-likelihood rather than merely one that improves it. The algorithm for single HOTs performs very well on reasonable-sized data sets, while that for HOT-mixtures requires data sets of sizes obtainable only with tomorrow’s more cost-efﬁcient technologies. 1

7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

Author: Matthew A. Kayala, Pierre F. Baldi

Abstract: Being able to predict the course of arbitrary chemical reactions is essential to the theory and applications of organic chemistry. Previous approaches are not highthroughput, are not generalizable or scalable, or lack sufﬁcient data to be effective. We describe single mechanistic reactions as concerted electron movements from an electron orbital source to an electron orbital sink. We use an existing rule-based expert system to derive a dataset consisting of 2,989 productive mechanistic steps and 6.14 million non-productive mechanistic steps. We then pose identifying productive mechanistic steps as a ranking problem: rank potential orbital interactions such that the top ranked interactions yield the major products. The machine learning implementation follows a two-stage approach, in which we ﬁrst train atom level reactivity ﬁlters to prune 94.0% of non-productive reactions with less than a 0.1% false negative rate. Then, we train an ensemble of ranking models on pairs of interacting orbitals to learn a relative productivity function over single mechanistic reactions in a given system. Without the use of explicit transformation patterns, the ensemble perfectly ranks the productive mechanisms at the top 89.1% of the time, rising to 99.9% of the time when top ranked lists with at most four nonproductive reactions are considered. The ﬁnal system allows multi-step reaction prediction. Furthermore, it is generalizable, making reasonable predictions over reactants and conditions which the rule-based expert system does not handle.

8 nips-2011-A Model for Temporal Dependencies in Event Streams

Author: Asela Gunawardana, Christopher Meek, Puyang Xu

Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efﬁciently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. 1

9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection

Author: Miles Lopes, Laurent Jacob, Martin J. Wainwright

Abstract: We consider the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing for the data dimension p to exceed the sample size n. Our contribution is a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T 2 statistic. Working within a high-dimensional framework that allows (p, n) → ∞, we ﬁrst derive an asymptotic power function for our test, and then provide sufﬁcient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from simulated data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure with comparisons on a high-dimensional gene expression dataset involving the discrimination of different types of cancer. 1

10 nips-2011-A Non-Parametric Approach to Dynamic Programming

Author: Oliver B. Kroemer, Jan R. Peters

Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a uniﬁed view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Diﬀerence learning, Kernelized Temporal Diﬀerence learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1

11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation

Author: Mehdi Keramati, Boris S. Gutkin

Abstract: Reinforcement learning models address animal’s behavioral adaptation to its changing “external” environment, and are based on the assumption that Pavlovian, habitual and goal-directed responses seek to maximize reward acquisition. Negative-feedback models of homeostatic regulation, on the other hand, are concerned with behavioral adaptation in response to the “internal” state of the animal, and assume that animals’ behavioral objective is to minimize deviations of some key physiological variables from their hypothetical setpoints. Building upon the drive-reduction theory of reward, we propose a new analytical framework that integrates learning and regulatory systems, such that the two seemingly unrelated objectives of reward maximization and physiological-stability prove to be identical. The proposed theory shows behavioral adaptation to both internal and external states in a disciplined way. We further show that the proposed framework allows for a uniﬁed explanation of some behavioral pattern like motivational sensitivity of different associative learning mechanism, anticipatory responses, interaction among competing motivational systems, and risk aversion.

12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation

Author: Qian Sun, Rita Chattopadhyay, Sethuraman Panchanathan, Jieping Ye

Abstract: Discriminative learning when training and test data belong to different distributions is a challenging and complex task. Often times we have very few or no labeled data from the test or target distribution but may have plenty of labeled data from multiple related sources with different distributions. The difference in distributions may be both in marginal and conditional probabilities. Most of the existing domain adaptation work focuses on the marginal probability distribution difference between the domains, assuming that the conditional probabilities are similar. However in many real world applications, conditional probability distribution differences are as commonplace as marginal probability differences. In this paper we propose a two-stage domain adaptation methodology which combines weighted data from multiple sources based on marginal probability differences (ﬁrst stage) as well as conditional probability differences (second stage), with the target domain data. The weights for minimizing the marginal probability differences are estimated independently, while the weights for minimizing conditional probability differences are computed simultaneously by exploiting the potential interaction among multiple sources. We also provide a theoretical analysis on the generalization performance of the proposed multi-source domain adaptation formulation using the weighted Rademacher complexity measure. Empirical comparisons with existing state-of-the-art domain adaptation methods using three real-world datasets demonstrate the effectiveness of the proposed approach. 1

13 nips-2011-A blind sparse deconvolution method for neural spike identification

Author: Chaitanya Ekanadham, Daniel Tranchina, Eero P. Simoncelli

Abstract: We consider the problem of estimating neural spikes from extracellular voltage recordings. Most current methods are based on clustering, which requires substantial human supervision and systematically mishandles temporally overlapping spikes. We formulate the problem as one of statistical inference, in which the recorded voltage is a noisy sum of the spike trains of each neuron convolved with its associated spike waveform. Joint maximum-a-posteriori (MAP) estimation of the waveforms and spikes is then a blind deconvolution problem in which the coefﬁcients are sparse. We develop a block-coordinate descent procedure to approximate the MAP solution, based on our recently developed continuous basis pursuit method. We validate our method on simulated data as well as real data for which ground truth is available via simultaneous intracellular recordings. In both cases, our method substantially reduces the number of missed spikes and false positives when compared to a standard clustering algorithm, primarily by recovering overlapping spikes. The method offers a fully automated alternative to clustering methods that is less susceptible to systematic errors. 1

14 nips-2011-A concave regularization technique for sparse mixture models

Author: Martin O. Larsson, Johan Ugander

Abstract: Latent variable mixture models are a powerful tool for exploring the structure in large datasets. A common challenge for interpreting such models is a desire to impose sparsity, the natural assumption that each data point only contains few latent features. Since mixture distributions are constrained in their L1 norm, typical sparsity techniques based on L1 regularization become toothless, and concave regularization becomes necessary. Unfortunately concave regularization typically results in EM algorithms that must perform problematic non-concave M-step maximizations. In this work, we introduce a technique for circumventing this difﬁculty, using the so-called Mountain Pass Theorem to provide easily veriﬁable conditions under which the M-step is well-behaved despite the lacking concavity. We also develop a correspondence between logarithmic regularization and what we term the pseudo-Dirichlet distribution, a generalization of the ordinary Dirichlet distribution well-suited for inducing sparsity. We demonstrate our approach on a text corpus, inferring a sparse topic mixture model for 2,406 weblogs. 1

15 nips-2011-A rational model of causal inference with continuous causes

Author: Thomas L. Griffiths, Michael James

Abstract: Rational models of causal induction have been successful in accounting for people’s judgments about causal relationships. However, these models have focused on explaining inferences from discrete data of the kind that can be summarized in a 2× 2 contingency table. This severely limits the scope of these models, since the world often provides non-binary data. We develop a new rational model of causal induction using continuous dimensions, which aims to diminish the gap between empirical and theoretical approaches and real-world causal induction. This model successfully predicts human judgments from previous studies better than models of discrete causal inference, and outperforms several other plausible models of causal induction with continuous causes in accounting for people’s inferences in a new experiment. 1

16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artiﬁcial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman

Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a ﬂat histogram sampling technique by signiﬁcantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efﬁcient weight learning in Markov Logic theories. 1

18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an ˆ agent following the greedy policy π with respect to an action-value function Q, the ˆ ˆ ˆ performance loss E V ∗ (X) − V π (X) is upper bounded by O( Q − Q∗ 1+ζ ), ∞ in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

19 nips-2011-Active Classification based on Value of Classifier

Author: Tianshi Gao, Daphne Koller

Abstract: Modern classiﬁcation tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classiﬁers, which are then applied at test time and then pieced together in a ﬁxed procedure determined in advance or at training time. We present an active classiﬁcation process at the test time, where each classiﬁer in a large ensemble is viewed as a potential observation that might inform our classiﬁcation process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classiﬁcation gain from each observation as well as its computational cost. The expected classiﬁcation gain is computed using a probabilistic model that uses the outcome from previous observations. This active classiﬁcation process is applied at test time for each individual test instance, resulting in an efﬁcient instance-speciﬁc decision path. We demonstrate the beneﬁt of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classiﬁcation accuracy at a fraction of the computational costs of traditional methods.

20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity

Author: Nir Ailon

Abstract: Given a set V of n elements we wish to linearly order them using pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(n poly(log n, ε−1 )) preference labels for a regret of ε times the optimal loss. This is strictly better, and often signiﬁcantly better than what non-adaptive sampling could achieve. Our main result helps settle an open problem posed by learning-to-rank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? 1

21 nips-2011-Active Learning with a Drifting Distribution

22 nips-2011-Active Ranking using Pairwise Comparisons

23 nips-2011-Active dendrites: adaptation to spike-based communication

24 nips-2011-Active learning of neural response functions with Gaussian processes

25 nips-2011-Adaptive Hedge

26 nips-2011-Additive Gaussian Processes

27 nips-2011-Advice Refinement in Knowledge-Based SVMs

28 nips-2011-Agnostic Selective Classification

29 nips-2011-Algorithms and hardness results for parallel large margin learning

30 nips-2011-Algorithms for Hyper-Parameter Optimization

31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding

32 nips-2011-An Empirical Evaluation of Thompson Sampling