nips nips2009 nips2009-75 knowledge-graph by maker-knowledge-mining

75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models


Source: pdf

Author: Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, Gideon S. Mann

Abstract: Training conditional maximum entropy models on massive data sets requires significant computational resources. We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique. We also report the results of large-scale experiments comparing these three methods which demonstrate the benefits of the mixture weight method: this method consumes less resources, while achieving a performance comparable to that of standard approaches.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Training conditional maximum entropy models on massive data sets requires significant computational resources. [sent-9, score-0.214]

2 We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. [sent-10, score-1.215]

3 We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique. [sent-11, score-0.758]

4 We also report the results of large-scale experiments comparing these three methods which demonstrate the benefits of the mixture weight method: this method consumes less resources, while achieving a performance comparable to that of standard approaches. [sent-12, score-0.442]

5 These models are based on the maximum entropy principle of Jaynes [11], which consists of selecting, among the models approximately consistent with the constraints, the one with the greatest entropy. [sent-14, score-0.129]

6 They benefit from a theoretical foundation similar to that of standard maxent probabilistic models used for density estimation [8]. [sent-15, score-0.225]

7 In particular, a duality theorem for conditional maxent models shows that these models belong to the exponential family. [sent-16, score-0.337]

8 While the theoretical foundation of conditional maxent models makes them attractive, the computational cost of their optimization problem is often prohibitive for data sets of several million points. [sent-18, score-0.31]

9 A number of algorithms have been described for batch training of conditional maxent models using a single processor. [sent-19, score-0.416]

10 These include generalized iterative scaling [7], improved iterative scaling [8], gradient descent, conjugate gradient methods, and second-order methods [15, 18]. [sent-20, score-0.246]

11 This paper examines distributed methods for training conditional maxent models that can scale to very large samples of up to 1B instances. [sent-21, score-0.523]

12 Methods such as that of [5] or stochastic gradient descent [21] can benefit from parallelization, but we concentrate here on batch distributed methods. [sent-23, score-0.292]

13 We examine three common distributed training methods: a distributed gradient computation method [4], a majority vote method, and a mixture weight method. [sent-24, score-1.159]

14 We analyze and compare the CPU and network time complexity of each of these methods (Section 2) and present a theoretical analysis of conditional maxent models (Section 3), including a study of the convergence of the mixture weight method, the most resource-efficient technique. [sent-25, score-0.758]

15 2 Distributed Training of Conditional Maxent Models In this section, we first briefly describe the optimization problem for conditional maximum entropy models, then discuss three common methods for distributed training of these models and compare their CPU and network time complexity. [sent-27, score-0.54]

16 Let S = ((x1, y1), . . . , (xm, ym)) be a training sample of m pairs in X × Y. [sent-34, score-0.177]

17 This paper will focus on conditional maximum entropy models with L2 regularization. [sent-39, score-0.214]

18 Given the weight vector w, the output y predicted by the model for an input x is: y = argmax_{y∈Y} pw[y|x] = argmax_{y∈Y} w · Φ(x, y). [sent-41, score-0.341]
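
The following is a minimal NumPy sketch of this conditional model and its prediction rule, assuming a user-supplied feature map phi(x, y) that returns a dense vector; the names w, phi, and labels are illustrative, not from the paper.

```python
import numpy as np

def cond_probs(w, phi, x, labels):
    """p_w[y|x] proportional to exp(w . Phi(x, y)), normalized over the label set."""
    scores = np.array([w @ phi(x, y) for y in labels])
    scores -= scores.max()          # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def predict(w, phi, x, labels):
    """Prediction rule above: y = argmax over Y of w . Phi(x, y)."""
    scores = [w @ phi(x, y) for y in labels]
    return labels[int(np.argmax(scores))]
```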

19 Standard training methods such as iterative scaling, gradient descent, conjugate gradient, and limited-memory quasi-Newton all have the general form of Figure 1, where the update function Γ : H → H applied to the gradient ∇FS(w) depends on the optimization method selected. [sent-43, score-0.35]
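
Figure 1 itself is not reproduced in this extraction, so the following is a hedged sketch of one instance of the general scheme: plain gradient descent with update Γ(g) = −η g, assuming the L2-regularized objective FS(w) = λ‖w‖² + (1/m) Σ_i Lzi(w) suggested by the bounds of Section 3. It reuses cond_probs from the sketch above.

```python
import numpy as np

def train(sample, phi, labels, dim, lam=1e-3, eta=0.1, T=100):
    """Batch training in the generic form of Figure 1, instantiated with
    plain gradient descent: w <- w - eta * grad F_S(w)."""
    w = np.zeros(dim)
    m = len(sample)
    for _ in range(T):
        grad = 2.0 * lam * w                     # gradient of lam * ||w||^2
        for x, y in sample:
            p = cond_probs(w, phi, x, labels)
            # gradient of the log-loss: E_{y'~p_w[.|x]}[Phi(x, y')] - Phi(x, y)
            exp_phi = sum(pj * phi(x, yj) for pj, yj in zip(p, labels))
            grad += (exp_phi - phi(x, y)) / m
        w -= eta * grad                          # update step Gamma
    return w
```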

20 The gradient computation in step 3 of Figure 1 can be distributed across p machines. [sent-50, score-0.261]

21 Let S = (S1, . . . , Sp) be a sample of pm points formed by p subsamples of m points. (Footnote: A batch parallel estimation technique for maxent models based on their connection with AdaBoost is also described by [5].) [sent-54, score-0.414]

22 This algorithm is quite different from the distributed gradient computation method, but, as for that method, it requires a substantial amount of network resources, since updates need to be transferred to the master at every iteration. [sent-55, score-0.423]

23 These separate gradients are then summed up to compute the exact global gradient on a single machine, which also performs the optimization step and updates the weight vector received by all other machines (Figure 2). [sent-64, score-0.393]
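
A sketch of this distributed gradient scheme; the p machines are simulated here by a sequential loop over shards, whereas in a real deployment each shard gradient would be computed on a separate worker and only gradients and weight vectors would cross the network. All names are illustrative.

```python
import numpy as np

def train_distributed_gradient(shards, phi, labels, dim, lam=1e-3, eta=0.1, T=100):
    """Distributed gradient method: workers compute exact shard gradients,
    the master sums them, takes the update step, and broadcasts w."""
    w = np.zeros(dim)
    pm = sum(len(s) for s in shards)             # total number of points
    for _ in range(T):
        grad = 2.0 * lam * w
        for shard in shards:                     # would run on p workers in parallel
            g = np.zeros(dim)
            for x, y in shard:
                p = cond_probs(w, phi, x, labels)
                exp_phi = sum(pj * phi(x, yj) for pj, yj in zip(p, labels))
                g += exp_phi - phi(x, y)
            grad += g / pm                       # master sums the shard gradients
        w -= eta * grad                          # master updates and broadcasts w
    return w
```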

24 3 Majority Vote Method The ensemble methods described in the next two paragraphs are based on mixture weights µ ∈ Rp . [sent-70, score-0.17]

25 In the absence of any prior knowledge, µ is chosen to be the uniform mixture µ0 = (1/p, . . . , 1/p). [sent-72, score-0.17]

26 Instead of computing the gradient of the global function in parallel, a (weighted) majority vote method can be used. [sent-76, score-0.473]

27 Each machine receives one subsample Sk , k ∈ [1, p], and computes wk = argminw∈H FSk (w) by applying the standard training of Figure 1 to Sk . [sent-77, score-0.235]
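
A sketch of this per-shard training step, reusing the single-machine train sketch from above; only the final weight vectors wk ever need to leave their machines.

```python
def train_per_shard(shards, phi, labels, dim, **kwargs):
    """Train one independent maxent model per subsample S_k (Figure 1 run on
    each machine); returns the list of weight vectors w_1, ..., w_p."""
    return [train(shard, phi, labels, dim, **kwargs) for shard in shards]
```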

28 The output y predicted by the majority vote method for an input x is: y = argmax_{y∈Y} Σ_{k=1}^p µk I(argmax_{y′∈Y} pwk[y′|x] = y), (3) where I is an indicator function of the predicate it takes as argument. [sent-78, score-0.426]

29 Alternatively, the conditional class probabilities could be used to take into account the uncertainty of each classifier: y = argmax_{y∈Y} Σ_{k=1}^p µk pwk[y|x]. [sent-79, score-0.122]
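
A sketch of both voting rules, the hard vote of Eq. (3) and the soft variant just above, applied to the per-shard weight vectors; ws and mu are illustrative names.

```python
import numpy as np

def majority_vote_predict(ws, mu, phi, x, labels):
    """Weighted majority vote, Eq. (3): model k casts a vote of weight mu_k
    for its own argmax label."""
    votes = np.zeros(len(labels))
    for w, mu_k in zip(ws, mu):
        best = int(np.argmax([w @ phi(x, y) for y in labels]))
        votes[best] += mu_k
    return labels[int(np.argmax(votes))]

def soft_vote_predict(ws, mu, phi, x, labels):
    """Soft variant: y = argmax_y of sum_k mu_k * p_{w_k}[y|x]."""
    p = sum(mu_k * cond_probs(w, phi, x, labels) for w, mu_k in zip(ws, mu))
    return labels[int(np.argmax(p))]
```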

30 4 Mixture Weight Method The cost of storing p weight vectors can make the majority vote method unappealing. [sent-81, score-0.515]

31 Instead, a single mixture weight wµ can be defined from the weight vectors wk, k ∈ [1, p]: wµ = Σ_{k=1}^p µk wk. (4) [sent-82, score-0.714]

32 The mixture weight wµ can be used directly for classification. [sent-83, score-0.335]
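
A direct sketch of Eq. (4); the returned vector is a single model that plugs into the predict sketch above, so prediction costs Ocpu(N) rather than Ocpu(pN).

```python
import numpy as np

def mixture_weight(ws, mu=None):
    """Mixture weight, Eq. (4): w_mu = sum_k mu_k * w_k; mu=None selects the
    uniform mixture mu_0 = (1/p, ..., 1/p)."""
    ws = np.asarray(ws)
    if mu is None:
        mu = np.full(len(ws), 1.0 / len(ws))
    return np.asarray(mu) @ ws
```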

33 5 Comparison of CPU and Network Times This section compares the CPU and network time complexity of the three training methods just described. [sent-85, score-0.188]

34 User CPU represents the CPU time experienced by the user, cumulative CPU the total amount of CPU time for the machines participating in the computation, and latency the runtime effects experienced due to network activity. [sent-88, score-0.35]

35 The cumulative network usage is the amount of data transferred across the network during a distributed computation. [sent-89, score-0.629]

36 For a training sample of pm points, both the user and cumulative CPU times are in Ocpu (T pmN ) when training on a single machine (Figure 1) since at each of the T iterations, the gradient computation must iterate over all pm training points and update all the components of w. [sent-90, score-0.75]

37 For the distributed gradient method (Section 2.2), the worst-case user CPU of the gradient and parameter update computations (lines 3-4 of Figure 2) is Ocpu(mN + pN + N), since each parallel gradient calculation takes mN to compute the gradient for m instances, p gradients of size N need to be summed, and the parameters updated. [sent-95, score-0.471]

38 In terms of network usage, a distributed gradient strategy will incur a cost of Onet(pNT) and a latency proportional to Olat(NT), since at each iteration w must be transmitted to each of the p machines (in parallel) and each ∇FSk(w) returned to the master. [sent-99, score-0.61]

39 The exact runtime cost of latency is complicated as it depends on factors such as the physical distance between the master and each machine, connectivity, the switch fabric in the network, and CPU costs required to manage messages. [sent-101, score-0.145]

40 For parallelization on massively multi-core machines [4], communication latency might be negligible. [sent-102, score-0.192]

41 However, in large data centers running commodity machines, a more common case, network latency cost can be significant. [sent-103, score-0.26]

42 The training times are identical for the majority vote and mixture weight techniques. [sent-104, score-0.731]

43 Let Tk be the number of iterations for training the kth mixture component wk and let Tmax = max{T1, . . . , Tp}. [sent-105, score-0.352]

44 Then, the user CPU usage of training is in Ocpu (mN Tmax ), similar to that of the distributed gradient method. [sent-109, score-0.579]

45 A crucial advantage of these methods over the distributed gradient method is that their network usage is significantly lower. [sent-111, score-0.835]

46 While the distributed gradient method exchanges parameters and gradients at each iteration, the majority vote and mixture weight techniques only require the final weight vectors to be transferred at the conclusion of training. [sent-112, score-0.9]

47 Thus, the overall network usage is Onet(pN) with a latency in Olat(N). [sent-113, score-0.401]

48 The main difference between the majority vote and mixture weight methods is the user CPU (and memory usage) for prediction, which is in Ocpu(pN) for the majority vote method versus Ocpu(N) for the mixture weight method. [sent-114, score-1.063]
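
The complexity claims of this section, collected into one table for reference. This is a restatement of the statements above, not a reproduction of the paper's Table 1 (which is not included in this extraction); the Ocpu(N) prediction cost for the first two methods is implied by their single final weight vector.

```latex
% Complexity summary, as stated in the surrounding text.
\begin{tabular}{lcccc}
\hline
Method & Training CPU & Prediction CPU & Network usage & Latency \\
\hline
Single machine       & $O_{\mathrm{cpu}}(TpmN)$           & $O_{\mathrm{cpu}}(N)$  & ---                     & --- \\
Distributed gradient & $O_{\mathrm{cpu}}((mN + pN + N)T)$ & $O_{\mathrm{cpu}}(N)$  & $O_{\mathrm{net}}(pNT)$ & $O_{\mathrm{lat}}(NT)$ \\
Majority vote        & $O_{\mathrm{cpu}}(mNT_{\max})$     & $O_{\mathrm{cpu}}(pN)$ & $O_{\mathrm{net}}(pN)$  & $O_{\mathrm{lat}}(N)$ \\
Mixture weight       & $O_{\mathrm{cpu}}(mNT_{\max})$     & $O_{\mathrm{cpu}}(N)$  & $O_{\mathrm{net}}(pN)$  & $O_{\mathrm{lat}}(N)$ \\
\hline
\end{tabular}
```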

49 Prediction could be distributed over p machines for the majority vote method, but that would incur additional machine and network bandwidth costs. [sent-115, score-0.647]

50 3 Theoretical Analysis This section presents a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique, as suggested in the previous section. [sent-116, score-0.645]

51 The results we obtain are quite general and include the proof of several fundamental properties of the weight vector w obtained when training a conditional maxent model. [sent-117, score-0.55]

52 We then give a convergence bound for w as a function of the sample size in terms of the norm of the feature space and also show a similar result for the mixture weight wµ . [sent-119, score-0.394]

53 These results are used to compare the weight vector wpm obtained by training on a sample of size pm with the mixture weight vector wµ . [sent-120, score-0.807]

54 Let S = (z1, . . . , zm−1, zm) and S′ = (z1, . . . , zm−1, z′m) be two samples with elements in X × Y that differ by a single training point, which we arbitrarily set as the last one of each sample: zm = (xm, ym) and z′m = (x′m, y′m). [sent-127, score-0.419]

55 Let w denote the parameter vector returned by conditional maximum entropy when trained on sample S, w′ the vector returned when trained on S′, and let ∆w denote w′ − w. [sent-128, score-0.331]

56 Then, the following stability bound holds for the weight vector returned by a conditional maxent model: ‖∆w‖ ≤ 2R/(λm). (7) [sent-136, score-0.609]

57 The gradient of w ↦ Lzm(w) = log Σ_{y′∈Y} e^{w·Φ(xm,y′)} − w · Φ(xm, ym) is given by ∇Lzm(w) = Σ_{y∈Y} [e^{w·Φ(xm,y)} / Σ_{y′∈Y} e^{w·Φ(xm,y′)}] Φ(xm, y) − Φ(xm, ym) = E_{y∼pw[·|xm]}[Φ(xm, y)] − Φ(xm, ym). [sent-147, score-0.342]

58 Let w ∈ H be the weight vector returned by conditional maximum entropy when trained on a sample S of size m. [sent-153, score-0.452]

59 Let S = (S1, . . . , Sp) be a sample of pm points formed by p subsamples of m points drawn i.i.d. [sent-166, score-0.158]

60 and let wµ denote the µ-mixture weight as defined in Section 2. [sent-169, score-0.165]

61 For any µ ∈ ∆p , let wµ ∈ H denote the mixture weight vector obtained from a sample of size pm by combining the p weight vectors wk , k ∈ [1, p], each returned by conditional maximum entropy when trained on the sample Sk of size m. [sent-173, score-1.051]

62 Then, for any δ > 0, with probability at least 1 − δ: ‖wµ − w‖ ≤ E[‖wµ − w‖] + R‖µ‖√(log(1/δ)) / (λ√(m/2)). (12) For the uniform mixture µ0 = (1/p, . . . , 1/p), this bound holds with ‖µ0‖ = 1/√p. [sent-175, score-0.135]
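
Specializing the reconstructed bound (12) to the uniform mixture, under the assumption that (12) holds with probability at least 1 − δ as written above:

```latex
% Since \|\mu_0\| = \big(\sum_{k=1}^{p} (1/p)^2\big)^{1/2} = 1/\sqrt{p},
% the bound (12) becomes, with probability at least 1 - \delta,
\[
\|w_{\mu_0} - w\| \;\le\; \mathbf{E}\big[\|w_{\mu_0} - w\|\big]
  + \frac{R\sqrt{\log(1/\delta)}}{\lambda\sqrt{pm/2}} ,
\]
% i.e., the deviation term shrinks by a factor 1/\sqrt{p} relative to a
% single subsample of size m.
```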

63 Let S′ = (S1, . . . , S′k, . . . , Sp) denote a sample differing from S by one point, say in subsample Sk. [sent-184, score-0.12]

64 Let w′k denote the weight vector obtained by training on subsample S′k and w′µ the mixture weight vector associated to S′. [sent-185, score-0.735]

65 Then, by the triangle inequality and the stability bound of Theorem 1, the following holds: [sent-186, score-0.144]

66 |Υ(S′) − Υ(S)| = | ‖w′µ − w‖ − ‖wµ − w‖ | ≤ ‖w′µ − wµ‖ = ‖µk (w′k − wk)‖ ≤ 2µk R/(λm). Thus, by McDiarmid's inequality, Pr[Υ(S) − E[Υ(S)] ≥ ε] ≤ exp(−2ε² / Σ_{k=1}^p m (2µk R/(λm))²) = exp(−2λ²mε² / (4R²‖µ‖²)), (14) which proves the first statement and the uniform mixture case since ‖µ0‖ = 1/√p. [sent-187, score-0.384]
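
The McDiarmid step of (14) in display form, matching the reconstruction above:

```latex
% Changing any one of the m points of subsample S_k changes \Upsilon(S) by
% at most c_k = 2\mu_k R / (\lambda m), hence McDiarmid's inequality gives
\[
\Pr\big[\Upsilon(S) - \mathbf{E}[\Upsilon(S)] \ge \epsilon\big]
\;\le\;
\exp\!\left(\frac{-2\epsilon^{2}}
  {\sum_{k=1}^{p} m \left(\frac{2\mu_k R}{\lambda m}\right)^{2}}\right)
\;=\;
\exp\!\left(\frac{-2\lambda^{2} m\,\epsilon^{2}}{4R^{2}\|\mu\|^{2}}\right).
\]
```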

67 Theorems 2 and 3 help us compare the weight vector wpm obtained by training on a sample of size pm versus the mixture weight vector wµ0. [sent-188, score-0.442]

68 E[‖wµ0 − w‖] always converges at least as fast as the expectation E[‖wm − w‖] for a weight vector wm obtained by training on a sample of size m since, by the triangle inequality, the following holds: E[‖wµ0 − w‖] = E[‖(1/p) Σ_{k=1}^p (wk − w)‖] ≤ (1/p) Σ_{k=1}^p E[‖wk − w‖] = E[‖w1 − w‖]. [sent-195, score-0.481]

69 The convergence bound for wµ0 contains two terms, one somewhat more favorable and one somewhat less favorable than its counterpart term in the bound for wpm. [sent-198, score-0.135]

70 [Table 2 of the paper, flattened by extraction; reconstructed below. The per-dataset sparsity column is truncated in the source.] [sent-199, score-0.128]
Data set                 pm       |Y|   |X|
English POS [16]         1M       24    500K
Sentiment                9M       3     500K
RCV1-v2 [14]             26M      103   10K
Speech                   50M      129   39
Deja News Archive        306M     8     50K
Deja News Archive 250K   306M     8     250K
Gigaword [10]            1,000M   96    10K

71 4 Experiments We ran a number of experiments on data sets ranging in size from 1M to 1B labeled instances (see Table 2) to compare the three distributed training methods described in Section 2. [sent-208, score-0.213]

72 Our experiments were carried out using a large cluster of commodity machines with a local shared disk space and a high rate of connectivity between each machine and between machines and disk. [sent-209, score-0.213]

73 Thus, while the processes did not run on one multi-core supercomputer, the network latency between machines was minimized. [sent-210, score-0.305]

74 We report accuracy, wall clock, cumulative CPU usage, and cumulative network usage for all of our experiments. [sent-211, score-0.454]

75 Wall clock measures the combined effects of the user CPU and latency costs (column 1 of Table 1), and includes the total time for training, including all summations. [sent-212, score-0.301]

76 Network usage measures the amount of data transferred across the network. [sent-213, score-0.22]

77 For all three methods, we used the same base implementation of conditional maximum entropy, modified only in whether or not the gradient was computed in a distributed fashion. [sent-216, score-0.377]

78 The results reported in Table 3 show that the accuracy of the mixture weight method consistently matches or exceeds that of the majority vote method. [sent-220, score-0.685]

79 For some data sets, we could not report majority vote results, as the p models could not all fit into memory on a single machine. [sent-222, score-0.321]

80 The comparison shows that in some cases the mixture weight method takes longer and achieves somewhat better performance than the distributed gradient method, while for other data sets it terminates faster, at a slight loss in accuracy. [sent-223, score-0.654]

81 However, the results clearly demonstrate that the mixture weight method achieves comparable accuracies at a much decreased cost in network bandwidth – upwards of 1000x. [sent-225, score-0.533]

82 Depending on the cost model assessed for the underlying network and CPU resources, this may make mixture weight a significantly more appealing strategy. [sent-226, score-0.448]

83 In particular, if network usage leads to significant increases in latency, unlike our current experimental set-up of high rates of connectivity, then the mixture weight method could be substantially faster to train. [sent-227, score-0.648]

84 The outlier appears to be the acoustic speech data, where both mixture weight and distributed gradient have comparable network usage, 158GB and 200GB, respectively. [sent-228, score-0.809]

85 [Row fragment of Table 3, truncated by extraction: accuracy ...12%, wall clock 215 m, cumulative CPU 17,998 h, network usage 21 GB (m=1M, p=1k).] Table 3: Accuracy and resource costs for distributed training strategies. [sent-257, score-0.273]

86 Discarding machine-to-disk traffic, network usage is closer to 1 GB for the mixture weight method and 40 GB for the distributed gradient method. [sent-258, score-0.796]

87 As a training and evaluation corpus we used the English Gigaword corpus [10], with the full ASCII output space of that corpus of around 100 output classes (uppercase and lowercase alphabetic characters, digits, punctuation, and whitespace). [sent-260, score-0.227]

88 We then sub-sampled 1B characters from the corpus, as well as 10k testing characters, and established a training set of 1,000 subsets of 1M instances each. [sent-263, score-0.204]
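
A sketch of this data preparation; the single-character context below is an illustrative stand-in for the paper's unspecified features, and real shards would be streamed to disk rather than held in memory.

```python
from itertools import islice

def char_pairs(chars):
    """Yield (previous-char, next-char) training pairs from a character
    stream; the one-character context is an assumption for illustration."""
    it = iter(chars)
    prev = next(it)
    for c in it:
        yield (prev, c)
        prev = c

def make_shards(chars, p=1000, m=1_000_000):
    """Split the stream into p training subsets of m instances each,
    mirroring the 1,000 x 1M split described above."""
    pairs = char_pairs(chars)
    return [list(islice(pairs, m)) for _ in range(p)]
```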

89 Here, we decreased the parameter λ for the distributed gradient method since less regularization was needed when more data was available, and since there were three orders of magnitude difference between the training size for each independent model and the distributed gradient. [sent-265, score-0.557]

90 We compared only the distributed gradient and mixture weight methods since the majority vote method exceeded memory capacity. [sent-266, score-0.946]

91 On this data set, the network usage is on a different scale than most of the previous experiments, though comparable to Deja 250, with the distributed gradient method transferring 13TB across the network. [sent-267, score-0.603]

92 Overall, the mixture weight method consumes less resources: less bandwidth and less time (both wall clock and CPU). [sent-268, score-0.577]

93 With respect to accuracy, the mixture weight method does only slightly worse than the distributed gradient method. [sent-269, score-0.625]

94 The individual models in the mixture weight method ranged between 49. [sent-270, score-0.364]

95 07%, so a mixture weight model improves slightly over the random subsample models and decreases the overall variance. [sent-273, score-0.388]

96 5 Conclusion Our analysis and experiments give significant support for the mixture weight method for training very large-scale conditional maximum entropy models with L2 regularization. [sent-274, score-0.653]

97 Empirical results suggest that this method achieves similar or better accuracies while reducing network usage by about three orders of magnitude and modestly reducing the wall clock time, typically by about 15% or more. [sent-275, score-0.477]

98 In distributed environments without a high rate of connectivity, the decreased network usage of the mixture weight method should lead to substantial gains in wall clock as well. [sent-276, score-0.977]

99 A comparison of algorithms for maximum entropy parameter estimation. [sent-367, score-0.129]

100 Solving large scale linear prediction problems using stochastic gradient descent algorithms. [sent-400, score-0.123]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lzm', 0.337), ('ocpu', 0.337), ('gb', 0.256), ('fs', 0.229), ('maxent', 0.225), ('cpu', 0.204), ('vote', 0.199), ('usage', 0.171), ('mixture', 0.17), ('weight', 0.165), ('distributed', 0.138), ('xm', 0.135), ('pm', 0.128), ('gradient', 0.123), ('majority', 0.122), ('latency', 0.117), ('network', 0.113), ('deja', 0.112), ('wk', 0.107), ('bw', 0.105), ('entropy', 0.098), ('bfs', 0.098), ('pw', 0.098), ('fsk', 0.094), ('olat', 0.094), ('onet', 0.094), ('tmax', 0.094), ('mn', 0.087), ('conditional', 0.085), ('clock', 0.084), ('pn', 0.083), ('wall', 0.08), ('training', 0.075), ('della', 0.075), ('pmn', 0.075), ('wpm', 0.075), ('machines', 0.075), ('ym', 0.073), ('user', 0.072), ('zm', 0.066), ('gigaword', 0.066), ('mcdiarmid', 0.061), ('english', 0.058), ('lz', 0.056), ('inequality', 0.055), ('subsample', 0.053), ('consumes', 0.049), ('pietra', 0.049), ('transferred', 0.049), ('characters', 0.047), ('sk', 0.046), ('cumulative', 0.045), ('resources', 0.045), ('returned', 0.044), ('sp', 0.043), ('sentiment', 0.042), ('ew', 0.042), ('archive', 0.042), ('google', 0.04), ('argmax', 0.039), ('character', 0.038), ('wm', 0.038), ('differing', 0.038), ('bgs', 0.037), ('pwk', 0.037), ('radient', 0.037), ('acoustic', 0.037), ('corpus', 0.035), ('mohri', 0.035), ('traf', 0.035), ('speech', 0.034), ('connectivity', 0.033), ('adaboost', 0.033), ('linguistics', 0.033), ('penn', 0.033), ('resource', 0.032), ('bregman', 0.032), ('news', 0.031), ('batch', 0.031), ('maximum', 0.031), ('tk', 0.03), ('subsamples', 0.03), ('commodity', 0.03), ('lebanon', 0.03), ('holds', 0.03), ('bound', 0.03), ('gradients', 0.03), ('stability', 0.03), ('sample', 0.029), ('comparable', 0.029), ('triangle', 0.029), ('method', 0.029), ('costs', 0.028), ('argmin', 0.028), ('bf', 0.028), ('argminw', 0.028), ('language', 0.027), ('theorem', 0.027), ('decreased', 0.027), ('regularization', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

Author: Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, Gideon S. Mann

Abstract: Training conditional maximum entropy models on massive data sets requires significant computational resources. We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique. We also report the results of large-scale experiments comparing these three methods which demonstrate the benefits of the mixture weight method: this method consumes less resources, while achieving a performance comparable to that of standard approaches.

2 0.13434656 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f∗ : Rp → R that has an additive decomposition of the form f∗(X1, . . . , Xp) = Σ_{j∈S} h∗_j(Xj), where each component function h∗_j lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f∗(X) corrupted with additive white Gaussian noise where the covariate vectors (X1, X2, X3, ..., Xp) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max{s log(p/s)/n, s ε²_n(H)}. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ε²_n(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth-order Sobolev space), then the s-dimensional estimation term takes the form s ε²_n(H) ≍ s n^{−2m/(2m+1)}. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

3 0.085086182 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1

4 0.079684481 220 nips-2009-Slow Learners are Fast

Author: Martin Zinkevich, John Langford, Alex J. Smola

Abstract: Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. 1

5 0.074642338 129 nips-2009-Learning a Small Mixture of Trees

Author: M. P. Kumar, Daphne Koller

Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.

6 0.072778635 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

7 0.06351538 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

8 0.061231215 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

9 0.059558611 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

10 0.057975829 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

11 0.055895701 147 nips-2009-Matrix Completion from Noisy Entries

12 0.055841062 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

13 0.053260326 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

14 0.051079854 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

15 0.051022187 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

16 0.050943445 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

17 0.049074586 128 nips-2009-Learning Non-Linear Combinations of Kernels

18 0.048647515 122 nips-2009-Label Selection on Graphs

19 0.048251659 72 nips-2009-Distribution Matching for Transduction

20 0.047465298 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.164), (1, 0.052), (2, 0.002), (3, 0.014), (4, 0.009), (5, -0.035), (6, -0.032), (7, -0.015), (8, -0.045), (9, 0.065), (10, -0.01), (11, -0.04), (12, -0.001), (13, -0.011), (14, 0.044), (15, 0.016), (16, 0.059), (17, -0.081), (18, 0.008), (19, -0.043), (20, 0.007), (21, -0.143), (22, -0.012), (23, 0.024), (24, -0.068), (25, -0.03), (26, 0.012), (27, 0.097), (28, 0.076), (29, -0.018), (30, -0.03), (31, -0.114), (32, -0.011), (33, 0.007), (34, 0.028), (35, -0.051), (36, -0.146), (37, -0.029), (38, 0.024), (39, 0.074), (40, -0.082), (41, -0.104), (42, 0.043), (43, 0.099), (44, -0.124), (45, 0.146), (46, -0.061), (47, -0.134), (48, 0.101), (49, -0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93843722 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

Author: Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, Gideon S. Mann

Abstract: Training conditional maximum entropy models on massive data sets requires significant computational resources. We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique. We also report the results of large-scale experiments comparing these three methods which demonstrate the benefits of the mixture weight method: this method consumes less resources, while achieving a performance comparable to that of standard approaches.

2 0.63899261 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee

Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1

3 0.5429396 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori

Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems. 1

4 0.52820647 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the ℓ1-norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with ℓ1-regularization. 1

5 0.50632846 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

Author: Zhirong Yang, Irwin King, Zenglin Xu, Erkki Oja

Abstract: Stochastic Neighbor Embedding (SNE) has shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of the t-SNE to accommodate various heavytailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions and the second is how to optimize the objective function once the heavy-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast and as good as the best known t-SNE implementation and the other for semi-supervised visualization showing quantitative superiority using the homogeneity measure as well as qualitative advantage in cluster separation over t-SNE.

6 0.50460321 72 nips-2009-Distribution Matching for Transduction

7 0.50027752 48 nips-2009-Bootstrapping from Game Tree Search

8 0.49763861 129 nips-2009-Learning a Small Mixture of Trees

9 0.48926812 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

10 0.48786849 76 nips-2009-Efficient Learning using Forward-Backward Splitting

11 0.4614549 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

12 0.45019513 51 nips-2009-Clustering sequence sets for motif discovery

13 0.44802585 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

14 0.44495362 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

15 0.43686408 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

16 0.4365887 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

17 0.43167716 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

18 0.42999417 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

19 0.42865595 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

20 0.41705325 33 nips-2009-Analysis of SVM with Indefinite Kernels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.065), (25, 0.075), (35, 0.027), (36, 0.113), (39, 0.026), (58, 0.052), (61, 0.025), (71, 0.038), (81, 0.411), (86, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9684822 200 nips-2009-Reconstruction of Sparse Circuits Using Multi-neuronal Excitation (RESCUME)

Author: Tao Hu, Anthony Leonardo, Dmitri B. Chklovskii

Abstract: One of the central problems in neuroscience is reconstructing synaptic connectivity in neural circuits. Synapses onto a neuron can be probed by sequentially stimulating potentially pre-synaptic neurons while monitoring the membrane voltage of the post-synaptic neuron. Reconstructing a large neural circuit using such a “brute force” approach is rather time-consuming and inefficient because the connectivity in neural circuits is sparse. Instead, we propose to measure a post-synaptic neuron’s voltage while stimulating sequentially random subsets of multiple potentially pre-synaptic neurons. To reconstruct these synaptic connections from the recorded voltage we apply a decoding algorithm recently developed for compressive sensing. Compared to the brute force approach, our method promises significant time savings that grow with the size of the circuit. We use computer simulations to find optimal stimulation parameters and explore the feasibility of our reconstruction method under realistic experimental conditions including noise and non-linear synaptic integration. Multineuronal stimulation allows reconstructing synaptic connectivity just from the spiking activity of post-synaptic neurons, even when sub-threshold voltage is unavailable. By using calcium indicators, voltage-sensitive dyes, or multi-electrode arrays one could monitor activity of multiple postsynaptic neurons simultaneously, thus mapping their synaptic inputs in parallel, potentially reconstructing a complete neural circuit. 1 In tro d uc tio n Understanding information processing in neural circuits requires systematic characterization of synaptic connectivity [1, 2]. The most direct way to measure synapses between a pair of neurons is to stimulate potentially pre-synaptic neuron while recording intra-cellularly from the potentially post-synaptic neuron [3-8]. This method can be scaled to reconstruct multiple synaptic connections onto one neuron by combining intracellular recordings from the postsynaptic neuron with photo-activation of pre-synaptic neurons using glutamate uncaging [913] or channelrhodopsin [14, 15], or with multi-electrode arrays [16, 17]. Neurons are sequentially stimulated to fire action potentials by scanning a laser beam (or electrode voltage) over a brain slice, while synaptic weights are measured by recording post-synaptic voltage. Although sequential excitation of single potentially pre-synaptic neurons could reveal connectivity, such a “brute force” approach is inefficient because the connectivity among neurons is sparse. Even among nearby neurons in the cerebral cortex, the probability of connection is only about ten percent [3-8]. Connection probability decays rapidly with the 1 distance between neurons and falls below one percent on the scale of a cortical column [3, 8]. Thus, most single-neuron stimulation trials would result in zero response making the brute force approach slow, especially for larger circuits. Another drawback of the brute force approach is that single-neuron stimulation cannot be combined efficiently with methods allowing parallel recording of neural activity, such as calcium imaging [18-22], voltage-sensitive dyes [23-25] or multi-electrode arrays [17, 26]. As these techniques do not reliably measure sub-threshold potential but report only spiking activity, they would reveal only the strongest connections that can drive a neuron to fire [2730]. Therefore, such combination would reveal only a small fraction of the circuit. 
We propose to circumvent the above limitations of the brute force approach by stimulating multiple potentially pre-synaptic neurons simultaneously and reconstructing individual connections by using a recently developed method called compressive sensing (CS) [31-35]. In each trial, we stimulate F neurons randomly chosen out of N potentially pre-synaptic neurons and measure post-synaptic activity. Although each measurement yields only a combined response to stimulated neurons, if synaptic inputs sum linearly in a post-synaptic neuron, one can reconstruct the weights of individual connections by using an optimization algorithm. Moreover, if the synaptic connections are sparse, i.e. only K << N potentially pre-synaptic neurons make synaptic connections onto a post-synaptic neuron, the required number of trials M ~ K log(N/K), which is much less than N [31-35]. The proposed method can be used even if only spiking activity is available. Because multiple neurons are driven to fire simultaneously, if several of them synapse on the post-synaptic neuron, they can induce one or more spikes in that neuron. As quantized spike counts carry less information than analog sub-threshold voltage recordings, reconstruction requires a larger number of trials. Yet, the method can be used to reconstruct a complete feedforward circuit from spike recordings. Reconstructing neural circuit with multi-neuronal excitation may be compared with mapping retinal ganglion cell receptive fields. Typically, photoreceptors are stimulated by white-noise checkerboard stimulus and the receptive field is obtained by Reverse Correlation (RC) in case of sub-threshold measurements or Spike-Triggered Average (STA) of the stimulus [36, 37]. Although CS may use the same stimulation protocol, for a limited number of trials, the reconstruction quality is superior to RC or STA. 2 Ma pp i ng sy na pti c inp ut s o nto o n e ne uro n We start by formalizing the problem of mapping synaptic connections from a population of N potentially pre-synaptic neurons onto a single neuron, as exemplified by granule cells synapsing onto a Purkinje cell (Figure 1a). Our experimental protocol can be illustrated using linear algebra formalism, Figure 1b. We represent synaptic weights as components of a column vector x, where zeros represent non-existing connections. Each row in the stimulation matrix A represents a trial, ones indicating neurons driven to spike once and zeros indicating non-spiking neurons. The number of rows in the stimulation matrix A is equal to the number of trials M. The column vector y represents M measurements of membrane voltage obtained by an intra-cellular recording from the post-synaptic neuron: y = Ax. (1) In order to recover individual synaptic weights, Eq. (1) must be solved for x. RC (or STA) solution to this problem is x = (ATA)-1AT y, which minimizes (y-Ax)2 if M>N. In the case M << N for a sparse circuit. In this section we search computationally for the minimum number of trials required for exact reconstruction as a function of the number of non-zero synaptic weights K out of N potentially pre-synaptic neurons. First, note that the number of trials depends on the number of stimulated neurons F. If F = 1 we revert to the brute force approach and the number of measurements is N, while for F = N, the measurements are redundant and no finite number suffices. As the minimum number of measurements is expected to scale as K logN, there must be an optimal F which makes each measurement most informative about x. 
To determine the optimal number of stimulated neurons F for given K and N, we search for the minimum number of trials M, which allows a perfect reconstruction of the synaptic connectivity x. For each F, we generate 50 synaptic weight vectors and attempt reconstruction from sequentially increasing numbers of trials. The value of M, at which all 50 recoveries are successful (up to computer round-off error), estimates the number of trial needed for reconstruction with probability higher than 98%. By repeating this procedure 50 times for each F, we estimate the mean and standard deviation of M. We find that, for given N and K, the minimum number of trials, M, as a function of the number of stimulated neurons, F, has a shallow minimum. As K decreases, the minimum shifts towards larger F because more neurons should be stimulated simultaneously for sparser x. For the explored range of simulation parameters, the minimum is located close to 0.1N. Next, we set F = 0.1N and explore how the minimum number of measurements required for exact reconstruction depends on K and N. Results of the simulations following the recipe described above are shown in Figure 3a. As expected, when x is sparse, M grows approximately linearly with K (Figure 3b), and logarithmically with N (Figure 3c). N = 1000 180 25 160 20 140 120 15 100 10 80 5 150 250 400 650 1000 Number of potential connections (N) K = 30 220 200 (a) 200 220 Number of measurements (M) 30 Number of measurements (M) Number of actual connections (K) Number of necessary measurements (M) (b) 180 160 140 120 100 80 5 10 15 20 25 30 Number of actual connections (K) 210 (c) 200 190 180 170 160 150 140 130 120 2 10 10 3 Number of potential connections (N) Figure 3: a) Minimum number of measurements M required for reconstruction as a function of the number of actual synapses, K, and the number of potential synapses, N. b) For given N, we find M ~ K. c) For given K, we find M ~ logN (note semi-logarithmic scale in c). 4 R o b ust nes s o f re con st r uc t io n s t o noi se a n d v io la tio n o f si m pli fy in g a ss umpt io n s To make our simulation more realistic we now take into account three possible sources of noise: 1) In reality, post-synaptic voltage on a given synapse varies from trial to trial [4, 5, 46-52], an effect we call synaptic noise. Such noise detrimentally affects reconstructions because each row of A is multiplied by a different instantiation of vector x. 2) Stimulation of neurons may be imprecise exciting a slightly different subset of neurons than intended and/or firing intended neurons multiple times. We call this effect stimulation noise. Such noise detrimentally affects reconstructions because, in its presence, the actual measurement matrix A is different from the one used for recovery. 3) A synapse may fail to release neurotransmitter with some probability. Naturally, in the presence of noise, reconstructions cannot be exact. We quantify the 4 reconstruction x − xr l2 = error ∑ N i =1 by the normalized x − xr l2–error l2 / xl , where 2 ( xi − xri ) 2 . We plot normalized reconstruction error in brute force approach (M = N = 500 trials) as a function of noise, as well as CS and RC reconstruction errors (M = 200, 600 trials), Figure 4. 2 0.9 Normalized reconstruction error ||x-x|| /||x|| 1 r 2 For each noise source, the reconstruction error of the brute force approach can be achieved with 60% fewer trials by CS method for the above parameters (Figure 4). For the same number of trials, RC method performs worse. 
Naturally, the reconstruction error decreases with the number of trials. The reconstruction error is most sensitive to stimulation noise and least sensitive to synaptic noise. 1 (a) 1 (b) 0.9 0.8 (c) 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.4 0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1 0 0 0.7 RC: M=200 RC: M=600 Brute force method: M=500 CS: M=200 CS: M=600 0.6 0.5 0 0.05 0.1 0.15 Synaptic noise level 0.2 0.25 0 0.05 0.1 0.15 Stimulation noise level 0.2 0.25 0 0 0.05 0.1 0.15 0.2 0.25 Synaptic failure probability Figure 4: Impact of noise on the reconstruction quality for N = 500, K = 30, F = 50. a) Recovery error due to trial-to-trial variation in synaptic weight. The response y is calculated using the synaptic connectivity x perturbed by an additive Gaussian noise. The noise level is given by the coefficient of variation of synaptic weight. b) Recovery error due to stimulation noise. The matrix A used for recovery is obtained from the binary matrix used to calculate the measurement vector y by shifting, in each row, a fraction of ones specified by the noise level to random positions. c) Recovery error due to synaptic failures. The detrimental effect of the stimulation noise on the reconstruction can be eliminated by monitoring spiking activity of potentially pre-synaptic neurons. By using calcium imaging [18-22], voltage-sensitive dyes [23] or multi-electrode arrays [17, 26] one could record the actual stimulation matrix. Because most random matrices satisfy the reconstruction requirements [31, 34, 35], the actual stimulation matrix can be used for a successful recovery instead of the intended one. If neuronal activity can be monitored reliably, experiments can be done in a different mode altogether. Instead of stimulating designated neurons with high fidelity by using highly localized and intense light, one could stimulate all neurons with low probability. Random firing events can be detected and used in the recovery process. The light intensity can be tuned to stimulate the optimal number of neurons per trial. Next, we explore the sensitivity of the proposed reconstruction method to the violation of simplifying assumptions. First, whereas our simulation assumes that the actual number of connections, K, is known, in reality, connectivity sparseness is known a priori only approximately. Will this affect reconstruction results? In principle, CS does not require prior knowledge of K for reconstruction [31, 34, 35]. For the CoSaMP algorithm, however, it is important to provide value K larger than the actual value (Figure 5a). Then, the algorithm will find all the actual synaptic weights plus some extra non-zero weights, negligibly small when compared to actual ones. Thus, one can provide the algorithm with the value of K safely larger than the actual one and then threshold the reconstruction result according to the synaptic noise level. Second, whereas we assumed a linear summation of inputs [53], synaptic integration may be 2 non-linear [54]. We model non-linearity by setting y = yl + α yl , where yl represents linearly summed synaptic inputs. Results of simulations (Figure 5b) show that although nonlinearity can significantly degrade CS reconstruction quality, it still performs better than RC. 
5 (b) 0.45 0.4 Actual K = 30 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 10 20 30 40 50 60 Normalized reconstruction error ||x-x||2/||x|| r 2 Normalized reconstrcution error ||x-x||2/||x|| r 2 (a) 0.9 0.8 0.7 CS RC 0.6 0.5 0.4 0.3 0.2 0.1 0 -0.15-0.12-0.09-0.06-0.03 0 0.03 0.06 0.09 0.12 0.15 Relative strength of the non-linear term α × mean(y ) K fed to CoSaMP l Figure 5: Sensitivity of reconstruction error to the violation of simplifying assumptions for N = 500, K = 30, M = 200, F = 50. a) The quality of the reconstruction is not affected if the CoSaMP algorithm is fed with the value of K larger than actual. b) Reconstruction error computed in 100 realizations for each value of the quadratic term relative to the linear term. 5 Ma pp i ng sy na pti c inp ut s o nto a n e uro na l po pu la tio n Until now, we considered reconstruction of synaptic inputs onto one neuron using subthreshold measurements of its membrane potential. In this section, we apply CS to reconstructing synaptic connections onto a population of potentially post-synaptic neurons. Because in CS the choice of stimulated neurons is non-adaptive, by recording from all potentially post-synaptic neurons in response to one sequence of trials one can reconstruct a complete feedforward network (Figure 6). (a) x p(y=1) A Normalized reconstruction error ||x/||x||2−xr/||xr||2||2 y (b) (c) 100 (d) 500 700 900 1100 Number of spikes 1 STA CS 0.8 0.6 0.4 0.2 0 1000 0 300 3000 5000 7000 9000 Number of trials (M) Ax Figure 6: Mapping of a complete feedforward network. a) Each post-synaptic neuron (red) receives synapses from a sparse subset of potentially pre-synaptic neurons (blue). b) Linear algebra representation of the experimental protocol. c) Probability of firing as a function of synaptic current. d) Comparison of CS and STA reconstruction error using spike trains for N = 500, K = 30 and F = 50. Although attractive, such parallelization raises several issues. First, patching a large number of neurons is unrealistic and, therefore, monitoring membrane potential requires using different methods, such as calcium imaging [18-22], voltage sensitive dyes [23-25] or multielectrode arrays [17, 26]. As these methods can report reliably only spiking activity, the measurement is not analog but discrete. Depending on the strength of summed synaptic inputs compared to the firing threshold, the postsynaptic neuron may be silent, fire once or multiple times. As a result, the measured response y is quantized by the integer number of spikes. Such quantized measurements are less informative than analog measurements of the sub-threshold membrane potential. In the extreme case of only two quantization levels, spike or no spike, each measurement contains only 1 bit of information. Therefore, to achieve reasonable reconstruction quality using quantized measurements, a larger number of trials M>>N is required. We simulate circuit reconstruction from spike recordings in silico as follows. First, we draw synaptic weights from an experimentally motivated distribution. Second, we generate a 6 random stimulation matrix and calculate the product Ax. Third, we linear half-wave rectify this product and use the result as the instantaneous firing rate for the Poisson spike generator (Figure 6c). We used a rectifying threshold that results in 10% of spiking trials as typically observed in experiments. Fourth, we reconstruct synaptic weights using STA and CS and compare the results with the generated weights. 
We calculated mean error over 100 realizations of the simulation protocol (Figure 6d). Due to the non-linear spike generating procedure, x can be recovered only up to a scaling factor. We propose to calibrate x with a few brute-force measurements of synaptic weights. Thus, in calculating the reconstruction error using l2 norm, we normalize both the generated and recovered synaptic weights. Such definition is equivalent to the angular error, which is often used to evaluate the performance of STA in mapping receptive field [37, 55]. Why is CS superior to STA for a given number of trials (Figure 6d)? Note that spikeless trials, which typically constitute a majority, also carry information about connectivity. While STA discards these trials, CS takes them into account. In particular, CoSaMP starts with the STA solution as zeroth iteration and improves on it by using the results of all trials and the sparseness prior. 6 D i s c uss ion We have demonstrated that sparse feedforward networks can be reconstructed by stimulating multiple potentially pre-synaptic neurons simultaneously and monitoring either subthreshold or spiking response of potentially post-synaptic neurons. When sub-threshold voltage is recorded, significantly fewer measurements are required than in the brute force approach. Although our method is sensitive to noise (with stimulation noise worse than synapse noise), it is no less robust than the brute force approach or RC. The proposed reconstruction method can also recover inputs onto a neuron from spike counts, albeit with more trials than from sub-threshold potential measurements. This is particularly useful when intra-cellular recordings are not feasible and only spiking can be detected reliably, for example, when mapping synaptic inputs onto multiple neurons in parallel. For a given number of trials, our method yields smaller error than STA. The proposed reconstruction method assumes linear summation of synaptic inputs (both excitatory and inhibitory) and is sensitive to non-linearity of synaptic integration. Therefore, it is most useful for studying connections onto neurons, in which synaptic integration is close to linear. On the other hand, multi-neuron stimulation is closer than single-neuron stimulation to the intrinsic activity in the live brain and can be used to study synaptic integration under realistic conditions. In contrast to circuit reconstruction using intrinsic neuronal activity [56, 57], our method relies on extrinsic stimulation of neurons. Can our method use intrinsic neuronal activity instead? We see two major drawbacks of such approach. First, activity of non-monitored presynaptic neurons may significantly distort reconstruction results. Thus, successful reconstruction would require monitoring all active pre-synaptic neurons, which is rather challenging. Second, reliable reconstruction is possible only when the activity of presynaptic neurons is uncorrelated. Yet, their activity may be correlated, for example, due to common input. We thank Ashok Veeraraghavan for introducing us to CS, Anthony Leonardo for making a retina dataset available for the analysis, Lou Scheffer and Hong Young Noh for commenting on the manuscript and anonymous reviewers for helpful suggestions. References [1] Luo, L., Callaway, E.M. & Svoboda, K. (2008) Genetic dissection of neural circuits. Neuron 57(5):634-660. [2] Helmstaedter, M., Briggman, K.L. & Denk, W. (2008) 3D structural imaging of the brain with photons and electrons. Current opinion in neurobiology 18(6):633-641. 
[3] Holmgren, C., Harkany, T., Svennenfors, B. & Zilberter, Y. (2003) Pyramidal cell communication within local networks in layer 2/3 of rat neocortex. Journal of Physiology 551:139-153.
[4] Markram, H. (1997) A network of tufted layer 5 pyramidal neurons. Cerebral Cortex 7(6):523-533.
[5] Markram, H., Lubke, J., Frotscher, M., Roth, A. & Sakmann, B. (1997) Physiology and anatomy of synaptic connections between thick tufted pyramidal neurones in the developing rat neocortex. Journal of Physiology 500(2):409-440.
[6] Thomson, A.M. & Bannister, A.P. (2003) Interlaminar connections in the neocortex. Cerebral Cortex 13(1):5-14.
[7] Thomson, A.M., West, D.C., Wang, Y. & Bannister, A.P. (2002) Synaptic connections and small circuits involving excitatory and inhibitory neurons in layers 2-5 of adult rat and cat neocortex: triple intracellular recordings and biocytin labelling in vitro. Cerebral Cortex 12(9):936-953.
[8] Song, S., Sjostrom, P.J., Reigl, M., Nelson, S. & Chklovskii, D.B. (2005) Highly nonrandom features of synaptic connectivity in local cortical circuits. PLoS Biology 3(3):e68.
[9] Callaway, E.M. & Katz, L.C. (1993) Photostimulation using caged glutamate reveals functional circuitry in living brain slices. Proceedings of the National Academy of Sciences of the United States of America 90(16):7661-7665.
[10] Dantzker, J.L. & Callaway, E.M. (2000) Laminar sources of synaptic input to cortical inhibitory interneurons and pyramidal neurons. Nature Neuroscience 3(7):701-707.
[11] Shepherd, G.M. & Svoboda, K. (2005) Laminar and columnar organization of ascending excitatory projections to layer 2/3 pyramidal neurons in rat barrel cortex. Journal of Neuroscience 25(24):5670-5679.
[12] Nikolenko, V., Poskanzer, K.E. & Yuste, R. (2007) Two-photon photostimulation and imaging of neural circuits. Nature Methods 4(11):943-950.
[13] Shoham, S., O'Connor, D.H., Sarkisov, D.V. & Wang, S.S. (2005) Rapid neurotransmitter uncaging in spatially defined patterns. Nature Methods 2(11):837-843.
[14] Gradinaru, V., Thompson, K.R., Zhang, F., Mogri, M., Kay, K., Schneider, M.B. & Deisseroth, K. (2007) Targeting and readout strategies for fast optical neural control in vitro and in vivo. Journal of Neuroscience 27(52):14231-14238.
[15] Petreanu, L., Huber, D., Sobczyk, A. & Svoboda, K. (2007) Channelrhodopsin-2-assisted circuit mapping of long-range callosal projections. Nature Neuroscience 10(5):663-668.
[16] Na, L., Watson, B.O., Maclean, J.N., Yuste, R. & Shepard, K.L. (2008) A 256×256 CMOS microelectrode array for extracellular neural stimulation of acute brain slices. 2008 IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers.
[17] Fujisawa, S., Amarasingham, A., Harrison, M.T. & Buzsaki, G. (2008) Behavior-dependent short-term assembly dynamics in the medial prefrontal cortex. Nature Neuroscience 11(7):823-833.
[18] Ikegaya, Y., Aaron, G., Cossart, R., Aronov, D., Lampl, I., Ferster, D. & Yuste, R. (2004) Synfire chains and cortical songs: temporal modules of cortical activity. Science 304(5670):559-564.
[19] Ohki, K., Chung, S., Ch'ng, Y.H., Kara, P. & Reid, R.C. (2005) Functional imaging with cellular resolution reveals precise micro-architecture in visual cortex. Nature 433(7026):597-603.
[20] Stosiek, C., Garaschuk, O., Holthoff, K. & Konnerth, A. (2003) In vivo two-photon calcium imaging of neuronal networks. Proceedings of the National Academy of Sciences of the United States of America 100(12):7319-7324.
[21] Svoboda, K., Denk, W., Kleinfeld, D. & Tank, D.W. (1997) In vivo dendritic calcium dynamics in neocortical pyramidal neurons. Nature 385(6612):161-165.
[22] Sasaki, T., Minamisawa, G., Takahashi, N., Matsuki, N. & Ikegaya, Y. (2009) Reverse optical trawling for synaptic connections in situ. Journal of Neurophysiology 102(1):636-643.
[23] Zecevic, D., Djurisic, M., Cohen, L.B., Antic, S., Wachowiak, M., Falk, C.X. & Zochowski, M.R. (2003) Imaging nervous system activity with voltage-sensitive dyes. Current Protocols in Neuroscience, Chapter 6, Unit 6.17.
[24] Cacciatore, T.W., Brodfuehrer, P.D., Gonzalez, J.E., Jiang, T., Adams, S.R., Tsien, R.Y., Kristan, W.B., Jr. & Kleinfeld, D. (1999) Identification of neural circuits by imaging coherent electrical activity with FRET-based dyes. Neuron 23(3):449-459.
[25] Taylor, A.L., Cottrell, G.W., Kleinfeld, D. & Kristan, W.B., Jr. (2003) Imaging reveals synaptic targets of a swim-terminating neuron in the leech CNS. Journal of Neuroscience 23(36):11402-11410.
[26] Hutzler, M., Lambacher, A., Eversmann, B., Jenkner, M., Thewes, R. & Fromherz, P. (2006) High-resolution multitransistor array recording of electrical field potentials in cultured brain slices. Journal of Neurophysiology 96(3):1638-1645.
[27] Egger, V., Feldmeyer, D. & Sakmann, B. (1999) Coincidence detection and changes of synaptic efficacy in spiny stellate neurons in rat barrel cortex. Nature Neuroscience 2(12):1098-1105.
[28] Feldmeyer, D., Egger, V., Lubke, J. & Sakmann, B. (1999) Reliable synaptic connections between pairs of excitatory layer 4 neurones within a single 'barrel' of developing rat somatosensory cortex. Journal of Physiology 521:169-190.
[29] Peterlin, Z.A., Kozloski, J., Mao, B.Q., Tsiola, A. & Yuste, R. (2000) Optical probing of neuronal circuits with calcium indicators. Proceedings of the National Academy of Sciences of the United States of America 97(7):3619-3624.
[30] Thomson, A.M., Deuchars, J. & West, D.C. (1993) Large, deep layer pyramid-pyramid single axon EPSPs in slices of rat motor cortex display paired pulse and frequency-dependent depression, mediated presynaptically and self-facilitation, mediated postsynaptically. Journal of Neurophysiology 70(6):2354-2369.
[31] Baraniuk, R.G. (2007) Compressive sensing. IEEE Signal Processing Magazine 24(4):118-120.
[32] Candes, E.J. (2008) Compressed sensing. Twenty-Second Annual Conference on Neural Information Processing Systems, Tutorials.
[33] Candes, E.J., Romberg, J.K. & Tao, T. (2006) Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics 59(8):1207-1223.
[34] Candes, E.J. & Tao, T. (2006) Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory 52(12):5406-5425.
[35] Donoho, D.L. (2006) Compressed sensing. IEEE Transactions on Information Theory 52(4):1289-1306.
[36] Ringach, D. & Shapley, R. (2004) Reverse correlation in neurophysiology. Cognitive Science 28:147-166.
[37] Schwartz, O., Pillow, J.W., Rust, N.C. & Simoncelli, E.P. (2006) Spike-triggered neural characterization. Journal of Vision 6(4):484-507.
[38] Candes, E.J. & Tao, T. (2005) Decoding by linear programming. IEEE Transactions on Information Theory 51(12):4203-4215.
[39] Needell, D. & Vershynin, R. (2009) Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathematics 9(3):317-334.
[40] Tropp, J.A. & Gilbert, A.C. (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory 53(12):4655-4666.
[41] Dai, W. & Milenkovic, O. (2009) Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory 55(5):2230-2249.
[42] Needell, D. & Tropp, J.A. (2009) CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis 26(3):301-321.
[43] Varshney, L.R., Sjostrom, P.J. & Chklovskii, D.B. (2006) Optimal information storage in noisy synapses under resource constraints. Neuron 52(3):409-423.
[44] Brunel, N., Hakim, V., Isope, P., Nadal, J.P. & Barbour, B. (2004) Optimal information storage and the distribution of synaptic weights: perceptron versus Purkinje cell. Neuron 43(5):745-757.
[45] Napoletani, D. & Sauer, T.D. (2008) Reconstructing the topology of sparsely connected dynamical networks. Physical Review E 77(2):026103.
[46] Allen, C. & Stevens, C.F. (1994) An evaluation of causes for unreliability of synaptic transmission. Proceedings of the National Academy of Sciences of the United States of America 91(22):10380-10383.
[47] Hessler, N.A., Shirke, A.M. & Malinow, R. (1993) The probability of transmitter release at a mammalian central synapse. Nature 366(6455):569-572.
[48] Isope, P. & Barbour, B. (2002) Properties of unitary granule cell → Purkinje cell synapses in adult rat cerebellar slices. Journal of Neuroscience 22(22):9668-9678.
[49] Mason, A., Nicoll, A. & Stratford, K. (1991) Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. Journal of Neuroscience 11(1):72-84.
[50] Raastad, M., Storm, J.F. & Andersen, P. (1992) Putative single quantum and single fibre excitatory postsynaptic currents show similar amplitude range and variability in rat hippocampal slices. European Journal of Neuroscience 4(1):113-117.
[51] Rosenmund, C., Clements, J.D. & Westbrook, G.L. (1993) Nonuniform probability of glutamate release at a hippocampal synapse. Science 262(5134):754-757.
[52] Sayer, R.J., Friedlander, M.J. & Redman, S.J. (1990) The time course and amplitude of EPSPs evoked at synapses between pairs of CA3/CA1 neurons in the hippocampal slice. Journal of Neuroscience 10(3):826-836.
[53] Cash, S. & Yuste, R. (1999) Linear summation of excitatory inputs by CA1 pyramidal neurons. Neuron 22(2):383-394.
[54] Polsky, A., Mel, B.W. & Schiller, J. (2004) Computational subunits in thin dendrites of pyramidal cells. Nature Neuroscience 7(6):621-627.
[55] Paninski, L. (2003) Convergence properties of three spike-triggered analysis techniques. Network: Computation in Neural Systems 14(3):437-464.
[56] Okatan, M., Wilson, M.A. & Brown, E.N. (2005) Analyzing functional connectivity using a network likelihood model of ensemble neural spiking activity. Neural Computation 17(9):1927-1961.
[57] Timme, M. (2007) Revealing network connectivity from response dynamics. Physical Review Letters 98(22):224101.

2 0.8243618 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation

Author: Mikkel Schmidt

Abstract: We present a general Bayesian approach to probabilistic matrix factorization subject to linear constraints. The approach is based on a Gaussian observation model and Gaussian priors with bilinear equality and inequality constraints. We present an efficient Markov chain Monte Carlo inference procedure based on Gibbs sampling. Special cases of the proposed model are Bayesian formulations of nonnegative matrix factorization and factor analysis. The method is evaluated on a blind source separation problem. We demonstrate that our algorithm can be used to extract meaningful and interpretable features that are remarkably different from features extracted using existing related matrix factorization techniques.
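As a concrete illustration of the Gibbs step for the nonnegative special case mentioned in this abstract, the sketch below samples each column of a factor from its rectified-Gaussian full conditional. It is a toy rendering, not the paper's implementation: the noise precision tau, the prior variance, the initialization, and the restriction to plain nonnegativity (rather than general bilinear equality and inequality constraints) are all assumptions of this sketch.

import numpy as np
from scipy.stats import truncnorm

def gibbs_sweep_left(X, A, B, tau=100.0, prior_var=1.0):
    # One Gibbs sweep over the left factor A in X ~ N(A B, 1/tau),
    # with independent N(0, prior_var) priors truncated to [0, inf).
    R = A.shape[1]
    for r in range(R):
        resid = X - A @ B + np.outer(A[:, r], B[r, :])     # data minus other components
        lam = tau * (B[r, :] ** 2).sum() + 1.0 / prior_var # conditional precision
        sd = 1.0 / np.sqrt(lam)
        mu = tau * (resid @ B[r, :]) / lam                 # conditional means, one per row
        A[:, r] = truncnorm.rvs(-mu / sd, np.inf, loc=mu, scale=sd)
    return A

def gibbs_bnmf(X, R, sweeps=200, seed=0):
    rng = np.random.default_rng(seed)
    I, J = X.shape
    A = rng.uniform(0.1, 1.0, (I, R))
    B = rng.uniform(0.1, 1.0, (R, J))
    for _ in range(sweeps):
        A = gibbs_sweep_left(X, A, B)
        B = gibbs_sweep_left(X.T, B.T, A.T).T   # same update applied to the transposed model
    return A, B                                 # one (approximate) posterior sample

# toy usage: recover a nonnegative rank-3 factorization
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (30, 3)) @ rng.uniform(0, 1, (3, 40)) + 0.1 * rng.normal(size=(30, 40))
A, B = gibbs_bnmf(X, R=3)
print(np.linalg.norm(X - A @ B) / np.linalg.norm(X))

The transpose trick in gibbs_bnmf works because X ~ N(AB, 1/tau) implies X.T ~ N(B.T A.T, 1/tau), so a single sweep routine serves both factors.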

same-paper 3 0.81313699 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

Author: Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, Gideon S. Mann

Abstract: Training conditional maximum entropy models on massive data sets requires significant computational resources. We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique. We also report the results of large-scale experiments comparing these three methods which demonstrate the benefits of the mixture weight method: this method consumes less resources, while achieving a performance comparable to that of standard approaches.

4 0.73415548 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

Author: Kyomin Jung, Pushmeet Kohli, Devavrat Shah

Abstract: We consider the question of computing the Maximum A Posteriori (MAP) assignment in an arbitrary pair-wise Markov Random Field (MRF). We present a randomized iterative algorithm based on simple local updates. The algorithm, starting with an arbitrary initial assignment, updates it in each iteration by first picking a random node, then selecting an (appropriately chosen) random local neighborhood and optimizing over this local neighborhood. Somewhat surprisingly, we show that this algorithm finds a near-optimal assignment within n log² n iterations with high probability for any n-node pair-wise MRF with geometry (i.e. an MRF graph with polynomial growth), with the approximation error depending (in a reasonable manner) on the geometric growth rate of the graph and the average radius of the local neighborhood; this allows for a graceful tradeoff between the complexity of the algorithm and the approximation error. Through extensive simulations, we show that our algorithm finds extremely good approximate solutions for various kinds of MRFs with geometry. A toy rendering of the local-update scheme is sketched below.
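The sketch runs the local-update scheme on a binary grid MRF. Everything concrete here (the grid topology, Gaussian potentials, the capped neighborhood radius that keeps brute-force enumeration cheap, and the iteration count) is an assumption of this sketch, not taken from the paper, which optimizes over larger neighborhoods exactly.

import itertools
import numpy as np

rng = np.random.default_rng(0)
L = 20                                    # L x L grid of binary variables
theta = rng.normal(size=(L, L))           # node potentials
Wr = rng.normal(size=(L, L - 1))          # couplings on rightward edges
Wd = rng.normal(size=(L - 1, L))          # couplings on downward edges

def neighbors(i, j):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < L and 0 <= j + dj < L:
            yield (i + di, j + dj)

def edge_w(a, b):
    (i, j), (k, l) = sorted((a, b))
    return Wr[i, j] if i == k else Wd[i, j]

def local_score(x, ball, labels):
    # score of the nodes in `ball` under `labels`, with the boundary of x held fixed
    lab = dict(zip(ball, labels))
    s = 0.0
    for v in ball:
        s += theta[v] * lab[v]
        for u in neighbors(*v):
            if u in lab and u < v:        # internal edge: count it only once
                continue
            s += edge_w(v, u) * lab[v] * lab.get(u, x[u])
    return s

x = rng.integers(0, 2, size=(L, L))
for _ in range(5000):
    v0 = (int(rng.integers(L)), int(rng.integers(L)))   # pick a random node
    r = min(int(rng.geometric(0.5)) - 1, 1)             # random radius, capped at 1
    ball = {v0}
    for _ in range(r):                                  # grow the ball around v0
        ball |= {u for v in ball for u in neighbors(*v)}
    ball = sorted(ball)
    best = max(itertools.product((0, 1), repeat=len(ball)),
               key=lambda labels: local_score(x, ball, labels))
    for v, lbl in zip(ball, best):                      # commit the local optimum
        x[v] = lbl

def global_score(x):
    return ((theta * x).sum()
            + (Wr * x[:, :-1] * x[:, 1:]).sum()
            + (Wd * x[:-1, :] * x[1:, :]).sum())

print("assignment score:", global_score(x))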

5 0.58061063 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning

Author: Steven Chase, Andrew Schwartz, Wolfgang Maass, Robert A. Legenstein

Abstract: The control of neuroprosthetic devices from the activity of motor cortex neurons benefits from learning effects where the function of these neurons is adapted to the control task. It was recently shown that tuning properties of neurons in monkey motor cortex are adapted selectively in order to compensate for an erroneous interpretation of their activity. In particular, it was shown that the tuning curves of those neurons whose preferred directions had been misinterpreted changed more than those of other neurons. In this article, we show that the experimentally observed self-tuning properties of the system can be explained on the basis of a simple learning rule. This learning rule utilizes neuronal noise for exploration and performs Hebbian weight updates that are modulated by a global reward signal. In contrast to most previously proposed reward-modulated Hebbian learning rules, this rule does not require extraneous knowledge about what is noise and what is signal. The learning rule is able to optimize the performance of the model system within biologically realistic periods of time and under high noise levels. When the neuronal noise is fitted to experimental data, the model produces learning effects similar to those found in monkey experiments.
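A scalar caricature of such a rule is sketched below: a single linear readout explores with additive noise, receives only a global reward, and correlates the deviations of reward and of its own total activity from running averages. The learning rate, noise level, averaging constants, and the use of a single global running mean (rather than an input-specific one) are simplifying assumptions of this sketch, not the paper's model.

import numpy as np

rng = np.random.default_rng(0)
n_in, eta = 20, 0.002
w_target = rng.normal(size=n_in)      # readout the task implicitly rewards
w = np.zeros(n_in)                    # learned weights
R_bar = y_bar = 0.0                   # slow running averages (baselines)

for t in range(20000):
    x = rng.normal(size=n_in)         # presynaptic activity
    y = w @ x + 0.5 * rng.normal()    # postsynaptic activity incl. exploratory noise
    R = -(y - w_target @ x) ** 2      # global scalar reward
    # reward-modulated Hebbian update: the rule sees only the total activity y
    # and never needs to know which part of it was noise
    w += eta * (R - R_bar) * (y - y_bar) * x
    R_bar += 0.05 * (R - R_bar)
    y_bar += 0.05 * (y - y_bar)

cos = w @ w_target / (np.linalg.norm(w) * np.linalg.norm(w_target))
print(f"cosine similarity to target readout: {cos:.2f}")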

6 0.55588764 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

7 0.54996663 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs

8 0.54123998 121 nips-2009-Know Thy Neighbour: A Normative Theory of Synaptic Depression

9 0.53523701 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

10 0.50409162 13 nips-2009-A Neural Implementation of the Kalman Filter

11 0.49482745 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

12 0.49259263 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

13 0.47196934 70 nips-2009-Discriminative Network Models of Schizophrenia

14 0.46198893 74 nips-2009-Efficient Bregman Range Search

15 0.46171209 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

16 0.46039921 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies

17 0.45868683 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

18 0.45780879 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

19 0.45436919 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

20 0.45283461 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability