nips nips2005 nips2005-180 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Baback Moghaddam, Yair Weiss, Shai Avidan
Abstract: Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials. 1
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. [sent-6, score-0.247]
2 Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. [sent-8, score-0.265]
3 In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. [sent-9, score-0.522]
4 Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. [sent-10, score-0.356]
5 , factor loadings are linear combinations of all the input variables). [sent-15, score-0.239]
6 Yet sparse representations are generally desirable since they aid human understanding (e. [sent-16, score-0.118]
7 The earliest attempts at ”sparsifying” PCA in the statistics literature consisted of simple axis rotations and component thresholding [1] with the underlying goal being essentially that of subset selection, often based on the identification of principal variables [8]. [sent-20, score-0.158]
8 [3] relaxed the “hard” cardinality constraint and solved for a convex approximation using semi-definite programming (SDP). [sent-25, score-0.201]
9 Their “direct” formulation for sparse PCA (called DSPCA) has yielded promising results that are comparable to (if not better than) Zou et al. [sent-26, score-0.153]
10 We pursued an alternative approach using a spectral formulation based on the variational principle of the Courant-Fischer “Min-Max” theorem for solving maximal eigenvalue problems in dimensionality-constrained subspaces. [sent-28, score-0.27]
11 More importantly, it points the way towards exact and provably optimal solutions using branch-and-bound search [9]. [sent-32, score-0.2]
12 Assuming we can solve for the optimal vector x̂, subsequent sparse factors can be obtained using recursive deflation of A, as in standard numerical routines. [sent-38, score-0.216]
13 Indeed, one of the contributions of this paper is in providing a sound theoretical basis for selecting k, thus clarifying the “art” of crafting sparse PCA factors. [sent-43, score-0.118]
14 Note that without the cardinality constraint, the quadratic form in Eq. [sent-44, score-0.264]
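The displayed optimization problem (Eq. (1)) referenced here did not survive extraction. Based on the surrounding description (a unit-norm vector x with at most k non-zero entries maximizing the quadratic form), the formulation it presumably refers to is the standard one:

```latex
\max_{x \in \mathbb{R}^n} \; x^{T} A\, x
\quad \text{subject to} \quad x^{T} x = 1, \quad \mathrm{card}(x) \le k   % Eq. (1)
```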
15 Therefore, the optimal objective value (variance) is simply the maximum eigenvalue λn(A), attained by the principal eigenvector x̂ = un — Note: throughout the paper the rank of all (λi, ui) is in increasing order of magnitude, hence λmin = λ1 and λmax = λn. [sent-46, score-0.316]
16 With the (nonlinear) cardinality constraint however, the optimal objective value is strictly less than λmax (A) for k < n and the principal eigenvectors are no longer instrumental in the solution. [sent-47, score-0.401]
17 1 Optimality Conditions First, let us consider what conditions must be true if the oracle revealed the optimal solution to us: a unit-norm vector x with cardinality k yielding the maximum objective value v ∗ . [sent-50, score-0.325]
18 Since this subproblem’s maximum variance is λmax(Ak), this must be the optimal objective v∗. [sent-53, score-0.192]
19 The optimal value v ∗ of the sparse PCA optimization problem in Eq. [sent-58, score-0.215]
20 (1) is equal to λmax(A∗k), where A∗k is the k × k principal submatrix of A with the largest maximal eigenvalue. [sent-59, score-0.241]
21 In particular, the non-zero elements of the optimal sparse factor x̂ are exactly equal to the elements of u∗k, the principal eigenvector of A∗k. [sent-60, score-0.408]
22 This underscores the inherent combinatorial nature of sparse PCA and the equivalent class of cardinality-constrained optimization problems. [sent-61, score-0.188]
23 Still, exhaustive search is a viable method for small n which guarantees optimality for “toy problems” and small real-world datasets, thus calibrating the quality of approximations (via the optimality gap). [sent-63, score-0.304]
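As a concrete illustration of Proposition 1 and of exhaustive search for small n, here is a minimal Python/NumPy sketch of our own (not the authors' toolbox code; the function name is illustrative) that enumerates every k × k principal submatrix and keeps the one with the largest maximal eigenvalue:

```python
import numpy as np
from itertools import combinations

def exhaustive_sparse_pca(A, k):
    """Brute-force sparse PCA (Proposition 1): scan every k-subset of
    variables, keep the principal submatrix with the largest maximal
    eigenvalue.  Feasible only for small n."""
    n = A.shape[0]
    best_val, best_x = -np.inf, None
    for idx in combinations(range(n), k):
        sub = A[np.ix_(idx, idx)]            # k-by-k principal submatrix A_k
        vals, vecs = np.linalg.eigh(sub)     # eigenvalues in ascending order
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(n)
            best_x[list(idx)] = vecs[:, -1]  # principal eigenvector of the best A_k
    return best_val, best_x                  # optimal variance v*(k), sparse factor
```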
24 2 Variational Renormalization Proposition 1 immediately suggests a rather simple but (as it turns out) quite effective computational “fix” for improving candidate sparse PC factors obtained by any continuous algorithm (e. [sent-65, score-0.254]
25 Let x̃ be a unit-norm candidate factor with cardinality k as found by any (approximation) technique. [sent-69, score-0.231]
26 Let z̃ be the non-zero subvector of x̃ and uk be the principal (maximum) eigenvector of the submatrix Ak defined by the same non-zero indices of x̃. [sent-70, score-0.254]
27 This variational renormalization suggests (somewhat ironically) that given a continuous (approximate) solution, it is almost certainly better to discard the loadings and keep only the sparsity pattern with which to solve the smaller unconstrained subproblem for the indicated submatrix Ak. [sent-73, score-0.768]
28 This simple procedure (or “fix” as referred to herein) can never decrease the variance and will surely improve any continuous algorithm’s performance. [sent-74, score-0.168]
29 , setting the n − k smallest absolute value loadings of un (A) to zero and then normalizing to unit-norm — is therefore not recommended for sparse PCA. [sent-77, score-0.355]
30 Indeed, most of the sparse PCA factors published in the literature can be readily improved (almost by inspection) with the proper renormalization, and at the mere cost of a single k-by-k eigen-decomposition. [sent-82, score-0.182]
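A minimal sketch of the variational renormalization “fix” in our own Python/NumPy (the function name is ours; it assumes a symmetric covariance A and a candidate factor produced by any continuous method): keep only the sparsity pattern, then replace the loadings with the principal eigenvector of the induced submatrix.

```python
import numpy as np

def variational_renormalization(A, x_tilde, tol=1e-12):
    """Keep only the sparsity pattern of the candidate factor x_tilde,
    discard its loadings, and re-solve the small unconstrained subproblem
    on the induced submatrix A_k.  The result never captures less variance
    than x_tilde, at the cost of one k-by-k eigen-decomposition."""
    idx = np.flatnonzero(np.abs(np.asarray(x_tilde)) > tol)  # non-zero support
    sub = A[np.ix_(idx, idx)]                 # induced principal submatrix A_k
    vals, vecs = np.linalg.eigh(sub)
    x = np.zeros(A.shape[0])
    x[idx] = vecs[:, -1]                      # principal eigenvector of A_k
    return x, vals[-1]                        # renormalized factor and its variance
```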
31 Furthermore, the spectrum of A’s principal submatrices was shown to play a key role in defining the optimal solution. [sent-86, score-0.202]
32 Not surprisingly, the two eigenvalue spectra are related by an inequality known as the Inclusion Principle. [sent-87, score-0.108]
33 Let A be a symmetric n × n matrix with spectrum λ i (A) and let Ak be any k × k principal submatrix of A for 1 ≤ k ≤ n with eigenvalues λi (Ak ). [sent-89, score-0.275]
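The displayed inequality (Eq. (2)) was dropped by the extraction; the standard statement of the inclusion principle, consistent with the interlacing special case quoted below, is presumably:

```latex
\lambda_{i}(A) \;\le\; \lambda_{i}(A_k) \;\le\; \lambda_{i+n-k}(A),
\qquad i = 1, \dots, k   % Eq. (2)
```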
34 The proof, which we omit, is a rather straightforward consequence of imposing a sparsity pattern of cardinality k as an additional orthogonality constraint in the variational inequality of the Courant-Fischer “Min-Max” theorem (see [13] for example). [sent-91, score-0.407]
35 In other words, the eigenvalues of a symmetric matrix form upper and lower bounds for the eigenvalues of all its principal submatrices. [sent-92, score-0.251]
36 (2) with k = n − 1 leads to the well-known eigenvalue interlacing property of symmetric matrices: λ1(An) ≤ λ1(An−1) ≤ λ2(An) ≤ λ2(An−1) ≤ … ≤ λn−1(An−1) ≤ λn(An). [sent-94, score-0.141]
37 Note that for positive-definite symmetric matrices (covariances), augmenting Am to Am+1 (adding a new variable) will always expand the spectral range: reducing λmin and increasing λmax . [sent-98, score-0.098]
38 Thus for eigenvalue maximization, the inequality constraint card(x) ≤ k in Eq. [sent-99, score-0.108]
39 Therefore, the maximum variance is achieved at the preset upper limit k of cardinality. [sent-101, score-0.104]
40 Moreover, the function v∗(k), the optimal variance for a given cardinality, is monotone increasing with range [σ²max(A), λmax(A)], where σ²max is the largest diagonal element (variance) in A. [sent-102, score-0.164]
41 Hence, a concise and informative way to quantify the performance of an algorithm is to plot its variance curve v (k) and compare it with the optimal v ∗ (k). [sent-103, score-0.164]
42 (2), which yields lower and upper bounds for λk(Ak) = λmax(Ak): λk(A) ≤ λmax(Ak) ≤ λmax(A) (4). This shows that the k-th smallest eigenvalue of A is a lower bound for the maximum variance possible with cardinality k. [sent-105, score-0.468]
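A small numerical check of the bounds in Eq. (4), as a hedged Python/NumPy sketch of our own (the brute-force enumeration of v*(k) is feasible only for small n):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 8
B = rng.standard_normal((n, n))
A = B @ B.T                                  # a random covariance matrix
lam = np.linalg.eigvalsh(A)                  # eigenvalues of A, ascending

for k in range(1, n + 1):
    # optimal sparse variance v*(k) by enumeration (small n only)
    v_star = max(np.linalg.eigvalsh(A[np.ix_(idx, idx)])[-1]
                 for idx in combinations(range(n), k))
    # Eq. (4): k-th smallest eigenvalue of A  <=  v*(k)  <=  largest eigenvalue of A
    assert lam[k - 1] - 1e-9 <= v_star <= lam[-1] + 1e-9
    print(f"k={k}: {lam[k-1]:.3f} <= v*(k)={v_star:.3f} <= {lam[-1]:.3f}")
```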
43 , in classical PCA), can also be consulted in sparse PCA to help pick the cardinality required to capture the desired (minimum) variance. [sent-109, score-0.319]
44 Note that if λk(A) is close to λmax(A) then practically any principal submatrix Ak can yield a near-optimal solution. [sent-111, score-0.216]
45 But in branch-and-bound search, any intermediate subproblem Am , with k ≤ m ≤ n, yields a new and tighter bound λmax (Am ) for the objective v ∗ (k). [sent-114, score-0.103]
46 For example, among all m possible (m−1)-by-(m−1) principal submatrices of Am, obtained by deleting the j-th row and column, there is at least one submatrix Am−1 = A\j whose maximal eigenvalue is a major fraction of its parent (e. [sent-119, score-0.377]
47 g., 189 in [4]): ∃ j : λm−1(A\j) ≥ ((m−1)/m) λm(Am) (5). The implication of this inequality for search algorithms is that it is simply not possible for the spectral radius of every submatrix A\j to be arbitrarily small, especially for large m. [sent-122, score-0.257]
48 Hence, with large matrices (or large cardinality) nearly all the variance λn (A) is captured. [sent-123, score-0.14]
49 4 Combinatorial Optimization Given Propositions 1 and 2, the inclusion principle, the interlacing property and especially the monotonic nature of the variance curves v(k), a general class of (binary) integer programming (IP) optimization techniques [9] seems ideally suited for sparse PCA. [sent-125, score-0.378]
50 Indeed, a greedy technique like backward elimination is already suggested by the bound in Eq. [sent-126, score-0.134]
51 However, for small cardinalities k << n, the computational cost of backward search can grow to near maximum complexity ≈ O(n⁴). [sent-131, score-0.119]
52 Forward greedy search has worst-case complexity < O(n³). [sent-133, score-0.134]
53 The best overall strategy for this problem was empirically found to be a bi-directional greedy search: run a forward pass (from 1 to n) plus a second (independent) backward pass (from n to 1) and pick the better solution at each k. [sent-134, score-0.168]
54 We refer to this discrete algorithm as greedy sparse PCA or GSPCA. [sent-136, score-0.219]
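The following is our own Python/NumPy reading of the bi-directional greedy strategy just described, not the authors' GSPCA implementation (helper and function names are illustrative): the forward pass grows the support one variable at a time, the backward pass shrinks it from the full set, and the better support is kept at each cardinality.

```python
import numpy as np

def lam_max(A, idx):
    """Largest eigenvalue of the principal submatrix of A indexed by idx."""
    return np.linalg.eigvalsh(A[np.ix_(idx, idx)])[-1]

def greedy_sparse_pca(A):
    """Bi-directional greedy sketch: a forward pass grows the support one
    variable at a time, a backward pass shrinks it from the full set, and
    the better support is kept at each cardinality k."""
    n = A.shape[0]
    fwd, support = {}, []
    for _ in range(n):                                  # forward pass, k = 1..n
        rest = [j for j in range(n) if j not in support]
        j_add = max(rest, key=lambda j: lam_max(A, support + [j]))
        support = support + [j_add]
        fwd[len(support)] = list(support)
    bwd, support = {n: list(range(n))}, list(range(n))
    while len(support) > 1:                             # backward pass, k = n..1
        j_drop = max(support, key=lambda j: lam_max(A, [i for i in support if i != j]))
        support = [i for i in support if i != j_drop]
        bwd[len(support)] = list(support)
    return {k: max((fwd[k], bwd[k]), key=lambda s: lam_max(A, s))
            for k in range(1, n + 1)}                   # best support per cardinality
```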
55 Despite the expediency of near-optimal greedy search, it is nevertheless worthwhile to invest in optimal solution strategies, especially if the sparse PCA problem is in the application domain of finance or engineering, where even a small optimality gap can accrue substantial losses over time. [sent-137, score-0.417]
56 This exact algorithm (referred to as ESPCA) is guaranteed to terminate with the optimal solution. [sent-143, score-0.107]
57 The solutions found by dual-pass greedy search (GSPCA) were found to be ideal for initializing ESPCA, as their quality was typically quite high. [sent-145, score-0.167]
58 In practice, early termination with set thresholds based on eigenvalue bounds can be used. [sent-150, score-0.127]
59 In general, a cost-effective strategy that we can recommend is to first run GSPCA (or at least the forward pass) and then either settle for its (near-optimal) variance or else use it to initialize ESPCA for finding the optimal solution. [sent-151, score-0.164]
60 A full GSPCA run has the added benefit of giving near-optimal solutions for all cardinalities at once, with run-times that are typically O(10²) faster than a single approximation with a continuous method. [sent-152, score-0.162]
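A schematic branch-and-bound along these lines, as a hedged sketch of our own rather than the ESPCA code (names are illustrative): a node fixes some chosen variables plus a pool of candidates, and the bound λmax over chosen plus candidates (valid by the inclusion principle) is used to prune any node that cannot beat the incumbent.

```python
import numpy as np

def lam_max(A, idx):
    """Largest eigenvalue of the principal submatrix of A indexed by idx."""
    return np.linalg.eigvalsh(A[np.ix_(idx, idx)])[-1]

def branch_and_bound_sparse_pca(A, k, incumbent=-np.inf):
    """Schematic exact search: lam_max over chosen + candidates upper-bounds
    every k-subset completion of a node, so nodes whose bound falls at or
    below the incumbent are pruned."""
    best_val, best_idx = incumbent, None

    def recurse(chosen, candidates):
        nonlocal best_val, best_idx
        if len(chosen) == k:                              # leaf: a full support
            val = lam_max(A, chosen)
            if val > best_val:
                best_val, best_idx = val, list(chosen)
            return
        if len(chosen) + len(candidates) < k:
            return                                        # cannot reach cardinality k
        if lam_max(A, chosen + candidates) <= best_val:
            return                                        # prune: bound below incumbent
        j, rest = candidates[0], candidates[1:]
        recurse(chosen + [j], rest)                       # branch: include variable j
        recurse(chosen, rest)                             # branch: exclude variable j

    recurse([], list(range(A.shape[0])))
    return best_val, best_idx
```

Seeding incumbent with the GSPCA variance for the same cardinality mirrors the initialization recommended above; if no support strictly beats the incumbent, this sketch returns None for the index set.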
61 In particular, we compared our performance against 3 continuous techniques: simple thresholding (ST) [1], SPCA using an “Elastic Net” L1 -regression [14] and DSPCA using semidefinite programming [3]. [sent-155, score-0.11]
62 The first 6 ordinary PCs capture 87% of the total variance, so following the methodology in [3], we compared the explanatory power of our exact method (ESPCA) using 6 sparse PCs. [sent-157, score-0.165]
63 8% of the variance with a cardinality pattern of 744111 (the k’s for the 6 PCs) thus totaling 18 non-zero loadings [14] whereas DSPCA captures 77. [sent-160, score-0.583]
64 3% with a sparser cardinality pattern 623111 totaling 14 non-zero loadings [3]. [sent-161, score-0.477]
65 , more than SPCA with 18 loadings and slightly less than DSPCA with 14 loadings. [sent-165, score-0.209]
66 Using the evaluation protocol in [3], we compared the cumulative variance and cumulative cardinality with the published results of SPCA and DSPCA in Figure 1. [sent-166, score-0.429]
67 Our goal was to match the explained variance but do so with a sparser representation. [sent-167, score-0.143]
68 The ESPCA loadings in Table 1 are optimal under the definition given in Section 2. [sent-168, score-0.269]
69 Table 1: Loadings for first 3 sparse PCs of the Pit Props data. [sent-207, score-0.118]
70 Original SPCA and DSPCA loadings taken from [14, 3]. [sent-209, score-0.209]
71 Figure 1: Pit Props: (a) cumulative variance and (b) cumulative cardinality for first 6 sparse PCs. [sent-219, score-0.547]
72 The factor loadings for the first 3 sparse PCs are shown in Table 1. [sent-224, score-0.357]
73 To specifically demonstrate the benefits of the variational renormalization of Section 2. [sent-226, score-0.31]
74 2, consider SPCA’s first sparse factor in Table 1 (the 1st row of SPCA block) found by iterative (L1 -penalized) optimization and unit-norm scaling. [sent-227, score-0.185]
75 It captures 28% of the total data variance, but after the variational renormalization the variance increases to 29%. [sent-228, score-0.455]
76 Similarly, the first sparse factor of DSPCA in Table 1 (1st row of DSPCA block) captures 26. [sent-229, score-0.189]
77 6% of the total variance, whereas after variational renormalization it captures 29% — a gain of 2. [sent-230, score-0.351]
78 We now give a representative summary of our extensive Monte Carlo (MC) evaluation of GSPCA and the 3 continuous algorithms. [sent-233, score-0.122]
79 Every MC run consisted of 50,000 covariance matrices and the (normalized) variance curves v (k). [sent-235, score-0.168]
80 In our MC evaluations, all continuous methods (ST, SPCA and DSPCA) had variational renormalization post-processing (applied to their “declared” solution). [sent-242, score-0.374]
81 Note that comparing GSPCA with the raw output of these algorithms would be rather pointless, since [sent-243, score-0.104]
82 Figure 2: (a) Typical variance curve v(k) for a continuous algorithm without post-processing (original: dash green) and with variational renormalization (+ Fix: solid green). [sent-246, score-0.622]
83 (b) Monte Carlo study: log-likelihood of optimality ratio at max-complexity (k = 8, n = 16) for ST (blue []), DSPCA (green ), SPCA (magenta ) and GSPCA (red ◦). [sent-251, score-0.144]
85 Figure 3: Monte Carlo summary statistics: (a) means of the distributions of optimality ratio (in Figure 2(b)) for all k and (b) estimated probability of finding the optimal solution for each cardinality. [sent-270, score-0.43]
86 without the “fix” their variance curves are markedly diminished, as in Figure 2(a). [sent-271, score-0.104]
87 Figure 2(b) shows the histogram of the optimality ratio — i. [sent-272, score-0.144]
88 , ratio of the captured to optimal variance — shown here at “half-sparsity” (k = 8, n = 16) from a typical MC run of 50,000 different covariance matrices. [sent-274, score-0.224]
89 Figure 3(a) shows the corresponding mean values of the optimality ratio for all k. [sent-276, score-0.144]
90 Among continuous algorithms, the SDP-based DSPCA was generally more effective (almost comparable to GSPCA). [sent-277, score-0.098]
91 Finally, we note that even simple thresholding (ST), once enhanced with the variational renormalization, performs quite adequately despite its simplicity, as it captures at least 92% of the optimal variance, as seen in Figure 3(a). [sent-281, score-0.245]
92 Naturally, without the variational “fix” (not shown) continuous algorithms rarely ever found the optimal solution. [sent-286, score-0.222]
93 Surprisingly, simple thresholding of the principal eigenvector (ST) was shown to be rather effective, especially given the perceived “straw-man” it was considered to be. [sent-288, score-0.225]
94 However, beyond such special cases, continuous methods cannot ultimately be competitive with discrete algorithms without the variational renormalization “fix” in Section 2. [sent-291, score-0.401]
95 We should note that the somewhat remarkable effectiveness of GSPCA is not entirely unexpected and is supported by empirical observations in the combinatorial optimization literature: that greedy search with (sub)modular cost functions having the monotonicity property (e. [sent-293, score-0.204]
96 , the variance curves ṽ(k)) is known to produce good results [9]. [sent-295, score-0.104]
97 In terms of quality of solutions, GSPCA consistently out-performed continuous algorithms, with runtimes that were typically O(10²) faster than LARS-based SPCA and roughly O(10³) faster than SDP-based DSPCA (Matlab CPU times averaged over all k). [sent-296, score-0.124]
98 Nevertheless, we view discrete algorithms as complementary tools, especially since the leading continuous algorithms have distinct advantages. [sent-297, score-0.12]
99 Loadings and correlations in the interpretation of principal components. [sent-311, score-0.112]
100 Two case studies in the application of principal components. [sent-340, score-0.112]
wordName wordTfidf (topN-words)
[('dspca', 0.453), ('spca', 0.418), ('gspca', 0.296), ('espca', 0.227), ('renormalization', 0.212), ('loadings', 0.209), ('cardinality', 0.201), ('ak', 0.144), ('pit', 0.122), ('props', 0.122), ('pca', 0.122), ('pcs', 0.121), ('sparse', 0.118), ('principal', 0.112), ('optimality', 0.108), ('submatrix', 0.104), ('variance', 0.104), ('variational', 0.098), ('eigenvalue', 0.078), ('zou', 0.076), ('greedy', 0.074), ('continuous', 0.064), ('st', 0.063), ('max', 0.063), ('cumulative', 0.062), ('mc', 0.061), ('elastic', 0.061), ('optimal', 0.06), ('search', 0.06), ('inclusion', 0.055), ('toolbox', 0.052), ('bounds', 0.049), ('exact', 0.047), ('thresholding', 0.046), ('card', 0.045), ('aspremont', 0.045), ('matlab', 0.044), ('sparsity', 0.042), ('ko', 0.042), ('sdp', 0.042), ('carlo', 0.041), ('monte', 0.041), ('sparseness', 0.041), ('captures', 0.041), ('subproblem', 0.039), ('sparser', 0.039), ('eigenvector', 0.038), ('factors', 0.038), ('optimization', 0.037), ('oracle', 0.036), ('orthogonality', 0.036), ('matrices', 0.036), ('ratio', 0.036), ('bound', 0.036), ('formulation', 0.035), ('pass', 0.035), ('net', 0.035), ('avidan', 0.035), ('baback', 0.035), ('cardinalities', 0.035), ('dtu', 0.035), ('interlacing', 0.035), ('jolliffe', 0.035), ('sjostrand', 0.035), ('spectral', 0.034), ('effective', 0.034), ('benchmark', 0.033), ('solutions', 0.033), ('extensive', 0.033), ('combinatorial', 0.033), ('eigenvalues', 0.031), ('factor', 0.03), ('inequality', 0.03), ('submatrices', 0.03), ('merl', 0.03), ('queue', 0.03), ('faster', 0.03), ('especially', 0.029), ('proposition', 0.029), ('covariance', 0.028), ('nevertheless', 0.028), ('symmetric', 0.028), ('objective', 0.028), ('recommended', 0.028), ('totaling', 0.028), ('viable', 0.028), ('deleting', 0.028), ('lars', 0.028), ('discrete', 0.027), ('green', 0.026), ('mere', 0.026), ('england', 0.026), ('elements', 0.025), ('summary', 0.025), ('maximal', 0.025), ('backward', 0.024), ('magenta', 0.024), ('covariances', 0.024), ('revealing', 0.024), ('sub', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
Author: Baback Moghaddam, Yair Weiss, Shai Avidan
Abstract: Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials. 1
2 0.096959502 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
3 0.083395131 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
4 0.078471154 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
Author: John D. Lafferty, Pradeep K. Ravikumar
Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1
5 0.068994261 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning
Author: Jorge Silva, Jorge Marques, João Lemos
Abstract: There has been a surge of interest in learning non-linear manifold models to approximate high-dimensional data. Both for computational complexity reasons and for generalization capability, sparsity is a desired feature in such models. This usually means dimensionality reduction, which naturally implies estimating the intrinsic dimension, but it can also mean selecting a subset of the data to use as landmarks, which is especially important because many existing algorithms have quadratic complexity in the number of observations. This paper presents an algorithm for selecting landmarks, based on LASSO regression, which is well known to favor sparse approximations because it uses regularization with an l1 norm. As an added benefit, a continuous manifold parameterization, based on the landmarks, is also found. Experimental results with synthetic and real data illustrate the algorithm. 1
6 0.058213174 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
7 0.058096316 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
8 0.057295181 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
9 0.055858575 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
10 0.052711435 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
11 0.051575374 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
12 0.051546253 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
13 0.050339613 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
14 0.049341366 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
15 0.047099967 50 nips-2005-Convex Neural Networks
16 0.047032841 126 nips-2005-Metric Learning by Collapsing Classes
17 0.04530368 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
18 0.045145813 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
19 0.044250216 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?
20 0.044024177 75 nips-2005-Fixing two weaknesses of the Spectral Method
topicId topicWeight
[(0, 0.16), (1, 0.057), (2, -0.003), (3, -0.015), (4, 0.021), (5, -0.065), (6, -0.095), (7, -0.08), (8, -0.029), (9, -0.037), (10, 0.011), (11, -0.007), (12, -0.002), (13, -0.002), (14, -0.028), (15, 0.001), (16, -0.095), (17, -0.029), (18, -0.123), (19, -0.07), (20, 0.106), (21, -0.053), (22, -0.015), (23, 0.019), (24, 0.042), (25, -0.054), (26, -0.039), (27, 0.034), (28, 0.086), (29, 0.005), (30, 0.076), (31, 0.02), (32, -0.028), (33, -0.006), (34, 0.012), (35, 0.038), (36, -0.032), (37, 0.043), (38, -0.078), (39, -0.091), (40, -0.095), (41, 0.057), (42, 0.008), (43, 0.145), (44, 0.008), (45, 0.116), (46, 0.029), (47, -0.101), (48, 0.101), (49, -0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.93202001 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
Author: Baback Moghaddam, Yair Weiss, Shai Avidan
Abstract: Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials. 1
2 0.57554853 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
3 0.54377854 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
4 0.53742129 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
Author: Bhaskar D. Rao, David P. Wipf
Abstract: Given a redundant dictionary of basis vectors (or atoms), our goal is to find maximally sparse representations of signals. Previously, we have argued that a sparse Bayesian learning (SBL) framework is particularly well-suited for this task, showing that it has far fewer local minima than other Bayesian-inspired strategies. In this paper, we provide further evidence for this claim by proving a restricted equivalence condition, based on the distribution of the nonzero generating model weights, whereby the SBL solution will equal the maximally sparse representation. We also prove that if these nonzero weights are drawn from an approximate Jeffreys prior, then with probability approaching one, our equivalence condition is satisfied. Finally, we motivate the worst-case scenario for SBL and demonstrate that it is still better than the most widely used sparse representation algorithms. These include Basis Pursuit (BP), which is based on a convex relaxation of the ℓ0 (quasi)-norm, and Orthogonal Matching Pursuit (OMP), a simple greedy strategy that iteratively selects basis vectors most aligned with the current residual. 1
5 0.47314653 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
Author: Manfred Opper
Abstract: The problem of computing a resample estimate for the reconstruction error in PCA is reformulated as an inference problem with the help of the replica method. Using the expectation consistent (EC) approximation, the intractable inference problem can be solved efficiently using only two variational parameters. A perturbative correction to the result is computed and an alternative simplified derivation is also presented. 1
6 0.47064877 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
7 0.46830484 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning
8 0.41622245 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
9 0.40003318 126 nips-2005-Metric Learning by Collapsing Classes
10 0.38338408 75 nips-2005-Fixing two weaknesses of the Spectral Method
11 0.38321954 62 nips-2005-Efficient Estimation of OOMs
12 0.3790569 146 nips-2005-On the Accuracy of Bounded Rationality: How Far from Optimal Is Fast and Frugal?
13 0.37515625 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
14 0.36608672 52 nips-2005-Correlated Topic Models
15 0.35801697 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
16 0.35379902 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
17 0.34195617 71 nips-2005-Fast Krylov Methods for N-Body Learning
18 0.33887661 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
19 0.33303288 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
20 0.32264858 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
topicId topicWeight
[(3, 0.046), (10, 0.03), (27, 0.022), (31, 0.024), (34, 0.086), (39, 0.01), (41, 0.01), (55, 0.031), (69, 0.488), (73, 0.035), (77, 0.014), (88, 0.061), (91, 0.035)]
simIndex simValue paperId paperTitle
1 0.98834318 40 nips-2005-CMOL CrossNets: Possible Neuromorphic Nanoelectronic Circuits
Author: Jung Hoon Lee, Xiaolong Ma, Konstantin K. Likharev
Abstract: Hybrid “CMOL” integrated circuits, combining CMOS subsystem with nanowire crossbars and simple two-terminal nanodevices, promise to extend the exponential Moore-Law development of microelectronics into the sub-10-nm range. We are developing neuromorphic network (“CrossNet”) architectures for this future technology, in which neural cell bodies are implemented in CMOS, nanowires are used as axons and dendrites, while nanodevices (bistable latching switches) are used as elementary synapses. We have shown how CrossNets may be trained to perform pattern recovery and classification despite the limitations imposed by the CMOL hardware. Preliminary estimates have shown that CMOL CrossNets may be extremely dense (~10 7 cells per cm2) and operate approximately a million times faster than biological neural networks, at manageable power consumption. In Conclusion, we discuss in brief possible short-term and long-term applications of the emerging technology. 1 Introduction: CMOL Circuits Recent results [1, 2] indicate that the current VLSI paradigm based on CMOS technology can be hardly extended beyond the 10-nm frontier: in this range the sensitivity of parameters (most importantly, the gate voltage threshold) of silicon field-effect transistors to inevitable fabrication spreads grows exponentially. This sensitivity will probably send the fabrication facilities costs skyrocketing, and may lead to the end of Moore’s Law some time during the next decade. There is a growing consensus that the impending Moore’s Law crisis may be preempted by a radical paradigm shift from the purely CMOS technology to hybrid CMOS/nanodevice circuits, e.g., those of “CMOL” variety (Fig. 1). Such circuits (see, e.g., Ref. 3 for their recent review) would combine a level of advanced CMOS devices fabricated by the lithographic patterning, and two-layer nanowire crossbar formed, e.g., by nanoimprint, with nanowires connected by simple, similar, two-terminal nanodevices at each crosspoint. For such devices, molecular single-electron latching switches [4] are presently the leading candidates, in particular because they may be fabricated using the self-assembled monolayer (SAM) technique which already gave reproducible results for simpler molecular devices [5]. (a) nanodevices nanowiring and nanodevices interface pins upper wiring level of CMOS stack (b) βFCMOS Fnano α Fig. 1. CMOL circuit: (a) schematic side view, and (b) top-view zoom-in on several adjacent interface pins. (For clarity, only two adjacent nanodevices are shown.) In order to overcome the CMOS/nanodevice interface problems pertinent to earlier proposals of hybrid circuits [6], in CMOL the interface is provided by pins that are distributed all over the circuit area, on the top of the CMOS stack. This allows to use advanced techniques of nanowire patterning (like nanoimprint) which do not have nanoscale accuracy of layer alignment [3]. The vital feature of this interface is the tilt, by angle α = arcsin(Fnano/βFCMOS), of the nanowire crossbar relative to the square arrays of interface pins (Fig. 1b). Here Fnano is the nanowiring half-pitch, FCMOS is the half-pitch of the CMOS subsystem, and β is a dimensionless factor larger than 1 that depends on the CMOS cell complexity. Figure 1b shows that this tilt allows the CMOS subsystem to address each nanodevice even if Fnano << βFCMOS. 
By now, it has been shown that CMOL circuits can combine high performance with high defect tolerance (which is necessary for any circuit using nanodevices) for several digital applications. In particular, CMOL circuits with defect rates below a few percent would enable terabit-scale memories [7], while the performance of FPGA-like CMOL circuits may be several hundred times above that of overcome purely CMOL FPGA (implemented with the same FCMOS), at acceptable power dissipation and defect tolerance above 20% [8]. In addition, the very structure of CMOL circuits makes them uniquely suitable for the implementation of more complex, mixed-signal information processing systems, including ultradense and ultrafast neuromorphic networks. The objective of this paper is to describe in brief the current status of our work on the development of so-called Distributed Crossbar Networks (“CrossNets”) that could provide high performance despite the limitations imposed by CMOL hardware. A more detailed description of our earlier results may be found in Ref. 9. 2 Synapses The central device of CrossNet is a two-terminal latching switch [3, 4] (Fig. 2a) which is a combination of two single-electron devices, a transistor and a trap [3]. The device may be naturally implemented as a single organic molecule (Fig. 2b). Qualitatively, the device operates as follows: if voltage V = Vj – Vk applied between the external electrodes (in CMOL, nanowires) is low, the trap island has no net electric charge, and the single-electron transistor is closed. If voltage V approaches certain threshold value V+ > 0, an additional electron is inserted into the trap island, and its field lifts the Coulomb blockade of the single-electron transistor, thus connecting the nanowires. The switch state may be reset (e.g., wires disconnected) by applying a lower voltage V < V- < V+. Due to the random character of single-electron tunneling [2], the quantitative description of the switch is by necessity probabilistic: actually, V determines only the rates Γ↑↓ of device switching between its ON and OFF states. The rates, in turn, determine the dynamics of probability p to have the transistor opened (i.e. wires connected): dp/dt = Γ↑(1 - p) - Γ↓p. (1) The theory of single-electron tunneling [2] shows that, in a good approximation, the rates may be presented as Γ↑↓ = Γ0 exp{±e(V - S)/kBT} , (2) (a) single-electron trap tunnel junction Vj Vk single-electron transistor (b) O clipping group O N C R diimide acceptor groups O O C N R R O OPE wires O N R R N O O R O N R R = hexyl N O O R R O N C R R R Fig. 2. (a) Schematics and (b) possible molecular implementation of the two-terminal single-electron latching switch where Γ0 and S are constants depending on physical parameters of the latching switches. Note that despite the random character of switching, the strong nonlinearity of Eq. (2) allows to limit the degree of the device “fuzziness”. 3 CrossNets Figure 3a shows the generic structure of a CrossNet. CMOS-implemented somatic cells (within the Fire Rate model, just nonlinear differential amplifiers, see Fig. 3b,c) apply their output voltages to “axonic” nanowires. If the latching switch, working as an elementary synapse, on the crosspoint of an axonic wire with the perpendicular “dendritic” wire is open, some current flows into the latter wire, charging it. 
Since such currents are injected into each dendritic wire through several (many) open synapses, their addition provides a natural passive analog summation of signals from the corresponding somas, typical for all neural networks. Examining Fig. 3a, please note the open-circuit terminations of axonic and dendritic lines at the borders of the somatic cells; due to these terminations the somas do not communicate directly (but only via synapses). The network shown on Fig. 3 is evidently feedforward; recurrent networks are achieved in the evident way by doubling the number of synapses and nanowires per somatic cell (Fig. 3c). Moreover, using dual-rail (bipolar) representation of the signal, and hence doubling the number of nanowires and elementary synapses once again, one gets a CrossNet with somas coupled by compact 4-switch groups [9]. Using Eqs. (1) and (2), it is straightforward to show that that the average synaptic weight wjk of the group obeys the “quasi-Hebbian” rule: d w jk = −4Γ0 sinh (γ S ) sinh (γ V j ) sinh (γ Vk ) . dt (3) (a) - +soma j (b) RL + -- jk+ RL (c) jk- RL + -- -+soma k RL Fig. 3. (a) Generic structure of the simplest, (feedforward, non-Hebbian) CrossNet. Red lines show “axonic”, and blue lines “dendritic” nanowires. Gray squares are interfaces between nanowires and CMOS-based somas (b, c). Signs show the dendrite input polarities. Green circles denote molecular latching switches forming elementary synapses. Bold red and blue points are open-circuit terminations of the nanowires, that do not allow somas to interact in bypass of synapses In the simplest cases (e.g., quasi-Hopfield networks with finite connectivity), the tri-level synaptic weights of the generic CrossNets are quite satisfactory, leading to just a very modest (~30%) network capacity loss. However, some applications (in particular, pattern classification) may require a larger number of weight quantization levels L (e.g., L ≈ 30 for a 1% fidelity [9]). This may be achieved by using compact square arrays (e.g., 4×4) of latching switches (Fig. 4). Various species of CrossNets [9] differ also by the way the somatic cells are distributed around the synaptic field. Figure 5 shows feedforward versions of two CrossNet types most explored so far: the so-called FlossBar and InBar. The former network is more natural for the implementation of multilayered perceptrons (MLP), while the latter system is preferable for recurrent network implementations and also allows a simpler CMOS design of somatic cells. The most important advantage of CrossNets over the hardware neural networks suggested earlier is that these networks allow to achieve enormous density combined with large cell connectivity M >> 1 in quasi-2D electronic circuits. 4 CrossNet training CrossNet training faces several hardware-imposed challenges: (i) The synaptic weight contribution provided by the elementary latching switch is binary, so that for most applications the multi-switch synapses (Fig. 4) are necessary. (ii) The only way to adjust any particular synaptic weight is to turn ON or OFF the corresponding latching switch(es). This is only possible to do by applying certain voltage V = Vj – Vk between the two corresponding nanowires. At this procedure, other nanodevices attached to the same wires should not be disturbed. (iii) As stated above, synapse state switching is a statistical progress, so that the degree of its “fuzziness” should be carefully controlled. 
(a) Vj (b) V w – A/2 i=1 i=1 2 2 … … n n Vj V w+ A/2 i' = 1 RL 2 … i' = 1 n RS ±(V t –A/2) 2 … RS n ±(V t +A/2) Fig. 4. Composite synapse for providing L = 2n2+1 discrete levels of the weight in (a) operation and (b) weight adjustment modes. The dark-gray rectangles are resistive metallic strips at soma/nanowire interfaces (a) (b) Fig. 5. Two main CrossNet species: (a) FlossBar and (b) InBar, in the generic (feedforward, non-Hebbian, ternary-weight) case for the connectivity parameter M = 9. Only the nanowires and nanodevices coupling one cell (indicated with red dashed lines) to M post-synaptic cells (blue dashed lines) are shown; actually all the cells are similarly coupled We have shown that these challenges may be met using (at least) the following training methods [9]: (i) Synaptic weight import. This procedure is started with training of a homomorphic “precursor” artificial neural network with continuous synaptic weighs wjk, implemented in software, using one of established methods (e.g., error backpropagation). Then the synaptic weights wjk are transferred to the CrossNet, with some “clipping” (rounding) due to the binary nature of elementary synaptic weights. To accomplish the transfer, pairs of somatic cells are sequentially selected via CMOS-level wiring. Using the flexibility of CMOS circuitry, these cells are reconfigured to apply external voltages ±VW to the axonic and dendritic nanowires leading to a particular synapse, while all other nanowires are grounded. The voltage level V W is selected so that it does not switch the synapses attached to only one of the selected nanowires, while voltage 2VW applied to the synapse at the crosspoint of the selected wires is sufficient for its reliable switching. (In the composite synapses with quasi-continuous weights (Fig. 4), only a part of the corresponding switches is turned ON or OFF.) (ii) Error backpropagation. The synaptic weight import procedure is straightforward when wjk may be simply calculated, e.g., for the Hopfield-type networks. However, for very large CrossNets used, e.g., as pattern classifiers the precursor network training may take an impracticably long time. In this case the direct training of a CrossNet may become necessary. We have developed two methods of such training, both based on “Hebbian” synapses consisting of 4 elementary synapses (latching switches) whose average weight dynamics obeys Eq. (3). This quasi-Hebbian rule may be used to implement the backpropagation algorithm either using a periodic time-multiplexing [9] or in a continuous fashion, using the simultaneous propagation of signals and errors along the same dual-rail channels. As a result, presently we may state that CrossNets may be taught to perform virtually all major functions demonstrated earlier with the usual neural networks, including the corrupted pattern restoration in the recurrent quasi-Hopfield mode and pattern classification in the feedforward MLP mode [11]. 5 C r o s s N e t p e r f o r m an c e e s t i m a t e s The significance of this result may be only appreciated in the context of unparalleled physical parameters of CMOL CrossNets. The only fundamental limitation on the half-pitch Fnano (Fig. 1) comes from quantum-mechanical tunneling between nanowires. If the wires are separated by vacuum, the corresponding specific leakage conductance becomes uncomfortably large (~10-12 Ω-1m-1) only at Fnano = 1.5 nm; however, since realistic insulation materials (SiO2, etc.) 
provide somewhat lower tunnel barriers, let us use a more conservative value Fnano= 3 nm. Note that this value corresponds to 1012 elementary synapses per cm2, so that for 4M = 104 and n = 4 the areal density of neural cells is close to 2×107 cm-2. Both numbers are higher than those for the human cerebral cortex, despite the fact that the quasi-2D CMOL circuits have to compete with quasi-3D cerebral cortex. With the typical specific capacitance of 3×10-10 F/m = 0.3 aF/nm, this gives nanowire capacitance C0 ≈ 1 aF per working elementary synapse, because the corresponding segment has length 4Fnano. The CrossNet operation speed is determined mostly by the time constant τ0 of dendrite nanowire capacitance recharging through resistances of open nanodevices. Since both the relevant conductance and capacitance increase similarly with M and n, τ0 ≈ R0C0. The possibilities of reduction of R0, and hence τ0, are limited mostly by acceptable power dissipation per unit area, that is close to Vs2/(2Fnano)2R0. For room-temperature operation, the voltage scale V0 ≈ Vt should be of the order of at least 30 kBT/e ≈ 1 V to avoid thermally-induced errors [9]. With our number for Fnano, and a relatively high but acceptable power consumption of 100 W/cm2, we get R0 ≈ 1010Ω (which is a very realistic value for single-molecule single-electron devices like one shown in Fig. 3). With this number, τ0 is as small as ~10 ns. This means that the CrossNet speed may be approximately six orders of magnitude (!) higher than that of the biological neural networks. Even scaling R0 up by a factor of 100 to bring power consumption to a more comfortable level of 1 W/cm2, would still leave us at least a four-orders-of-magnitude speed advantage. 6 D i s c u s s i on: P o s s i bl e a p p l i c at i o n s These estimates make us believe that that CMOL CrossNet chips may revolutionize the neuromorphic network applications. Let us start with the example of relatively small (1-cm2-scale) chips used for recognition of a face in a crowd [11]. The most difficult feature of such recognition is the search for face location, i.e. optimal placement of a face on the image relative to the panel providing input for the processing network. The enormous density and speed of CMOL hardware gives a possibility to time-and-space multiplex this task (Fig. 6). In this approach, the full image (say, formed by CMOS photodetectors on the same chip) is divided into P rectangular panels of h×w pixels, corresponding to the expected size and approximate shape of a single face. A CMOS-implemented communication channel passes input data from each panel to the corresponding CMOL neural network, providing its shift in time, say using the TV scanning pattern (red line in Fig. 6). The standard methods of image classification require the network to have just a few hidden layers, so that the time interval Δt necessary for each mapping position may be so short that the total pattern recognition time T = hwΔt may be acceptable even for online face recognition. w h image network input Fig. 6. Scan mapping of the input image on CMOL CrossNet inputs. Red lines show the possible time sequence of image pixels sent to a certain input of the network processing image from the upper-left panel of the pattern Indeed, let us consider a 4-Megapixel image partitioned into 4K 32×32-pixel panels (h = w = 32). This panel will require an MLP net with several (say, four) layers with 1K cells each in order to compare the panel image with ~10 3 stored faces. 
With the feasible 4-nm nanowire half-pitch, and 65-level synapses (sufficient for better than 99% fidelity [9]), each interlayer crossbar would require chip area about (4K×64 nm)2 = 64×64 μm2, fitting 4×4K of them on a ~0.6 cm2 chip. (The CMOS somatic-layer and communication-system overheads are negligible.) With the acceptable power consumption of the order of 10 W/cm2, the input-to-output signal propagation in such a network will take only about 50 ns, so that Δt may be of the order of 100 ns and the total time T = hwΔt of processing one frame of the order of 100 microseconds, much shorter than the typical TV frame time of ~10 milliseconds. The remaining two-orders-of-magnitude time gap may be used, for example, for double-checking the results via stopping the scan mapping (Fig. 6) at the most promising position. (For this, a simple feedback from the recognition output to the mapping communication system is necessary.) It is instructive to compare the estimated CMOL chip speed with that of the implementation of a similar parallel network ensemble on a CMOS signal processor (say, also combined on the same chip with an array of CMOS photodetectors). Even assuming an extremely high performance of 30 billion additions/multiplications per second, we would need ~4×4K×1K×(4K)2/(30×109) ≈ 104 seconds ~ 3 hours per frame, evidently incompatible with the online image stream processing. Let us finish with a brief (and much more speculative) discussion of possible long-term prospects of CMOL CrossNets. Eventually, large-scale (~30×30 cm2) CMOL circuits may become available. According to the estimates given in the previous section, the integration scale of such a system (in terms of both neural cells and synapses) will be comparable with that of the human cerebral cortex. Equipped with a set of broadband sensor/actuator interfaces, such (necessarily, hierarchical) system may be capable, after a period of initial supervised training, of further self-training in the process of interaction with environment, with the speed several orders of magnitude higher than that of its biological prototypes. Needless to say, the successful development of such self-developing systems would have a major impact not only on all information technologies, but also on the society as a whole. Acknowledgments This work has been supported in part by the AFOSR, MARCO (via FENA Center), and NSF. Valuable contributions made by Simon Fölling, Özgür Türel and Ibrahim Muckra, as well as useful discussions with P. Adams, J. Barhen, D. Hammerstrom, V. Protopopescu, T. Sejnowski, and D. Strukov are gratefully acknowledged. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] Frank, D. J. et al. (2001) Device scaling limits of Si MOSFETs and their application dependencies. Proc. IEEE 89(3): 259-288. Likharev, K. K. (2003) Electronics below 10 nm, in J. Greer et al. (eds.), Nano and Giga Challenges in Microelectronics, pp. 27-68. Amsterdam: Elsevier. Likharev, K. K. and Strukov, D. B. (2005) CMOL: Devices, circuits, and architectures, in G. Cuniberti et al. (eds.), Introducing Molecular Electronics, Ch. 16. Springer, Berlin. Fölling, S., Türel, Ö. & Likharev, K. K. (2001) Single-electron latching switches as nanoscale synapses, in Proc. of the 2001 Int. Joint Conf. on Neural Networks, pp. 216-221. Mount Royal, NJ: Int. Neural Network Society. Wang, W. et al. (2003) Mechanism of electron conduction in self-assembled alkanethiol monolayer devices. Phys. Rev. B 68(3): 035416 1-8. Stan M. et al. 
(2003) Molecular electronics: From devices and interconnect to circuits and architecture, Proc. IEEE 91(11): 1940-1957. Strukov, D. B. & Likharev, K. K. (2005) Prospects for terabit-scale nanoelectronic memories. Nanotechnology 16(1): 137-148. Strukov, D. B. & Likharev, K. K. (2005) CMOL FPGA: A reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices. Nanotechnology 16(6): 888-900. Türel, Ö. et al. (2004) Neuromorphic architectures for nanoelectronic circuits”, Int. J. of Circuit Theory and Appl. 32(5): 277-302. See, e.g., Hertz J. et al. (1991) Introduction to the Theory of Neural Computation. Cambridge, MA: Perseus. Lee, J. H. & Likharev, K. K. (2005) CrossNets as pattern classifiers. Lecture Notes in Computer Sciences 3575: 434-441.
2 0.96612662 6 nips-2005-A Connectionist Model for Constructive Modal Reasoning
Author: Artur Garcez, Luis C. Lamb, Dov M. Gabbay
Abstract: We present a new connectionist model for constructive, intuitionistic modal reasoning. We use ensembles of neural networks to represent intuitionistic modal theories, and show that for each intuitionistic modal program there exists a corresponding neural network ensemble that computes the program. This provides a massively parallel model for intuitionistic modal reasoning, and sets the scene for integrated reasoning, knowledge representation, and learning of intuitionistic theories in neural networks, since the networks in the ensemble can be trained by examples using standard neural learning algorithms. 1
3 0.96046036 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman
Abstract: We present an efficient algorithm to actively select queries for learning the boundaries separating a function domain into regions where the function is above and below a given threshold. We develop experiment selection methods based on entropy, misclassification rate, variance, and their combinations, and show how they perform on a number of data sets. We then show how these algorithms are used to determine simultaneously valid 1 − α confidence intervals for seven cosmological parameters. Experimentation shows that the algorithm reduces the computation necessary for the parameter estimation problem by an order of magnitude.
same-paper 4 0.95413017 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
Author: Baback Moghaddam, Yair Weiss, Shai Avidan
Abstract: Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials. 1
5 0.88821709 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
Author: Aaron Shon, Keith Grochow, Aaron Hertzmann, Rajesh P. Rao
Abstract: We propose an algorithm that uses Gaussian process regression to learn common hidden structure shared between corresponding sets of heterogenous observations. The observation spaces are linked via a single, reduced-dimensionality latent variable space. We present results from two datasets demonstrating the algorithms’s ability to synthesize novel data from learned correspondences. We first show that the method can learn the nonlinear mapping between corresponding views of objects, filling in missing data as needed to synthesize novel views. We then show that the method can learn a mapping between human degrees of freedom and robotic degrees of freedom for a humanoid robot, allowing robotic imitation of human poses from motion capture data. 1
6 0.69505852 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
7 0.62137121 181 nips-2005-Spiking Inputs to a Winner-take-all Network
8 0.61329412 149 nips-2005-Optimal cue selection strategy
9 0.57963955 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
10 0.57908815 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
11 0.57265335 20 nips-2005-Affine Structure From Sound
12 0.56980711 169 nips-2005-Saliency Based on Information Maximization
13 0.56765735 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
14 0.56212819 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
15 0.55479419 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise
16 0.54544306 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
17 0.54215091 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
18 0.5418731 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
19 0.53826421 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
20 0.53521514 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs