nips nips2007 nips2007-96 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., [sent-8, score-0.322]
2 different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). [sent-10, score-0.379]
3 We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. [sent-11, score-0.471]
4 HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. [sent-12, score-0.188]
5 We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. [sent-13, score-0.264]
6 Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. [sent-14, score-0.264]
7 1 Introduction Microarray and other high-throughput measurement devices have been applied to examine specimens such as cancer tissues of biological and/or clinical interest. [sent-15, score-0.195]
8 The next step is to move towards combinatorial studies in which tissues measured by two or more such devices are analyzed simultaneously. [sent-16, score-0.134]
9 Also, when concatenating a data set from different measurement sources, we often observe systematic missing parts in the dataset. [sent-18, score-0.414]
10 All of these induce a heterogeneous structure in the data that needs to be treated appropriately. [sent-22, score-0.141]
11 Our work will contribute exactly to this topic, by proposing a Bayesian method for feature subspace extraction, called heterogeneous component analysis (HCA, sections 2 and 3). [sent-23, score-0.169]
12 HCA performs a linear feature extraction based on matrix factorization in order to obtain a sparse and structured representation. [sent-24, score-0.324]
13 After relating HCA to previous methods (section 4), it is applied to toy data and, more interestingly, to neuroblastoma data from different measurement techniques (section 5). [sent-25, score-0.211]
14 2 Formulation of the HCA problem Let a matrix Y = {y_{ij}}_{i=1:M, j=1:N} denote a set of N observations of M-dimensional feature vectors, where y_{ij} ∈ R is the j-th observation of the i-th feature. [sent-27, score-0.256]
15 In a heterogeneous situation, we assume the M-dimensional feature vector is decomposed into L disjoint blocks. [sent-28, score-0.146]
16 The observation matrix Y consists of multiple samples j = 1, . . . , N. [sent-31, score-0.118]
17 There are many missing observations, whose distribution is highly structured and depends on the block. [sent-36, score-0.307]
18 HCA optimally factorizes the matrix Y so that the factor-loading matrix U has structural sparseness; it includes some regions of zero elements according to the block structure of the observed data. [sent-37, score-0.352]
19 Each factor may affect all of the features within a block or none of them, and each block is not necessarily affected by all of the factors. [sent-38, score-0.242]
20 Therefore, the rank of each factor-loading sub-matrix for each block (or any set of blocks) can be different from the others. [sent-39, score-0.195]
21 The resulting block-wise sparse matrix reflects a characteristic heterogeneity of features over blocks. [sent-40, score-0.201]
22 There may be missing or unmeasured observations, denoted by a matrix W ∈ {0, 1}^{M×N}, which indicates that observation y_{ij} is missing if w_{ij} = 0 and observed otherwise (w_{ij} = 1). [sent-43, score-0.93]
23 In this example, the observed data matrix (left panel) is made up of three blocks of features. [sent-45, score-0.133]
24 They have block-wise variation in effective dimensionalities, missing rates, observation noise levels, and so on, which we collectively call heterogeneity. [sent-46, score-0.346]
25 To better understand the target data through feature extraction by matrix factorization, we assume a block-wise sparse factor-loading matrix U (right panel in Fig. [sent-49, score-0.328]
26 Namely, the effective rank of an observation sub-matrix corresponding to a block is reflected by the number of non-zero components in the corresponding rows of U . [sent-51, score-0.234]
27 Assuming such a block-wise sparse structure can decrease the model's effective complexity, describe the data better, and therefore lead to better generalization ability. [sent-52, score-0.121]
28 For a factor matrix V, we assume a Gaussian prior: \ln p(V) = \sum_{j=1}^{N} \sum_{k=1}^{K} \frac{1}{2}\left(-v_{jk}^2 - \ln 2\pi\right). [sent-56, score-0.559]
29 Another special case where each block contains only one active feature is probabilistic factor analysis (FA). [sent-58, score-0.244]
30 \ln p(Y, V \mid U, \sigma^2) = \frac{1}{2}\sum_{ij} w_{ij}\left(-\sigma_{l(i)}^{-2} e_{ij}^2 - \ln \sigma_{l(i)}^2 - \ln 2\pi\right) + \frac{1}{2}\sum_{jk}\left(-v_{jk}^2 - \ln 2\pi\right), where e_{ij} = y_{ij} - \sum_k u_{ik} v_{jk} is the residual. [sent-60, score-0.709]
31 Since w_{ij} = 0 iff y_{ij} is missing, the summation over ij is effectively taken over all observed entries of Y. [sent-65, score-0.238]
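A minimal numerical sketch of this masked log-joint, assuming NumPy arrays Y, W, U, V, a length-L vector of block residual variances sigma2_block, and an integer map block_of_row from feature index i to its block l(i); all names here are illustrative, not from the paper.

    import numpy as np

    def log_joint_Y_V(Y, W, U, V, sigma2_block, block_of_row):
        """Masked log-joint ln p(Y, V | U, sigma^2); entries with W == 0 (missing) are skipped."""
        E = np.nan_to_num(Y) - U @ V.T                      # residuals e_ij = y_ij - sum_k u_ik v_jk
        var = sigma2_block[block_of_row][:, None]           # block-wise residual variance for each row i
        obs_term = 0.5 * np.sum(W * (-E**2 / var - np.log(var) - np.log(2.0 * np.pi)))
        prior_term = 0.5 * np.sum(-V**2 - np.log(2.0 * np.pi))
        return obs_term + prior_term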
32 Each column vector of the mask matrix takes one of the possible block-wise mask patterns: a binary pattern vector whose dimensionality is the same as that of the factor-loading vector, and whose values are constant (either 0 or 1) within each block. [sent-67, score-0.255]
33 When there are L blocks, each column vector of T can take one of 2^L possible patterns including the zero vector, and hence the matrix T with K columns can take one of 2^{LK} possible patterns. [sent-68, score-0.106]
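A short sketch of enumerating these 2^L block-wise patterns for a single column of T; the block sizes in the example are those of the toy data described later and are otherwise arbitrary.

    import itertools
    import numpy as np

    def blockwise_mask_patterns(block_sizes):
        """Yield all 2^L binary vectors that are constant (0 or 1) within each of the L blocks."""
        for bits in itertools.product((0, 1), repeat=len(block_sizes)):
            yield np.concatenate([np.full(n, b) for b, n in zip(bits, block_sizes)])

    # Example: 3 blocks of sizes 20, 50, 100 give 2^3 = 8 candidate mask vectors of length 170.
    patterns = list(blockwise_mask_patterns([20, 50, 100]))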
34 The bracketed average [·] is taken over all pairs (i, j) not missing in the l-th block. [sent-80, score-0.307]
35 Model selection The mask matrix T was determined by maximizing the log-marginal likelihood \ln \int e^{L}\,dU\,dV, which was calculated by a Laplace approximation around the MAP estimator: E(T) \stackrel{\mathrm{def}}{=} L - \frac{1}{2}\ln\det H, (8) where H \stackrel{\mathrm{def}}{=} \partial^2 L / (\partial\theta\,\partial\theta^{\top}) is the Hessian of the log-joint w.r.t. the parameters \theta. [sent-88, score-0.293]
36 In each step of the HCA-greedy algorithm, factor-loading and factor vectors are estimated for each of the 2^L possible block-wise mask vectors, and we accept the one achieving the maximum log-marginal likelihood. [sent-95, score-0.179]
37 The algorithm terminates if the zero vector is accepted as the best mask vector. [sent-96, score-0.125]
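A schematic of one such greedy step, assuming candidate masks as produced by the blockwise_mask_patterns sketch above and two hypothetical helpers: fit_rank_one (MAP estimation of one factor-loading/factor pair on the masked residual) and log_marginal (the Laplace-approximated score E(T) of Eq. (8)). This is only a sketch of the procedure described here, not the authors' implementation.

    import numpy as np

    def hca_greedy_step(R, W, mask_patterns, fit_rank_one, log_marginal):
        """Try each candidate block-wise mask for the next factor; return (u, v), or None to terminate."""
        best = None
        for t in mask_patterns:                             # the 2^L block-wise mask candidates
            u, v = fit_rank_one(R, W * t[:, None])          # fit restricted to the active blocks
            score = log_marginal(R, W, u * t, v)            # Laplace-approximated evidence of this mask
            if best is None or score > best[0]:
                best = (score, t, u * t, v)
        _, t, u, v = best
        return None if not t.any() else (u, v)              # zero mask chosen: stop adding factors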
38 The prior for U is given by \ln p(U \mid \alpha) = \frac{1}{2}\sum_{l=1}^{L}\sum_{k=1}^{K}\sum_{i \in I_l}\left(-\alpha_{lk} u_{ik}^2 + \ln \alpha_{lk} - \ln 2\pi\right), (10) where \alpha_{lk} is an ARD hyper-parameter for the l-th block of the k-th column of U. [sent-99, score-0.637]
39 With this prior, the log-joint probability density function becomes \ln p(Y, U, V \mid \sigma^2, \alpha) = \frac{1}{2}\sum_{ij} w_{ij}\left(-\sigma_{l(i)}^{-2} e_{ij}^2 - \ln \sigma_{l(i)}^2 - \ln 2\pi\right) + \frac{1}{2}\sum_{jk}\left(-v_{jk}^2 - \ln 2\pi\right) + \frac{1}{2}\sum_{ik}\left(-\alpha_{l(i)k} u_{ik}^2 + \ln \alpha_{l(i)k} - \ln 2\pi\right). (11) [sent-107, score-1.006]
40 According to this ARD approach, α is updated by conjugate gradient-based optimization simultaneously with U and V. [sent-108, score-0.124]
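The paper updates α jointly with U and V by conjugate gradients; purely as an illustration of the ARD mechanism (not the authors' update rule), a common fixed-point-style alternative sets each α_lk to the inverse mean square of its block's loadings.

    import numpy as np

    def ard_alpha_update(U, block_rows, eps=1e-10):
        """Illustrative fixed-point ARD update: alpha_lk ~ 1 / mean_{i in I_l}(u_ik^2)."""
        alpha = np.empty((len(block_rows), U.shape[1]))
        for l, rows in enumerate(block_rows):               # block_rows[l] holds the row indices I_l
            alpha[l] = 1.0 / (np.mean(U[rows] ** 2, axis=0) + eps)
        return alpha

A large α_lk shrinks the corresponding block of loadings toward zero through the prior term of Eq. (10), which is how the block-wise sparsity emerges.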
41 4 Related work In this work, the ideas from both probabilistic modeling of linear component analyzers and sparse matrix factorization frameworks are combined into an analytical tool for data with underlying heterogeneous structures. [sent-117, score-0.397]
42 The weighted low-rank matrix factorization (WLRMF) [3] has been proposed as a minimization problem of the weighted error: \min_{U,V} \sum_{i,j} w_{ij}\left(y_{ij} - \sum_k u_{ik} v_{jk}\right)^2, (12) where w_{ij} is a weight for the element y_{ij} of the observation matrix Y. [sent-118, score-0.847]
43 The weight is set to w_{ij} = 0 if the corresponding y_{ij} is missing, and to w_{ij} > 0 otherwise. [sent-119, score-0.594]
44 This objective function is equivalent to the (negative) log-likelihood of a probabilistic generative model based on an assumption that each element of the residual matrix obeys a Gaussian distribution with variance 1/wij . [sent-120, score-0.152]
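A minimal alternating weighted least-squares sketch of objective (12); the ridge term, iteration count and initialization are arbitrary choices and are not taken from [3].

    import numpy as np

    def wlrmf(Y, W, K, n_iters=50, ridge=1e-6, seed=0):
        """Minimize sum_ij w_ij (y_ij - [U V^T]_ij)^2 by alternating row/column least squares."""
        rng = np.random.default_rng(seed)
        M, N = Y.shape
        U = 0.1 * rng.standard_normal((M, K))
        V = 0.1 * rng.standard_normal((N, K))
        Y0 = np.nan_to_num(Y)                               # missing entries carry zero weight anyway
        for _ in range(n_iters):
            for i in range(M):                              # update loading row u_i
                w = W[i]
                A = (V * w[:, None]).T @ V + ridge * np.eye(K)
                U[i] = np.linalg.solve(A, (V * w[:, None]).T @ Y0[i])
            for j in range(N):                              # update factor row v_j
                w = W[:, j]
                A = (U * w[:, None]).T @ U + ridge * np.eye(K)
                V[j] = np.linalg.solve(A, (U * w[:, None]).T @ Y0[:, j])
        return U, V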
45 Although the prior term, \ln p(V) = -\frac{1}{2}\sum_{jk} v_{jk}^2 + \mathrm{const}. [sent-122, score-0.327]
46 Bayesian PCA [2] is also a matrix factorization procedure, which includes a characteristic prior density of factor-loading vectors, \ln p(U \mid \alpha) = -\frac{1}{2}\sum_{ik} \alpha_k u_{ik}^2 + \mathrm{const}. [sent-125, score-0.341]
47 It is an equivalent prior for [Figure 2 panels: (A) Missing pattern, (B) True factor loading, (C) SVD, (D) WLRMF, (E) BPCA, (F) HCA-greedy, (G) HCA-ARD, (I) missing value prediction performance.] [sent-127, score-0.148]
48 Vertical and horizontal axes correspond to row (typically, genes) and column (typically, samples) of the matrix (typically, gene expression matrix). [sent-134, score-0.169]
49 Panel (I) shows missing value prediction performance obtained by the three HCA algorithms and other methods. [sent-141, score-0.307]
50 The vertical and horizontal axes denote the normalized root mean square test error and the number of factors, respectively. [sent-142, score-0.172]
51 Component analyzers with sparse factor-loadings have recently been investigated as sparse PCA (SPCA). [sent-146, score-0.182]
52 In SPCA (e.g., [4]), the tradeoff between understandability (sparsity of the factor-loadings) and reproducibility of the covariance matrix from the sparsified factor-loadings is addressed. [sent-149, score-0.116]
53 In our HCA, the block-wise sparse factor-loading matrix is useful not only for understandability but also for generalization ability. [sent-150, score-0.213]
54 The latter merit comes from the assumption that the observations include uncertainty due to small sample sizes, large noise, and missing observations, which have not been considered sufficiently in SPCA. [sent-151, score-0.346]
55 5 Experiments Experiment 1: an artificial dataset We prepared an artificial data set with an underlying block structure. [sent-152, score-0.199]
56 For this we generated a 170 × 9 factor-loading matrix U that included a pre-determined block structure (white vs. [sent-153, score-0.273]
57 2(B)), and a 100 × 9 factor matrix V by applying orthogonalization to the factors sampled from a standard Gaussian distribution. [sent-155, score-0.227]
58 The observation matrix Y was produced as Y = U V^T + E, where each element of E was generated from a standard Gaussian. [sent-156, score-0.141]
59 Then, missing values were artificially introduced according to the pre-determined block structure (Fig. [sent-157, score-0.501]
60 • Block 1 consisted of 20 features with 10% of entries missing, selected at random. [sent-159, score-0.365]
61 • Block 2 consisted of 50 features, of which 50% of the columns were completely missing and the remaining columns contained a further 50% of entries missing at random. [sent-160, score-0.726]
62 • Block 3 consisted of 100 features, of which 20% of the columns were completely missing and the remaining columns contained a further 20% of entries missing at random (a data-generation sketch follows below). [sent-161, score-0.726]
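A sketch of generating a comparable toy dataset under these assumptions; the exact block-wise zero pattern of the true loadings in Fig. 2(B) is not reproduced, and all names are illustrative.

    import numpy as np

    def make_toy_data(seed=0):
        rng = np.random.default_rng(seed)
        block_sizes, M, N, K = [20, 50, 100], 170, 100, 9
        U = rng.standard_normal((M, K))                     # a real block-wise zero pattern would be imposed here
        V, _ = np.linalg.qr(rng.standard_normal((N, K)))    # orthogonalized factor matrix
        Y = U @ V.T + rng.standard_normal((M, N))           # observations with unit-variance noise
        W = np.ones((M, N), dtype=int)                      # observation mask: 1 = observed, 0 = missing
        starts = np.cumsum([0] + block_sizes)
        col_rates, elem_rates = [0.0, 0.5, 0.2], [0.1, 0.5, 0.2]
        for (s, e), cr, er in zip(zip(starts[:-1], starts[1:]), col_rates, elem_rates):
            W[s:e, rng.random(N) < cr] = 0                  # completely missing columns within the block
            W[s:e] *= (rng.random((e - s, N)) >= er)        # additional element-wise missing entries
        return Y, W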
63 We applied three HCA algorithms: HCA-greedy, HCA-ARD, and HCA-g+ARD, and three existing matrix factorization algorithms: SVD, WLRMF and BPCA. [sent-162, score-0.144]
64 SVD: SVD calculated on the matrix with missing values imputed to zero. [sent-163, score-0.386]
65 WLRMF [3]: The weights were set to 1 for observed entries and to 0 for missing entries. [sent-164, score-0.345]
66 BPCA: WLRMF with an ARD prior, called here BPCA, which is equivalent to HCA-ARD except that all features are placed in a single active block (i.e., L = 1). [sent-165, score-0.197]
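For reference, the zero-imputation SVD baseline just described can be sketched as follows (K is the number of retained factors).

    import numpy as np

    def svd_zero_impute(Y, W, K):
        """Rank-K SVD reconstruction after imputing missing entries (W == 0) with zeros."""
        Y0 = np.where(W == 1, np.nan_to_num(Y), 0.0)
        Uf, s, Vt = np.linalg.svd(Y0, full_matrices=False)
        return (Uf[:, :K] * s[:K]) @ Vt[:K]                 # low-rank estimate of the full matrix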
67 The generalization ability was evaluated on the basis of the estimation performance for artificially introduced missing values. [sent-170, score-0.335]
68 The estimated factor-loading matrices and missing value estimation accuracies are shown in Figure 2. [sent-171, score-0.358]
69 The factor-loading matrix estimated by HCA-greedy showed a sparse structure identical to that of the top five factors in the true factor-loadings. [sent-174, score-0.297]
70 The sixth factor in the second block was not extracted, possibly because the second block lacked information due to the large rate of missing values. [sent-175, score-0.692]
71 As shown in panel (I), however, such a poorly extracted structure did not increase the generalization error, implying that the essential structure underlying the data was extracted well by the three HCA-based algorithms. [sent-178, score-0.209]
72 Reconstruction of missing values was evaluated by the normalized root mean square error: \mathrm{NRMSE} \stackrel{\mathrm{def}}{=} \sqrt{\mathrm{mean}[(y - \tilde{y})^2] / \mathrm{var}[y]}, where y and \tilde{y} denote true and estimated values, respectively; the mean is the average over all the missing entries and the variance is over all entries of the matrix. [sent-179, score-0.798]
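A sketch of this evaluation on the held-out entries, assuming the true matrix Y_true is fully known (as in the toy experiment) and test_mask marks the artificially removed entries.

    import numpy as np

    def nrmse(Y_true, Y_hat, test_mask):
        """Normalized root mean square error over held-out entries (test_mask == 1)."""
        diff = (Y_true - Y_hat)[test_mask == 1]
        return np.sqrt(np.mean(diff ** 2) / np.var(Y_true))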
73 Figure 2(I) shows the generalization ability of missing value predictions. [sent-180, score-0.335]
74 [Figure 3: Analysis of an NBL dataset. Panels: (A) Missing entries, (B) Factor loading (HCA-greedy), (C) Factor loading (WLRMF); row groups: array CGH, Microarray 1, Microarray 2; axes: Samples and Factors.] [sent-192, score-0.173]
75 Features measured by array CGH technology are sorted in the chromosomal order. [sent-194, score-0.131]
76 White and red colors denote observed and missing entries in the data matrix, respectively. [sent-197, score-0.345]
77 Experiment 2: a cross-analysis of neuroblastoma data We next applied our HCA to a neuroblastoma (NBL) dataset consisting of three data blocks taken by three kinds of high-throughput genomic measurement technologies. [sent-199, score-0.407]
78 Array CGH: Chromosomal changes of 2340 DNA segments (using 2340 probes) were measured for each of 230 NBL tumors, using array comparative genomic hybridization (array CGH) technology. [sent-200, score-0.12]
79 Microarray 1: Expression levels of 5340 genes were measured for 136 tumors from NBL patients. [sent-202, score-0.165]
80 Microarray 2: Gene expression levels in 25 out of the 136 tumors were also measured by a small-sized microarray technology harboring 448 probes. [sent-204, score-0.289]
81 As seen in Figure 3(A), the set of measured samples was quite different in the three experiments, leading to apparent block-wise missing observations. [sent-206, score-0.348]
82 We further added 10% missing entries randomly into the observed entries in order to evaluate missing value prediction performance. [sent-208, score-0.69]
83 When HCA-greedy was applied to this dataset, it terminated at K = 23, but we continued to obtain further factors until K = 80. [sent-209, score-0.14]
84 HCA-greedy extracted one factor shared across the three measurement devices and three factors shared between aCGH and Microarray 1. [sent-211, score-0.332]
85 This suggests that the Microarray 2 dataset did not include factors strongly related to the prognosis other than the first one. [sent-215, score-0.132]
86 On the other hand, WLRMF extracted a first factor identical to that of HCA-greedy, but extracted many more factors concerning Microarray 2, which may not all be trustworthy because the number of samples observed in Microarray 2 was as small as 25. [sent-216, score-0.238]
87 Horizontal axis denotes the number of factors (A and B) or the number of non-zero elements in the factor-loading matrices (C). [sent-241, score-0.156]
88 Note that the original data matrix included many missing values, but we evaluated the performance by using artificially introduced missing values. [sent-249, score-0.693]
89 Training errors almost monotonically decreased as the number of factors increased (Fig. [sent-251, score-0.129]
90 SVD and WLRMF exhibited their best performance at K = 22 and K = 60, respectively, and degraded as the number of factors increased further, due to over-fitting. [sent-256, score-0.2]
91 Overall, the variants of our new HCA concept have shown good generalization performance as measured on missing values, similar to existing methods such as WLRMF. [sent-257, score-0.376]
92 Our Bayesian HCA model makes it possible to take into account structured feature vectors that possess different intrinsic blocking characteristics. [sent-261, score-0.2]
93 The new probabilistic structured matrix factorization framework was applied to toy data and to neuroblastoma data collected by multiple high-throughput measurement devices which had block-wise missing structures due to different experimental designs. [sent-262, score-0.776]
94 HCA achieved a block-wise sparse factor-loading matrix that simultaneously represents the amount of information contained in each block of the dataset. [sent-263, score-0.268]
95 While HCA provided missing value prediction performance similar to or better than existing methods such as BPCA or WLRMF, it clearly captured the heterogeneous structure underlying the problem much better. [sent-264, score-0.448]
96 Furthermore, the derived HCA factors form an interesting representation that may ultimately lead to better modeling of the neuroblastoma data (see section 5). [sent-265, score-0.213]
97 In the current HCA implementation, block structures were assumed to be known, as for the neuroblastoma data. [sent-266, score-0.28]
98 Clearly there is an increasing need for methods that are able to reliably extract factors from multimodal structured data with heterogeneous features. [sent-268, score-0.273]
99 A Bayesian missing value estimation method for gene expression profile data. [sent-305, score-0.337]
100 Expression profiling using a tumor-specific cDNA microarray predicts the prognosis of intermediate risk neuroblastomas. [sent-322, score-0.224]
wordName wordTfidf (topN-words)
[('hca', 0.533), ('wlrmf', 0.349), ('missing', 0.307), ('bpca', 0.221), ('ard', 0.206), ('block', 0.17), ('microarray', 0.169), ('vjk', 0.165), ('svd', 0.135), ('ln', 0.135), ('heterogeneous', 0.117), ('nbl', 0.11), ('neuroblastoma', 0.11), ('nrmse', 0.11), ('uik', 0.11), ('yij', 0.109), ('factors', 0.103), ('wij', 0.089), ('mask', 0.088), ('cgh', 0.08), ('matrix', 0.079), ('measurement', 0.078), ('sparse', 0.069), ('exhibited', 0.068), ('factorization', 0.065), ('def', 0.063), ('ik', 0.062), ('devices', 0.061), ('netlab', 0.055), ('prognosis', 0.055), ('tik', 0.055), ('blocks', 0.054), ('structured', 0.053), ('array', 0.053), ('dimensionalities', 0.048), ('tumors', 0.048), ('lk', 0.045), ('genes', 0.045), ('extracted', 0.045), ('factor', 0.045), ('bioinformatics', 0.044), ('analyzers', 0.044), ('panel', 0.043), ('loading', 0.041), ('measured', 0.041), ('ij', 0.04), ('pca', 0.04), ('observation', 0.039), ('entries', 0.038), ('vertical', 0.038), ('acgh', 0.037), ('chromosomal', 0.037), ('hcagreedy', 0.037), ('lndeth', 0.037), ('oba', 0.037), ('spca', 0.037), ('understandability', 0.037), ('terminated', 0.037), ('intrinsic', 0.033), ('possess', 0.032), ('probes', 0.032), ('tissues', 0.032), ('cially', 0.031), ('levels', 0.031), ('consisted', 0.031), ('axes', 0.031), ('gene', 0.03), ('horizontal', 0.029), ('assay', 0.029), ('blocking', 0.029), ('got', 0.029), ('matrices', 0.029), ('feature', 0.029), ('extraction', 0.029), ('dataset', 0.029), ('generalization', 0.028), ('bayesian', 0.028), ('residual', 0.027), ('eij', 0.027), ('columns', 0.027), ('jk', 0.027), ('features', 0.027), ('errors', 0.026), ('genomic', 0.026), ('heterogeneity', 0.026), ('rank', 0.025), ('fa', 0.024), ('clinical', 0.024), ('constituting', 0.024), ('axis', 0.024), ('structure', 0.024), ('vectors', 0.024), ('component', 0.023), ('white', 0.023), ('toy', 0.023), ('element', 0.023), ('variance', 0.023), ('pro', 0.023), ('estimated', 0.022), ('sparsity', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 96 nips-2007-Heterogeneous Component Analysis
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1
2 0.18283783 8 nips-2007-A New View of Automatic Relevance Determination
Author: David P. Wipf, Srikantan S. Nagarajan
Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted 1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1
3 0.093301013 156 nips-2007-Predictive Matrix-Variate t Models
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy data and EachMovie dataset show a good predictive accuracy of the model. 1
4 0.079998292 145 nips-2007-On Sparsity and Overcompleteness in Image Models
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
5 0.068872146 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
Author: José M. Hernández-lobato, Tjeerd Dijkstra, Tom Heskes
Abstract: We introduce a hierarchical Bayesian model for the discovery of putative regulators from gene expression data only. The hierarchy incorporates the knowledge that there are just a few regulators that by themselves only regulate a handful of genes. This is implemented through a so-called spike-and-slab prior, a mixture of Gaussians with different widths, with mixing weights from a hierarchical Bernoulli model. For efficient inference we implemented expectation propagation. Running the model on a malaria parasite data set, we found four genes with significant homology to transcription factors in an amoebe, one RNA regulator and three genes of unknown function (out of the top ten genes considered).
6 0.062439147 135 nips-2007-Multi-task Gaussian Process Prediction
7 0.059992392 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
8 0.05640693 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
9 0.052414365 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
10 0.051483907 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
11 0.049995571 63 nips-2007-Convex Relaxations of Latent Variable Training
12 0.04840884 182 nips-2007-Sparse deep belief net model for visual area V2
13 0.047058489 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
14 0.043964252 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
15 0.043118738 158 nips-2007-Probabilistic Matrix Factorization
16 0.042631581 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
17 0.040991209 99 nips-2007-Hierarchical Penalization
18 0.039654613 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
19 0.039055884 49 nips-2007-Colored Maximum Variance Unfolding
20 0.035133105 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
topicId topicWeight
[(0, -0.145), (1, 0.045), (2, -0.017), (3, 0.007), (4, -0.014), (5, 0.003), (6, -0.084), (7, 0.029), (8, -0.029), (9, -0.023), (10, 0.045), (11, -0.026), (12, 0.092), (13, 0.039), (14, 0.023), (15, 0.01), (16, -0.034), (17, 0.037), (18, -0.004), (19, -0.164), (20, -0.127), (21, 0.049), (22, -0.036), (23, -0.014), (24, -0.054), (25, 0.168), (26, 0.022), (27, 0.007), (28, -0.094), (29, -0.032), (30, 0.001), (31, -0.1), (32, -0.005), (33, -0.1), (34, -0.03), (35, -0.195), (36, -0.017), (37, -0.009), (38, -0.03), (39, 0.042), (40, 0.063), (41, -0.062), (42, 0.134), (43, 0.161), (44, -0.063), (45, 0.111), (46, -0.097), (47, -0.065), (48, 0.042), (49, -0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.93947858 96 nips-2007-Heterogeneous Component Analysis
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1
2 0.70554346 8 nips-2007-A New View of Automatic Relevance Determination
Author: David P. Wipf, Srikantan S. Nagarajan
Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted 1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1
3 0.61415219 156 nips-2007-Predictive Matrix-Variate t Models
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy data and EachMovie dataset show a good predictive accuracy of the model. 1
4 0.57281137 158 nips-2007-Probabilistic Matrix Factorization
Author: Andriy Mnih, Ruslan Salakhutdinov
Abstract: Many existing approaches to collaborative filtering can neither handle very large datasets nor easily deal with users who have very few ratings. In this paper we present the Probabilistic Matrix Factorization (PMF) model which scales linearly with the number of observations and, more importantly, performs well on the large, sparse, and very imbalanced Netflix dataset. We further extend the PMF model to include an adaptive prior on the model parameters and show how the model capacity can be controlled automatically. Finally, we introduce a constrained version of the PMF model that is based on the assumption that users who have rated similar sets of movies are likely to have similar preferences. The resulting model is able to generalize considerably better for users with very few ratings. When the predictions of multiple PMF models are linearly combined with the predictions of Restricted Boltzmann Machines models, we achieve an error rate of 0.8861, that is nearly 7% better than the score of Netflix’s own system.
5 0.56517297 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
Author: José M. Hernández-lobato, Tjeerd Dijkstra, Tom Heskes
Abstract: We introduce a hierarchical Bayesian model for the discovery of putative regulators from gene expression data only. The hierarchy incorporates the knowledge that there are just a few regulators that by themselves only regulate a handful of genes. This is implemented through a so-called spike-and-slab prior, a mixture of Gaussians with different widths, with mixing weights from a hierarchical Bernoulli model. For efficient inference we implemented expectation propagation. Running the model on a malaria parasite data set, we found four genes with significant homology to transcription factors in an amoebe, one RNA regulator and three genes of unknown function (out of the top ten genes considered).
6 0.50103337 145 nips-2007-On Sparsity and Overcompleteness in Image Models
7 0.48083389 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
8 0.44600716 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
9 0.38770235 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
10 0.37407878 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
11 0.36170465 26 nips-2007-An online Hebbian learning rule that performs Independent Component Analysis
12 0.34900582 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta
13 0.33998704 195 nips-2007-The Generalized FITC Approximation
14 0.33488432 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
15 0.33396304 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
16 0.3130973 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
17 0.31097624 193 nips-2007-The Distribution Family of Similarity Distances
18 0.29737362 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
19 0.29645914 135 nips-2007-Multi-task Gaussian Process Prediction
20 0.2947785 53 nips-2007-Compressed Regression
topicId topicWeight
[(5, 0.059), (9, 0.317), (13, 0.029), (16, 0.036), (18, 0.018), (19, 0.02), (21, 0.048), (31, 0.018), (34, 0.015), (35, 0.038), (47, 0.089), (49, 0.024), (83, 0.083), (85, 0.02), (87, 0.019), (90, 0.091)]
simIndex simValue paperId paperTitle
1 0.73551458 30 nips-2007-Bayes-Adaptive POMDPs
Author: Stephane Ross, Brahim Chaib-draa, Joelle Pineau
Abstract: Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can tradeoff between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and agent’s return improve over time, as the agent learns better model estimates. 1
same-paper 2 0.72499478 96 nips-2007-Heterogeneous Component Analysis
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1
3 0.52799618 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray
Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n2 ) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 1014 , many orders of magnitude beyond the previous state of the art. 1
4 0.50626522 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa
Abstract: Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ǫ-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time. 1
5 0.50070155 215 nips-2007-What makes some POMDP problems easy to approximate?
Author: Wee S. Lee, Nan Rong, Daniel J. Hsu
Abstract: Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for partially observable Markov decision processes (POMDPs) in high dimensional belief spaces. In this work, we seek to understand the belief-space properties that allow some POMDP problems to be approximated efficiently and thus help to explain the point-based algorithms’ success often observed in the experiments. We show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of a reachable belief space, which is the subset of the belief space reachable from a given belief point. We also show that under the weaker condition of having a small covering number for an optimal reachable space, which is the subset of the belief space reachable under an optimal policy, computing an approximately optimal solution is NP-hard. However, given a suitable set of points that “cover” an optimal reachable space well, an approximate solution can be computed in polynomial time. The covering number highlights several interesting properties that reduce the complexity of POMDP planning in practice, e.g., fully observed state variables, beliefs with sparse support, smooth beliefs, and circulant state-transition matrices. 1
6 0.49088889 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
7 0.48489791 158 nips-2007-Probabilistic Matrix Factorization
8 0.46144599 182 nips-2007-Sparse deep belief net model for visual area V2
9 0.45612976 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
10 0.4526521 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
11 0.44980466 63 nips-2007-Convex Relaxations of Latent Variable Training
12 0.44875687 156 nips-2007-Predictive Matrix-Variate t Models
13 0.44751075 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
14 0.4466396 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
15 0.44552487 195 nips-2007-The Generalized FITC Approximation
16 0.44394237 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
17 0.44388855 202 nips-2007-The discriminant center-surround hypothesis for bottom-up saliency
18 0.44365972 174 nips-2007-Selecting Observations against Adversarial Objectives
19 0.4424524 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
20 0.44213018 164 nips-2007-Receptive Fields without Spike-Triggering