jmlr jmlr2010 jmlr2010-6 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Patrick O. Perry, Art B. Owen
Abstract: In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure of non-Gaussianity in the histogram of the principal eigenvector components instead of on the eigenvalue. The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. Keywords: independent components analysis, Kronecker covariance, latent variables, projection pursuit, transposable data
Reference: text
sentIndex sentText sentNum sentScore
1 However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. [sent-7, score-0.606]
2 We develop a new statistical test based on random rotations for verifying the existence of latent variables. [sent-8, score-0.752]
3 The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. [sent-12, score-0.652]
4 Introduction. The problem we consider here is one of verifying statistically that an apparent latent variable is real. [sent-14, score-0.614]
5 Sometimes a suspected latent variable can be confirmed by looking more closely at the data or lab notes. [sent-31, score-0.589]
6 When we are confident that the variable is real then it makes sense to use a model that includes one or more latent variables. [sent-34, score-0.589]
7 A natural way to test for a latent variable is to compute a singular value decomposition of the residual matrix E = Y − XB and decide that a latent variable is present when the largest singular value of E is sufficiently large. [sent-35, score-1.37]
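As a point of reference for this baseline check, here is a minimal sketch in Python/NumPy; the least-squares fit for B and the variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def largest_singular_value_of_residual(Y, X):
    """Naive screen for latent structure: fit Y on X by least squares and
    return the top singular value of the residual matrix E = Y - X @ B_hat."""
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # p x N coefficient matrix
    E = Y - X @ B_hat                               # n x N residual matrix
    return np.linalg.svd(E, compute_uv=False)[0]
```

As the following sentences explain, the size of this singular value by itself cannot separate a real latent variable from correlated Gaussian noise, which is what motivates the eigenvector-based rotation test developed below.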
8 Adding a Gaussian latent variable simply changes the covariance structure and hence is not detectable. [sent-38, score-0.616]
9 While the eigenvalues offer no possibility to confirm the presence of a latent variable in correlated Gaussian noise, the eigenvectors of the covariance matrix do. [sent-41, score-0.681]
10 When the noise is independent across observations and comes from a multivariate Gaussian distribution, then under the null hypothesis of no latent variable, random rotations don’t change the distribution of our test statistic. [sent-46, score-0.859]
11 Our main contribution is to extend rotation tests to the context of regression for the explicit purpose of detecting latent variables. [sent-49, score-0.915]
12 Our principal focus is on testing for the presence versus the absence of one latent variable. [sent-51, score-0.612]
13 The regression model is usually used when we expect no latent variables. [sent-52, score-0.579]
14 The presence of even one latent variable would make it reasonable to switch to a factor model. [sent-53, score-0.589]
15 We also consider sequential tests for the correct number of latent variables when there is at least one of them. [sent-54, score-0.63]
16 The outline of the paper is as follows: Section 2 introduces the AGEMAP data as a motivating example and the regression model mixing measured and latent predictors. [sent-55, score-0.579]
17 Section 3 develops rotation tests for the existence of latent structure in the residual matrix from a regression. [sent-56, score-1.031]
18 We also show that Gaussian latent vectors cannot be detected, and then present some test statistics for non-Gaussian latent vectors. [sent-58, score-1.138]
19 The test is able to identify large latent variables and we find that it gives reliable p-values when no latent variables are present. [sent-60, score-1.138]
20 Background. Here we describe the AGEMAP data and introduce regression models that include both measured and latent variables. [sent-64, score-0.579]
21 There were five male and five female mice at each age. [sent-71, score-0.618]
22 Now suppose that there is a latent variable taking the value Ui for the array of the ith mouse. [sent-94, score-0.618]
23 Then adding the latent variable to the regression (1) yields Yij = β0j + β1jAi + β2jSi + γjUi + εij, 1 ≤ i ≤ n, where both γj and Ui are unknown. [sent-95, score-0.621]
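To make the model concrete, here is a hedged simulation sketch; the numbers of mice and genes, the age values, and the effect sizes are illustrative assumptions rather than the AGEMAP settings, and the latent variable is taken to be a ±1 dichotomy as in the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 40, 1000                               # mice (arrays) and genes; illustrative sizes
age = np.repeat([1.0, 6.0, 16.0, 24.0], 10)   # hypothetical ages in months
sex = np.tile([0.0, 1.0], 20)                 # hypothetical 0/1 sex coding
U = rng.choice([-1.0, 1.0], size=n)           # latent dichotomy U_i

beta0 = rng.normal(size=N)
beta1 = rng.normal(scale=0.1, size=N)
beta2 = rng.normal(scale=0.1, size=N)
gamma = rng.normal(scale=0.5, size=N)         # per-gene loadings on the latent term

# Y_ij = beta0_j + beta1_j * A_i + beta2_j * S_i + gamma_j * U_i + eps_ij
Y = (beta0
     + np.outer(age, beta1)
     + np.outer(sex, beta2)
     + np.outer(U, gamma)
     + rng.normal(size=(n, N)))
X = np.column_stack([np.ones(n), age, sex])   # observed design: intercept, age, sex
```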
24 Each individual regression includes more parameters than observations, having n latent values Ui and a coefficient γj. [sent-98, score-0.579]
25 3 Forcing Identifiability. The latent variable model in (2) is not identifiable. [sent-101, score-0.589]
26 The existence of a latent term is not affected by identifiability of U, so we won’t have to force U to be identifiable to detect a latent variable. [sent-110, score-1.094]
27 We can estimate the latent term UΓ from the residual matrix E = Y − XB. [sent-130, score-0.695]
28 If there is some prior knowledge about the distribution of the possible latent factors, this knowledge can be used in the estimation procedure, for example by choosing a particular test statistic to use in Section 3. [sent-134, score-0.649]
29 The primary task is to determine whether any latent structure exists in the residual matrix Y − XB. [sent-137, score-0.695]
30 Most methods for identifying latent structure only look at the singular values of the residual matrix E. [sent-144, score-0.695]
31 Rotations Under the Null Hypothesis. Under the null hypothesis of no latent variable, our error term is E ∼ N(0, I ⊗ ΣN). [sent-149, score-0.624]
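Concretely, this null says the rows of E are independent draws from N(0, ΣN); a short sketch of generating such noise follows, where the particular ΣN is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 40, 200
A = rng.normal(size=(N, N))
Sigma_N = A @ A.T / N + np.eye(N)    # an arbitrary positive-definite gene covariance
L = np.linalg.cholesky(Sigma_N)

Z = rng.normal(size=(n, N))          # i.i.d. standard normal entries
E = Z @ L.T                          # rows of E are independent N(0, Sigma_N) vectors
```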
32 2 Testing for the Existence of Structure. Here we construct a test for latent structure in the residual matrix. [sent-216, score-0.701]
33 We will construct a rotation test by modifying the rotation tests in Langsrud (2005). [sent-218, score-0.633]
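The construction below is a minimal sketch of how such a test could be organized from the description in this paper: draw random rotations that act only on the orthogonal complement of the column space of X, recompute the test statistic on each rotated residual matrix, and report a Monte Carlo p-value. The statistic `T` is left generic, and details such as the p-value convention are assumptions, not the authors' exact implementation.

```python
import numpy as np

def haar_rotation(m, rng):
    """Random m x m orthogonal matrix from the Haar measure (QR of a Gaussian)."""
    Z = rng.normal(size=(m, m))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))          # sign fix so the draw is Haar-distributed

def rotation_test(Y, X, T, n_rot=999, rng=None):
    """Monte Carlo rotation test for latent structure in the residuals of Y on X.
    The rotations fix the column space of X and rotate its orthogonal complement."""
    rng = rng if rng is not None else np.random.default_rng()
    n, p = X.shape
    Q_full, _ = np.linalg.qr(X, mode="complete")
    Q_perp = Q_full[:, p:]                  # n x (n - p) basis of the complement
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    E = Y - X @ B_hat                       # observed residual matrix
    t_obs = T(E)
    t_rot = np.empty(n_rot)
    for b in range(n_rot):
        O = haar_rotation(n - p, rng)
        t_rot[b] = T(Q_perp @ O @ (Q_perp.T @ E))   # rotate E within the complement
    return (1 + np.sum(t_rot >= t_obs)) / (1 + n_rot)
```

Under the Gaussian null described above, the statistic applied to a rotated residual matrix has the same distribution as the statistic applied to E, so this Monte Carlo p-value is valid for any plugged-in T; only the power depends on the choice of T.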
34 3 Gaussian Alternative. Here we show that when ΣN is unknown, a Gaussian latent variable cannot be detected. [sent-234, score-0.589]
35 To see how the problem manifests, consider the model in (3) with just one latent variable with entries Ui ∼ N (0, 1) independently of Z. [sent-237, score-0.589]
36 The implication of Proposition 4 is that if we don’t know anything about ΣN , or Γ, then a Gaussian latent variable is impossible to detect. [sent-250, score-0.589]
37 Also, if a latent variable corresponds to a roughly-linear time trend, then it will be nearly uniformly distributed if the points are sampled at regular time intervals. [sent-254, score-0.589]
38 4 Choice of the Test Statistic, T. Since a Gaussian latent variable is covered by the correlation model and is not detectable, any effective test statistic T must be tuned for non-Gaussian latent variables. [sent-257, score-1.265]
39 A non-Gaussian latent variable makes for an error term UΓ + ZΣN^(T/2) that does not have a rotationally invariant distribution. [sent-258, score-0.589]
40 The rotation-based p-value (6) is sensitive to large values of TL1 and should therefore catch dichotomies and light-tailed latent variables. [sent-280, score-0.881]
41 In particular, it is possible to detect the existence of latent structure in E using a PCA-based test statistic and then fit the structure using ICA. [sent-294, score-0.649]
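To make this concrete, here is a sketch of two statistics computed from the components of the first left singular vector of E. The paper's exact definitions of TL1 and TEPP are not reproduced here; the L1 norm of the unit-norm leading singular vector and a simple kurtosis-based non-Gaussianity index are offered only as illustrative stand-ins.

```python
import numpy as np
from scipy import stats

def leading_left_singular_vector(E):
    """Unit-norm first left singular vector of the n x N residual matrix E."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    return U[:, 0]

def t_l1(E):
    """L1 norm of the unit leading singular vector: it is largest when all entries
    have magnitude near 1/sqrt(n) (a dichotomy) and smallest when a single entry
    dominates (an outlier)."""
    return np.abs(leading_left_singular_vector(E)).sum()

def t_nongaussianity(E):
    """A generic projection-pursuit style index: absolute excess kurtosis of the
    leading singular vector's components (near zero for Gaussian-looking vectors)."""
    return abs(stats.kurtosis(leading_left_singular_vector(E), fisher=True))
```

Either function can be passed as the statistic `T` in the rotation test sketched above; both unusually large and unusually small values of the L1-type statistic are informative, since the two tails flag dichotomies and outliers respectively.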
42 5 Identifying the Rank of the Latent Term. When we are able to reject the null hypothesis we conclude that some latent structure exists, but we do not know the rank of U. [sent-298, score-0.624]
43 To estimate the number of latent variables we consider a sequential approach, based on subtracting estimated latent variables and looking for latent structure in the residuals. [sent-299, score-1.672]
44 If we determine that any structure exists, we fit a single latent variable u1 . [sent-303, score-0.589]
45 We proceed in a sequential manner: test for structure in Ei; upon identifying structure, fit a single latent variable, treat it as a covariate, and get a new residual matrix Ei+1. [sent-307, score-0.739]
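A sketch of this sequential procedure, reusing the `rotation_test` function sketched earlier; fitting each latent variable from the leading SVD term and treating it as a known covariate follows the description here, while the fixed-α stopping rule and the maximum rank are assumptions.

```python
import numpy as np

def sequential_latent_test(Y, X, T, alpha=0.05, max_rank=10, rng=None):
    """Repeatedly test the residual matrix for latent structure; after each
    rejection, fit one latent variable from the leading SVD term of the residual
    and append it to the covariates before testing again."""
    rng = rng if rng is not None else np.random.default_rng()
    X_cur = X.copy()
    latents = []
    for _ in range(max_rank):
        if rotation_test(Y, X_cur, T, rng=rng) > alpha:
            break                               # no further structure detected
        B_hat, *_ = np.linalg.lstsq(X_cur, Y, rcond=None)
        E = Y - X_cur @ B_hat
        u = np.linalg.svd(E, full_matrices=False)[0][:, 0]   # fitted latent variable
        latents.append(u)
        X_cur = np.column_stack([X_cur, u])     # treat the fitted u as a known covariate
    return latents
```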
46 One has to be careful when testing for more than one latent term. [sent-313, score-0.588]
47 In particular, for some settings when n ≪ N, it is impossible to consistently estimate the latent variables ui . [sent-314, score-0.59]
48 6 Caveats. When we reject the null hypothesis, then either there is strong enough latent structure in the data, or the noise is far from Gaussian. [sent-320, score-0.621]
49 Therefore, rejecting the null hypothesis is necessary to deem latent structure to be real, but not sufficient. [sent-321, score-0.649]
50 An outlier can be modeled using a latent variable that has support on a single observation. [sent-323, score-0.589]
51 Bi-modal noise can be re-cast as a clumping latent effect. [sent-324, score-0.619]
52 The existence of an unnormalized latent variable implies that a normalized one exists, and so the testing problem is unaffected. [sent-359, score-0.63]
53 We are interested in what happens when testing for the first, second, third, and fourth latent terms. [sent-380, score-0.588]
54 The upper-right panel shows the results from testing the residual matrix after one latent term has been removed. [sent-383, score-0.736]
55 The latent variables of the randomly-rotated data are more non-Gaussian than the latent variable estimated from the original data. [sent-389, score-1.167]
56 The lower-right panel shows the estimated p-values after three latent terms have been fit and removed from the residual matrix. [sent-390, score-0.688]
57 Finally, testing for a fourth latent variable gives us a uniform p-value, which is exactly what we want since there are only three latent terms. [sent-393, score-1.177]
58 A word is in order about how we removed the first latent variable when testing for the presence of the second. [sent-394, score-0.63]
59 3 Testing Under and Near the Null Hypothesis. In the previous simulation, the signal-to-noise ratio between the latent effect terms and the random error is relatively high, and so the p-values for non-existent latent terms are faithful. [sent-399, score-1.094]
60 In this simulation, we demonstrate that the p-values for testing for multiple latent variables are slightly liberal if the signal strength is too weak, but these p-values are still within tolerable accuracy. [sent-400, score-0.588]
61 There is a single latent variable u which has elements equal to −1 or +1 with equal probability. [sent-402, score-0.589]
62 2 tells us that the p-value from a rotation test of a single latent term is uniformly-distributed when λ = 0. [sent-408, score-0.844]
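A quick way to check this uniformity claim numerically, reusing the `rotation_test` and `t_l1` sketches above with a toy design; all sizes and the number of replications are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 20, 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # toy design matrix
p_vals = []
for _ in range(200):
    Y = rng.normal(size=(n, N))                          # pure noise: lambda = 0
    p_vals.append(rotation_test(Y, X, t_l1, n_rot=199, rng=rng))
# With no latent term the p-values should look roughly uniform on [0, 1].
print(np.quantile(p_vals, [0.1, 0.5, 0.9]))
```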
63 The issue is whether errors in the estimated first latent vector spoil the test for the second. [sent-411, score-0.622]
64 We simulate data from a model with three latent terms and then apply rotation tests for latent structure. [sent-415, score-1.43]
65 2) Fit a single latent variable u, using the first term in the SVD of Y . [sent-428, score-0.589]
66 4) Test for the existence of more latent terms with a PCA-based rotation test, using TEPP as our test statistic and treating û as a known covariate. [sent-430, score-0.97]
67 When the first latent variable is strong, then we have a reliable test for the second. [sent-437, score-0.633]
68 The left plot shows the latent variable estimated for the cerebellum in each of 39 mice plotted versus the ages of those mice. [sent-439, score-1.025]
69 The right plot shows a histogram of the regression coefficients for the latent variable. [sent-441, score-0.636]
70 We fit 16 regression models of gene activation on age and sex with one latent variable, one for each tissue. [sent-448, score-0.725]
71 (2) The latent variables for tissue 2 (the cerebellum) have a striking pattern. [sent-453, score-0.627]
72 Figure 3 shows that latent variable plotted versus age, with plot symbols encoding the sex of the mouse. [sent-455, score-0.705]
73 It is clear that the mice are split into two different groups, one with a high value of the latent variable and one low. [sent-456, score-0.859]
74 That cannot be the case here because the estimated latent variable is orthogonal to both the sex and age variables by construction, meaning the sum of its coefficients over male samples must equal the negative of the sum over females. [sent-458, score-0.785]
75 There are high and low values for the latent variable for the cerebellum. [sent-459, score-0.589]
76 The second panel of Figure 4 shows the histogram of these latent values. [sent-460, score-0.604]
77 This figure shows histograms of the latent variables found in microarray data from 16 mouse tissues. [sent-464, score-0.709]
78 In each histogram the latent variable values from up to 40 mice are given. [sent-465, score-0.916]
79 The biggest latent effect in these tissues is that the expression of one mouse was quite different from that of the other mice, and that difference is reflected in a large number of genes. [sent-468, score-0.968]
80 For both of these tissues, the latent variable is splitting the mice into two groups. [sent-472, score-0.859]
81 Figure 5 plots the estimated latent variable from the cerebrum versus that for the cerebellum. [sent-475, score-0.743]
82 This figure plots the latent variable from the cerebrum versus that from the cerebellum for the 39 mice for which both arrays were available. [sent-478, score-1.117]
83 About one third of the mice have the rare cerebellum type, one third have the rare cerebrum type, and the remaining mice have the common form for both tissues. [sent-485, score-0.862]
84 A p value based on Fisher’s exact test is computed from Table 1, which counts the mice in the corners of Figure 5: rare cerebrum with common cerebellum, 13; common cerebrum with common cerebellum, 12; rare cerebrum with rare cerebellum, 0; common cerebrum with rare cerebellum, 14. [sent-487, score-0.83]
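For the 2 × 2 table of counts just described, Fisher's exact test can be computed directly; a small sketch follows, where the counts come from Table 1 and the value reported in the paper is not reproduced here.

```python
from scipy.stats import fisher_exact

# Rows: common / rare cerebellum; columns: rare / common cerebrum (Table 1 counts).
table = [[13, 12],
         [ 0, 14]]
odds_ratio, p_value = fisher_exact(table)
print(p_value)
```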
85 The estimated latent variables for the cerebellum, cerebrum, and eye tissues exhibited dichotomies. [sent-496, score-0.691]
86 The gonad, spinal cord, and striatum latent variables have clear outliers, which manifest as significantly low values of TL1 and significantly high values of TEPP. [sent-498, score-0.638]
87 The spleen latent variable potentially has an outlier at age 5 months; it is found to be marginally significant according to TEPP but not significant according to TL1. [sent-499, score-0.728]
88 The only case where TL1 and TEPP give drastically different results is with the latent variable ˆ estimated from the hippocampus data. [sent-501, score-0.669]
89 A possible explanation for why the latent variable is insignificant according to TL1 is this: TL1 is simultaneously measuring presence of outliers and presence of clumping. [sent-506, score-0.589]
90 Conclusions. We find that it is possible to test for latent variables in correlated Gaussian noise by a rotation test using a projection pursuit index applied to the components of the first singular vector, instead of the usual test based on the size of the largest singular value. [sent-513, score-1.016]
91 Testing for one latent variable is theoretically justified and reliable. [sent-516, score-0.589]
92 It might uncover eigenvectors with especially … Figure 6: This figure shows histograms of 1000 realizations of the test statistics TL1 and TEPP after applying random rotations to the estimated latent variable. [sent-521, score-0.827]
93 The latent variable is found to be barely significant at the 0. [sent-530, score-0.589]
94 Our original interest was to see if thousands of genes could be used to define a genomic “true age” of a sample of mouse tissue as a latent variable in the residuals from a regression that did not include age. [sent-534, score-0.884]
95 It turned out that the dominant latent variable bore no resemblance to chronological age. A table reports, for each tissue (Adrenal, Bone Marrow, Cerebellum, Cerebrum, Eye, Gonad, Heart, Hippocampus, Kidney, Liver, Lung, Muscle, Spleen, Spinal Cord, Striatum, Thymus), the R2 of the latent term and the TL1 results. [sent-535, score-0.589]
96 In the second column of the table, we indicate how much of the residual is explained by the latent term. [sent-620, score-0.657]
97 A Bonferroni correction for multiple testing would multiply the p-values by 32 and would find most of the same latent variables significant. [sent-622, score-0.588]
98 We never uncovered a biological explanation for the dichotomies and other latent variables that we saw. [sent-624, score-0.6]
99 Several of the tissues did not have apparent latent variables. [sent-626, score-0.657]
100 It may happen that a latent variable is statistically significant when judged by a rotation test but only explains a negligible amount of the response variation. [sent-628, score-0.886]
wordName wordTfidf (topN-words)
[('latent', 0.547), ('mice', 0.27), ('tepp', 0.27), ('rotation', 0.253), ('qt', 0.232), ('agemap', 0.209), ('owen', 0.209), ('rotations', 0.161), ('cerebellum', 0.135), ('erry', 0.135), ('atent', 0.126), ('cerebrum', 0.123), ('ot', 0.111), ('residual', 0.11), ('ohot', 0.098), ('tructure', 0.095), ('tissues', 0.085), ('tests', 0.083), ('tissue', 0.08), ('xb', 0.08), ('age', 0.078), ('mouse', 0.066), ('genes', 0.066), ('langsrud', 0.061), ('spleen', 0.061), ('zahn', 0.061), ('est', 0.059), ('statistic', 0.058), ('histogram', 0.057), ('pca', 0.055), ('rn', 0.053), ('dichotomies', 0.053), ('microarray', 0.052), ('residuals', 0.051), ('aging', 0.049), ('gonad', 0.049), ('hippocampus', 0.049), ('spinal', 0.049), ('orthogonal', 0.049), ('oe', 0.047), ('dichotomy', 0.047), ('test', 0.044), ('null', 0.044), ('histograms', 0.044), ('ui', 0.043), ('variable', 0.042), ('striatum', 0.042), ('clumping', 0.042), ('testing', 0.041), ('covariate', 0.041), ('tu', 0.041), ('iid', 0.04), ('matrix', 0.038), ('rotate', 0.038), ('sex', 0.038), ('cord', 0.037), ('perry', 0.037), ('qpqt', 0.037), ('orthogonally', 0.035), ('rotated', 0.035), ('microarrays', 0.035), ('ica', 0.033), ('hypothesis', 0.033), ('regression', 0.032), ('rare', 0.032), ('ox', 0.032), ('estimated', 0.031), ('gene', 0.03), ('noise', 0.03), ('array', 0.029), ('oy', 0.028), ('eye', 0.028), ('spiked', 0.028), ('tailed', 0.028), ('gaussian', 0.028), ('correlations', 0.027), ('covariance', 0.027), ('correlation', 0.027), ('correlated', 0.027), ('pursuit', 0.027), ('replicates', 0.026), ('stanford', 0.026), ('apparent', 0.025), ('baik', 0.025), ('bwct', 0.025), ('crop', 0.025), ('crossa', 0.025), ('deem', 0.025), ('dias', 0.025), ('dos', 0.025), ('epj', 0.025), ('kidney', 0.025), ('marrow', 0.025), ('ohy', 0.025), ('southworth', 0.025), ('uica', 0.025), ('upca', 0.025), ('validating', 0.025), ('principal', 0.024), ('treating', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 6 jmlr-2010-A Rotation Test to Verify Latent Structure
Author: Patrick O. Perry, Art B. Owen
Abstract: In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure of non-Gaussianity in the histogram of the principal eigenvector components instead of on the eigenvalue. The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. Keywords: independent components analysis, Kronecker covariance, latent variables, projection pursuit, transposable data
2 0.11322987 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
3 0.072344959 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
4 0.070093475 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
Author: Ryo Yoshida, Mike West
Abstract: We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate “artificial” posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data. Keywords: annealing, graphical factor models, variational mean-field method, MAP estimation, sparse factor analysis, gene expression profiling
5 0.061275106 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
6 0.049087469 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
7 0.048487544 36 jmlr-2010-Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity
8 0.04778925 27 jmlr-2010-Consistent Nonparametric Tests of Independence
9 0.043063551 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
10 0.042531945 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
11 0.042355418 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
12 0.037842289 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm
13 0.036270384 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
14 0.035933059 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
15 0.031519141 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
16 0.031393927 72 jmlr-2010-Matrix Completion from Noisy Entries
17 0.030906191 84 jmlr-2010-On Spectral Learning
18 0.029900052 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
19 0.029593376 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
20 0.029224467 69 jmlr-2010-Lp-Nested Symmetric Distributions
topicId topicWeight
[(0, -0.154), (1, 0.025), (2, -0.025), (3, 0.022), (4, -0.024), (5, -0.082), (6, 0.062), (7, -0.033), (8, 0.004), (9, -0.048), (10, -0.003), (11, 0.192), (12, -0.041), (13, 0.041), (14, 0.089), (15, 0.033), (16, -0.07), (17, 0.084), (18, -0.04), (19, -0.038), (20, -0.091), (21, 0.102), (22, -0.203), (23, 0.056), (24, -0.027), (25, -0.079), (26, 0.169), (27, 0.308), (28, 0.029), (29, 0.057), (30, -0.05), (31, -0.099), (32, 0.02), (33, -0.184), (34, 0.048), (35, -0.009), (36, -0.04), (37, 0.053), (38, 0.083), (39, -0.005), (40, 0.316), (41, -0.107), (42, 0.066), (43, 0.102), (44, 0.04), (45, 0.088), (46, -0.107), (47, 0.133), (48, -0.284), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.97243983 6 jmlr-2010-A Rotation Test to Verify Latent Structure
Author: Patrick O. Perry, Art B. Owen
Abstract: In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure of non-Gaussianity in the histogram of the principal eigenvector components instead of on the eigenvalue. The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. Keywords: independent components analysis, Kronecker covariance, latent variables, projection pursuit, transposable data
2 0.53723979 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
3 0.3883113 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
Author: Ryo Yoshida, Mike West
Abstract: We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate “artificial” posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data. Keywords: annealing, graphical factor models, variational mean-field method, MAP estimation, sparse factor analysis, gene expression profiling
4 0.27379301 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
5 0.26067019 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
6 0.25455436 36 jmlr-2010-Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity
7 0.24769668 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
8 0.23988976 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
9 0.22591275 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
10 0.20486502 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
11 0.17843905 77 jmlr-2010-Model-based Boosting 2.0
12 0.17321633 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm
13 0.15937905 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
14 0.15902296 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
15 0.15889449 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
16 0.1582839 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
17 0.15686415 109 jmlr-2010-Stochastic Composite Likelihood
18 0.15298685 102 jmlr-2010-Semi-Supervised Novelty Detection
19 0.15111126 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
20 0.15039042 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
topicId topicWeight
[(3, 0.016), (4, 0.017), (8, 0.013), (21, 0.012), (32, 0.031), (33, 0.016), (36, 0.024), (37, 0.037), (75, 0.678), (81, 0.01), (85, 0.055)]
simIndex simValue paperId paperTitle
1 0.99833596 77 jmlr-2010-Model-based Boosting 2.0
Author: Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, Benjamin Hofner
Abstract: We describe version 2.0 of the R add-on package mboost. The package implements boosting for optimizing general risk functions using component-wise (penalized) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data. Keywords: component-wise functional gradient descent, splines, decision trees 1. Overview The R add-on package mboost (Hothorn et al., 2010) implements tools for fitting and evaluating a variety of regression and classification models that have been suggested in machine learning and statistics. Optimization within the empirical risk minimization framework is performed via a component-wise functional gradient descent algorithm. The algorithm originates from the statistical view on boosting algorithms (Friedman et al., 2000; Bühlmann and Yu, 2003). The theory and its implementation in mboost allow for fitting complex prediction models, taking potentially many interactions of features into account, as well as for fitting additive and linear models. The model class the package deals with is best described by so-called structured additive regression (STAR) models, where some characteristic ξ of the conditional distribution of a response variable Y given features X is modeled through a regression function f of the features ξ(Y |X = x) = f (x). In order to facilitate parsimonious and interpretable models, the regression function f is structured, that is, restricted to additive functions f (x) = ∑ p f j (x). Each model component f j (x) might take only j=1 c 2010 Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid and Benjamin Hofner. H OTHORN , B ÜHLMANN , K NEIB , S CHMID AND H OFNER a subset of the features into account. Special cases are linear models f (x) = x⊤ β, additive models f (x) = ∑ p f j (x( j) ), where f j is a function of the jth feature x( j) only (smooth functions or j=1 stumps, for example) or a more complex function where f (x) is implicitly defined as the sum of multiple decision trees including higher-order interactions. The latter case corresponds to boosting with trees. Combinations of these structures are also possible. The most important advantage of such a decomposition of the regression function is that each component of a fitted model can be looked at and interpreted separately for gaining a better understanding of the model at hand. The characteristic ξ of the distribution depends on the measurement scale of the response Y and the scientific question to be answered. For binary or numeric variables, some function of the expectation may be appropriate, but also quantiles or expectiles may be interesting. The definition of ξ is determined by defining a loss function ρ whose empirical risk is to be minimized under some algorithmic constraints (i.e., limited number of boosting iterations). The model is then fitted using n p ( fˆ1 , . . . , fˆp ) = argmin ∑ wi ρ yi , ∑ f j (x) . ( f1 ,..., f p ) i=1 j=1 Here (yi , xi ), i = 1, . . . , n, are n training samples with responses yi and potentially high-dimensional feature vectors xi , and wi are some weights. The component-wise boosting algorithm starts with some offset for f and iteratively fits residuals defined by the negative gradient of the loss function evaluated at the current fit by updating only the best model component in each iteration. The details have been described by Bühlmann and Yu (2003). 
Early stopping via resampling approaches or AIC leads to sparse models in the sense that only a subset of important model components f j defines the final model. A more thorough introduction to boosting with applications in statistics based on version 1.0 of mboost is given by Bühlmann and Hothorn (2007). As of version 2.0, the package allows for fitting models to binary, numeric, ordered and censored responses, that is, regression of the mean, robust regression, classification (logistic and exponential loss), ordinal regression,1 quantile1 and expectile1 regression, censored regression (including Cox, Weibull1 , log-logistic1 or lognormal1 models) as well as Poisson and negative binomial regression1 for count data can be performed. Because the structure of the regression function f (x) can be chosen independently from the loss function ρ, interesting new models can be fitted (e.g., in geoadditive regression, Kneib et al., 2009). 2. Design and Implementation The package incorporates an infrastructure for representing loss functions (so-called ‘families’), base-learners defining the structure of the regression function and thus the model components f j , and a generic implementation of component-wise functional gradient descent. The main progress in version 2.0 is that only one implementation of the boosting algorithm is applied to all possible models (linear, additive, tree-based) and all families. Earlier versions were based on three implementations, one for linear models, one for additive models, and one for tree-based boosting. In comparison to the 1.0 series, the reduced code basis is easier to maintain, more robust and regression tests have been set-up in a more unified way. Specifically, the new code basis results in an enhanced and more user-friendly formula interface. In addition, convenience functions for hyperparameter selection, faster computation of predictions and improved visual model diagnostics are available. 1. Model family is new in version 2.0 or was added after the release of mboost 1.0. 2110 M ODEL - BASED B OOSTING 2.0 Currently implemented base-learners include component-wise linear models (where only one variable is updated in each iteration of the algorithm), additive models with quadratic penalties (e.g., for fitting smooth functions via penalized splines, varying coefficients or bi- and trivariate tensor product splines, Schmid and Hothorn, 2008), and trees. As a major improvement over the 1.0 series, computations on larger data sets (both with respect to the number of observations and the number of variables) are now facilitated by memory efficient implementations of the base-learners, mostly by applying sparse matrix techniques (package Matrix, Bates and Mächler, 2009) and parallelization for a cross-validation-based choice of the number of boosting iterations (per default via package multicore, Urbanek, 2009). A more elaborate description of mboost 2.0 features is available from the mboost vignette.2 3. User Interface by Example We illustrate the main components of the user-interface by a small example on human body fat composition: Garcia et al. (2005) used a linear model for predicting body fat content by means of common anthropometric measurements that were obtained for n = 71 healthy German women. In addition, the women’s body composition was measured by Dual Energy X-Ray Absorptiometry (DXA). The aim is to describe the DXA measurements as a function of the anthropometric features. 
Here, we extend the linear model by i) an intrinsic variable selection via early stopping, ii) additional terms allowing for smooth deviations from linearity where necessary (by means of penalized splines orthogonalized to the linear effect, Kneib et al., 2009), iii) a possible interaction between two variables with known impact on body fat composition (hip and waist circumference) and iv) using a robust median regression approach instead of L2 risk. For the data (available as data frame bodyfat), the model structure is specified via a formula involving the base-learners corresponding to the different model components (linear terms: bols(); smooth terms: bbs(); interactions: btree()). The loss function (here, the check function for the 0.5 quantile) along with its negative gradient function are defined by the QuantReg(0.5) family (Fenske et al., 2009). The model structure (specified using the formula fm), the data and the family are then passed to function mboost() for model fitting:3 R> library(
2 0.99595201 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
3 0.99583441 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1 -regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1 -norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale c 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda. S HI , Y IN , O SHER AND S AJDA
4 0.99456638 28 jmlr-2010-Continuous Time Bayesian Network Reasoning and Learning Engine
Author: Christian R. Shelton, Yu Fan, William Lam, Joon Lee, Jing Xu
Abstract: We present a continuous time Bayesian network reasoning and learning engine (CTBN-RLE). A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuoustime Markov process. This software provides libraries and programs for most of the algorithms developed for CTBNs. For learning, CTBN-RLE implements structure and parameter learning for both complete and partial data. For inference, it implements exact inference and Gibbs and importance sampling approximate inference for any type of evidence pattern. Additionally, the library supplies visualization methods for graphically displaying CTBNs or trajectories of evidence. Keywords: continuous time Bayesian networks, C++, open source software
same-paper 5 0.9911446 6 jmlr-2010-A Rotation Test to Verify Latent Structure
Author: Patrick O. Perry, Art B. Owen
Abstract: In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure of non-Gaussianity in the histogram of the principal eigenvector components instead of on the eigenvalue. The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. Keywords: independent components analysis, Kronecker covariance, latent variables, projection pursuit, transposable data
6 0.98881119 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
7 0.93200058 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
8 0.93117523 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
10 0.92104012 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
11 0.90955871 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
12 0.90725738 84 jmlr-2010-On Spectral Learning
13 0.90588897 40 jmlr-2010-Fast and Scalable Local Kernel Machines
14 0.90512556 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
15 0.90374535 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
16 0.90242821 63 jmlr-2010-Learning Instance-Specific Predictive Models
17 0.90236312 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
18 0.90131265 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
19 0.89941329 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
20 0.89889324 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning