jmlr jmlr2013 jmlr2013-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Aapo Hyvärinen, Stephen M. Smith
Abstract: We present new measures of the causal direction, or direction of effect, between two non-Gaussian random variables. They are based on the likelihood ratio under the linear non-Gaussian acyclic model (LiNGAM). We also develop simple first-order approximations of the likelihood ratio and analyze them based on related cumulant-based measures, which can be shown to find the correct causal directions. We show how to apply these measures to estimate LiNGAM for more than two variables, and even in the case of more variables than observations. We further extend the method to cyclic and nonlinear models. The proposed framework is statistically at least as good as existing ones in the cases of few data points or noisy data, and it is computationally and conceptually very simple. Results on simulated fMRI data indicate that the method may be useful in neuroimaging where the number of time points is typically quite small. Keywords: structural equation model, Bayesian network, non-Gaussianity, causality, independent component analysis
Reference: text
sentIndex sentText sentNum sentScore
1 We also develop simple first-order approximations of the likelihood ratio and analyze them based on related cumulant-based measures, which can be shown to find the correct causal directions. [sent-9, score-0.302]
2 Determining the directions of the connections can be performed by considering each connection separately, which requires, again, analysis of the causal direction between two variables. [sent-31, score-0.373]
3 Thus, we see that measuring pairwise causal directions is a central problem in the theory of LiNGAM and related models. [sent-37, score-0.321]
4 The approach uses the ratio of the likelihoods of the models corresponding to the two directions of causal influence. [sent-40, score-0.288]
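As a concrete illustration of the likelihood-ratio idea in sentence 4, the sketch below compares the two regression directions for a standardized pair (x, y) under a log-cosh stand-in for the non-Gaussian log-density. This is a minimal numpy sketch, not the paper's exact estimator: the log-cosh proxy, the decision rule "positive ratio favours x → y", and the toy Laplacian data are assumptions made here for illustration only.

import numpy as np

def logcosh_loglik(u):
    # Log-density proxy for a standardized super-Gaussian variable,
    # up to an additive constant: log p(u) is taken as -log cosh(u).
    return -np.log(np.cosh(u))

def direction_loglik_ratio(x, y):
    # Quasi-log-likelihood of "x -> y" minus that of "y -> x" for standardized
    # data; the residual scale is sqrt(1 - rho^2) in both directions, so the
    # scale terms cancel in the ratio and are omitted.
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    rho = np.mean(x * y)
    s = np.sqrt(max(1.0 - rho ** 2, 1e-12))
    ll_xy = np.sum(logcosh_loglik(x)) + np.sum(logcosh_loglik((y - rho * x) / s))
    ll_yx = np.sum(logcosh_loglik(y)) + np.sum(logcosh_loglik((x - rho * y) / s))
    return ll_xy - ll_yx

# Toy check on super-Gaussian (Laplacian) data generated as x -> y;
# a positive value is expected here.
rng = np.random.default_rng(0)
x = rng.laplace(size=5000)
y = 0.7 * x + rng.laplace(size=5000)
print(direction_loglik_ratio(x, y) > 0)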
5 The measures of causal directions are derived in Section 2. [sent-46, score-0.275]
6 In fact, we can approximate tanh by a Taylor expansion tanh(u) = u − (1/3)u^3 + O(u^5). [sent-164, score-0.25]
7 Assuming that the relevant kurtosis is positive, which is very often the case for real data, the sign of R̃c4 can be used to determine the causal direction in the same way as in the case of the likelihood approximation R̃ in (5). [sent-176, score-0.318]
8 Thus, the cumulant-based approach allowed us to prove rigorously that a nonlinear correlation of the form (6) can be used to infer the causal direction, since it takes opposite signs under the two models. [sent-177, score-0.25]
9 This interpretation is valid for the tanh-based nonlinear correlation as well, because we can use the function h(u) = u − tanh(u) instead of tanh to measure the same correlations but with opposite sign. [sent-209, score-0.354]
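Sentences 6 to 9 describe a first-order, tanh-based approximation of the likelihood ratio and its cumulant interpretation. Equations (5) and (6) are not reproduced in the excerpts, so the statistic below (rho_hat times the sample mean of x*tanh(y) - tanh(x)*y, with x → y inferred when it is positive for super-Gaussian data) is an assumption pieced together from the surrounding description; only the sign flip obtained by replacing tanh with h(u) = u - tanh(u), stated in sentence 9, is guaranteed by the code as written.

import numpy as np

def pairwise_tanh_measure(x, y):
    # Hedged sketch of a tanh-based nonlinear-correlation statistic for
    # standardized x, y; the sign convention is assumed, not copied from eq. (6).
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    rho = np.mean(x * y)
    return rho * np.mean(x * np.tanh(y) - np.tanh(x) * y)

def pairwise_h_measure(x, y):
    # Same statistic with h(u) = u - tanh(u) in place of tanh; expanding the
    # product shows it equals minus the tanh-based statistic, i.e., the same
    # correlations with opposite sign, as noted in sentence 9.
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    h = lambda u: u - np.tanh(u)
    rho = np.mean(x * y)
    return rho * np.mean(x * h(y) - h(x) * y)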
10 (10) The justification for this definition is in the following theorem, which is the analogue of Theorem 1: Theorem 2 If the causal direction is x → y, we have R̃c3 = skew(x)(ρ^2 − ρ^3) (11) and if the causal direction is the opposite, we have R̃c3 = skew(y)(ρ^3 − ρ^2). [sent-260, score-0.418]
11 Thus, we obtain the causal ordering directly from a single matrix of nonlinear correlations, without any deflation. [sent-341, score-0.276]
12 It has the benefit of being computationally extremely simple, and it gives a simple conceptual link between causal ordering and the nonlinear correlations and cumulants. [sent-344, score-0.31]
13 Using pairwise connections makes sense if we further assume that there are no pairwise loops, that is, connections x_i → x_j and x_j → x_i are not both non-zero. [sent-436, score-0.288]
14 Other approaches to inferring the causal direction with nonlinear relations were introduced by Zhang and Hyvärinen (2009), Daniušis et al. [sent-449, score-0.323]
15 Since basic FastICA, which is an integral part of the method, has convergence problems with the basic tanh nonlinearity in the case of a small sample size, we used the stabilized version by Hyvärinen (1999) obtained in the standard FastICA package by the option “stabilize”. [sent-499, score-0.341]
16 Third, the nonlinear correlations in (18) were used to estimate the causal ordering without any deflation, simply by locating the minimum of the row sums of that matrix, removing the corresponding rows and columns, and so on, as described at the end of section 3. [sent-531, score-0.31]
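Sentence 16 spells out the deflation-free ordering procedure itself; the sketch below implements only that bookkeeping. How the matrix of nonlinear correlations in equation (18) is built is not reproduced in the excerpts, so the matrix M is taken as a given input here.

import numpy as np

def causal_order_from_matrix(M):
    # Estimate a causal ordering from an n x n matrix M of pairwise statistics,
    # following the description above: repeatedly locate the minimum of the
    # row sums, put that variable next in the ordering, and remove the
    # corresponding row and column.
    M = np.asarray(M, dtype=float)
    remaining = list(range(M.shape[0]))
    order = []
    while remaining:
        sub = M[np.ix_(remaining, remaining)]
        k = int(np.argmin(sub.sum(axis=1)))
        order.append(remaining.pop(k))
    return order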
17 The method without deflation (“nodf”) gives, by definition, the same result for the total causal directions correct and first variable found, but looking at the rank-correlations, we see that it is typically the second best. [sent-534, score-0.282]
18 Again, we fixed the sample size to T = 10,000 and the noise variance to two (larger than above since these methods seem to be more tolerant to Gaussian noise), while the dimension and the skewed data distribution were varied. [sent-566, score-0.394]
[Figure 3 (Simulation 1): bar charts of mean rank correlations, total causal directions correct, first variable found, and computation time for the tanh, nodf, mxnt, ICA and kdir methods at n=2,T=100; n=5,T=100; n=5,T=200; n=5,T=400.]
[Figure 4 (Simulation 2, with noise): same panels and methods as Figure 3, with n=2, 3, 5, 8.]
[Figure 5 (Simulation 3, with skewed data): panels for mean rank correlations, total causal directions correct, first variable found, and computation time for the skew, skw2, mxnt, ICA, kdir and D−R methods at n=2,T=100 and n=5,T=200 with skewed pdfs 1 and 2.]
[Figure 6 (Simulation 4, with skewed data with noise): same panels and methods as Figure 5, with n=2, 3, 5, 8.]
35 We carried out the same simulation for skewed influences using the skewed pdf 1. [sent-672, score-0.335]
36 Furthermore, we divided the simulations into three groups: basic data (simulations 1 and 2), skewed data (simulations 3 and 4) and sparse connections either with sparse data (simulation 5) or skewed data (simulation 6). [sent-680, score-0.334]
[Figure 7 (Simulation 5, with the two-stage pruning method and only sparse graphs): panels for mean rank correlations, total causal directions correct, first variable found, and computation time for the tanh, nodf, mxnt, icth, ICA and kdir methods at n=9,T=500; n=9,T=900; n=15,T=900; n=20,T=900.]
[Figure 8 (Simulation 6, with skewed data, the two-stage pruning method and only sparse graphs): same panels for the skew, skw2, mxnt, icsk, ics2, ICA and kdir methods at n=9,T=500; n=9,T=900; n=15,T=900; n=20,T=900.]
[Figure 9 (Overview of Simulations 1–6): summary panels, including "Sparsely connected data" and "Sparsely connected, skewed data", for the methods compared in Figures 3–8.]
[Figure 10 (Simulation 7, with more variables than observations): computation time for tanh, nodf and mxnt at n=100,T=100; n=200,T=100; n=200,T=200; n=400,T=200.]
47 Rank correlations and causal directions correct are omitted because we only computed the first two variables for lack of computation time. [sent-781, score-0.338]
48 tanh and maxent introduced above in a purely linear way (i. [sent-782, score-0.25]
[Figure 12 (Simulation 9, with nonlinear model): total causal directions correct and computation time for tanh, mxnt, nlme, hsic and mad.]
51 The new algorithms are “nlme”, the proposed likelihood ratio method extended to the nonlinear case using maximum entropy approximation in (22); “mad”, a simplified and robustified approximation of the likelihood ratio in (24); “hsic”, the original nonlinear method using independence (Hoyer et al. [sent-816, score-0.319]
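Sentence 51 only names the nonlinear variants; as a rough illustration of the shared idea (fit a nonlinear regression in each direction and score the exogenous variable together with the residual under a non-Gaussian quasi-likelihood), here is a sketch using a cubic polynomial fit and a log-cosh proxy. The polynomial regressor, the log-cosh proxy and the comparison rule are choices made here for illustration; they are not the paper's "nlme" maximum entropy approximation in (22) or the "mad" approximation in (24).

import numpy as np

def logcosh_loglik(u):
    return -np.log(np.cosh(u))

def nonlinear_direction_score(x, y, degree=3):
    # Quasi-log-likelihood of the model x -> y with a polynomial regression of
    # y on x; compare with nonlinear_direction_score(y, x) and prefer the
    # direction with the larger score. Illustrative sketch only.
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    coeffs = np.polyfit(x, y, degree)          # nonlinear (polynomial) regression
    resid = y - np.polyval(coeffs, x)
    sigma = resid.std() + 1e-12
    # change of variables: log p(x) + log p(resid / sigma) - log sigma per sample
    return (np.sum(logcosh_loglik(x))
            + np.sum(logcosh_loglik(resid / sigma))
            - len(x) * np.log(sigma))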
[Figure 13 (Simulation 10): panels for mean rank correlations, total causal directions correct, first variable found, and computation time for the tanh, skew, nodf, mxnt, ICA and kdir methods at n=2,T=100; n=5,T=100; n=5,T=200; n=5,T=400.]
[Figure 14 (Simulation 11): same panels for the tanh, nodf, mxnt, ICA and kdir methods.]
60 The null data was created by testing for connections between timeseries from different subjects’ data sets, which have no causal connections between them (i. [sent-1018, score-0.355]
[Residue of the fMRI benchmarking figures: for each of Simulations 1–28 (varying numbers of nodes, session lengths, TRs, and conditions such as shared inputs, cyclic connections, bad ROIs with new random timeseries mixed in, more connections, stationary or nonstationary connection strengths, neural lags, and 2-group tests), panels report "% directions correct" and "causality (Zright − Zwrong)" for the methods PW−LR tanh, PW−LR skew, PW−LR r skew, ICA−LiNGAM, Patel's, and Patel's bin. A truncated results table comparing mxnt, hsic, nlme and mad on % correct also appears here; its values are not recoverable.]
98 Conclusion. We proposed very simple measures of the pairwise causal direction based on likelihood ratio tests and their approximations. [sent-1204, score-0.396]
99 Pairwise measures of causal direction in linear non-gaussian acyclic models. [sent-1349, score-0.261]
100 Multi-subject search correctly identifies causal connections and most causal directions in the DCM models of the Smith et al. [sent-1411, score-0.501]
wordName wordTfidf (topN-words)
[('skew', 0.365), ('pw', 0.323), ('lingam', 0.288), ('ica', 0.275), ('lr', 0.269), ('tanh', 0.25), ('patel', 0.231), ('causal', 0.18), ('mxnt', 0.176), ('kdir', 0.171), ('hrfstd', 0.126), ('nodf', 0.126), ('sessions', 0.126), ('zright', 0.126), ('zwrong', 0.126), ('cum', 0.122), ('shimizu', 0.108), ('skewed', 0.102), ('directlingam', 0.099), ('minute', 0.097), ('causality', 0.096), ('arinen', 0.095), ('mith', 0.095), ('sem', 0.092), ('atios', 0.09), ('simulation', 0.083), ('ikelihood', 0.081), ('hyv', 0.074), ('yv', 0.073), ('connections', 0.072), ('pairwise', 0.072), ('directions', 0.069), ('rinen', 0.061), ('aussian', 0.06), ('dodge', 0.059), ('simulations', 0.058), ('hoyer', 0.054), ('nonlinear', 0.053), ('likelihood', 0.05), ('disturbances', 0.05), ('cyclic', 0.048), ('pdf', 0.048), ('tr', 0.047), ('cumulants', 0.046), ('fmri', 0.045), ('icth', 0.045), ('kurt', 0.045), ('ordering', 0.043), ('standardized', 0.043), ('skewness', 0.042), ('ratio', 0.039), ('kurtosis', 0.036), ('rousson', 0.036), ('entropy', 0.035), ('cumulant', 0.035), ('correlations', 0.034), ('correct', 0.033), ('uences', 0.032), ('hsic', 0.032), ('icsk', 0.032), ('skewnesses', 0.032), ('timeseries', 0.031), ('nodes', 0.03), ('nonlinearity', 0.03), ('smith', 0.029), ('direction', 0.029), ('noise', 0.029), ('dag', 0.028), ('directionalities', 0.027), ('disturbance', 0.027), ('brain', 0.027), ('measures', 0.026), ('acyclic', 0.026), ('strengths', 0.026), ('ation', 0.026), ('external', 0.025), ('mutual', 0.025), ('gd', 0.024), ('mad', 0.023), ('connection', 0.023), ('residuals', 0.023), ('sign', 0.023), ('yt', 0.023), ('balloon', 0.023), ('gskew', 0.023), ('nlme', 0.023), ('sogawa', 0.023), ('variables', 0.022), ('subjects', 0.021), ('legend', 0.021), ('gx', 0.021), ('mooij', 0.019), ('entropies', 0.019), ('latent', 0.019), ('mi', 0.018), ('dcm', 0.018), ('informations', 0.018), ('refute', 0.018), ('xt', 0.017), ('opposite', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models
Author: Aapo Hyvärinen, Stephen M. Smith
Abstract: We present new measures of the causal direction, or direction of effect, between two non-Gaussian random variables. They are based on the likelihood ratio under the linear non-Gaussian acyclic model (LiNGAM). We also develop simple first-order approximations of the likelihood ratio and analyze them based on related cumulant-based measures, which can be shown to find the correct causal directions. We show how to apply these measures to estimate LiNGAM for more than two variables, and even in the case of more variables than observations. We further extend the method to cyclic and nonlinear models. The proposed framework is statistically at least as good as existing ones in the cases of few data points or noisy data, and it is computationally and conceptually very simple. Results on simulated fMRI data indicate that the method may be useful in neuroimaging where the number of time points is typically quite small. Keywords: structural equation model, Bayesian network, non-Gaussianity, causality, independent component analysis
2 0.089875937 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting
Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb
Abstract: Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification
3 0.070904106 41 jmlr-2013-Experiment Selection for Causal Discovery
Author: Antti Hyttinen, Frederick Eberhardt, Patrik O. Hoyer
Abstract: Randomized controlled experiments are often described as the most reliable tool available to scientists for discovering causal relationships among quantities of interest. However, it is often unclear how many and which different experiments are needed to identify the full (possibly cyclic) causal structure among some given (possibly causally insufficient) set of variables. Recent results in the causal discovery literature have explored various identifiability criteria that depend on the assumptions one is able to make about the underlying causal process, but these criteria are not directly constructive for selecting the optimal set of experiments. Fortunately, many of the needed constructions already exist in the combinatorics literature, albeit under terminology which is unfamiliar to most of the causal discovery community. In this paper we translate the theoretical results and apply them to the concrete problem of experiment selection. For a variety of settings we give explicit constructions of the optimal set of experiments and adapt some of the general combinatorics results to answer questions relating to the problem of experiment selection. Keywords: causality, randomized experiments, experiment selection, separating systems, completely separating systems, cut-coverings
4 0.046953492 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
Author: Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson
Abstract: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine. Keywords: causation, counterfactual reasoning, computational advertising
5 0.040151626 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
Author: Naftali Harris, Mathias Drton
Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
6 0.035162266 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
7 0.028901737 106 jmlr-2013-Stationary-Sparse Causality Network Learning
8 0.028657712 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
9 0.028438631 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
10 0.025620315 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
11 0.025465544 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
12 0.024513822 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
13 0.023815399 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
14 0.023278426 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
15 0.020768249 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
16 0.02068799 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
17 0.020065455 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
18 0.02005405 15 jmlr-2013-Bayesian Canonical Correlation Analysis
19 0.019979911 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
20 0.018888963 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
topicId topicWeight
[(0, -0.103), (1, -0.028), (2, 0.001), (3, 0.024), (4, 0.076), (5, 0.05), (6, -0.029), (7, 0.019), (8, -0.012), (9, 0.061), (10, -0.023), (11, -0.079), (12, -0.061), (13, -0.205), (14, -0.008), (15, 0.101), (16, -0.034), (17, 0.029), (18, 0.042), (19, 0.111), (20, 0.122), (21, 0.083), (22, 0.176), (23, 0.2), (24, -0.369), (25, -0.044), (26, -0.076), (27, 0.077), (28, -0.076), (29, 0.259), (30, -0.048), (31, 0.056), (32, -0.09), (33, 0.118), (34, -0.062), (35, -0.114), (36, 0.094), (37, -0.089), (38, -0.039), (39, 0.109), (40, 0.04), (41, 0.037), (42, 0.148), (43, -0.008), (44, -0.103), (45, 0.134), (46, -0.007), (47, -0.031), (48, 0.039), (49, -0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.94992858 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models
Author: Aapo Hyvärinen, Stephen M. Smith
Abstract: We present new measures of the causal direction, or direction of effect, between two non-Gaussian random variables. They are based on the likelihood ratio under the linear non-Gaussian acyclic model (LiNGAM). We also develop simple first-order approximations of the likelihood ratio and analyze them based on related cumulant-based measures, which can be shown to find the correct causal directions. We show how to apply these measures to estimate LiNGAM for more than two variables, and even in the case of more variables than observations. We further extend the method to cyclic and nonlinear models. The proposed framework is statistically at least as good as existing ones in the cases of few data points or noisy data, and it is computationally and conceptually very simple. Results on simulated fMRI data indicate that the method may be useful in neuroimaging where the number of time points is typically quite small. Keywords: structural equation model, Bayesian network, non-Gaussianity, causality, independent component analysis
2 0.68731296 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting
Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb
Abstract: Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification
3 0.40809688 41 jmlr-2013-Experiment Selection for Causal Discovery
Author: Antti Hyttinen, Frederick Eberhardt, Patrik O. Hoyer
Abstract: Randomized controlled experiments are often described as the most reliable tool available to scientists for discovering causal relationships among quantities of interest. However, it is often unclear how many and which different experiments are needed to identify the full (possibly cyclic) causal structure among some given (possibly causally insufficient) set of variables. Recent results in the causal discovery literature have explored various identifiability criteria that depend on the assumptions one is able to make about the underlying causal process, but these criteria are not directly constructive for selecting the optimal set of experiments. Fortunately, many of the needed constructions already exist in the combinatorics literature, albeit under terminology which is unfamiliar to most of the causal discovery community. In this paper we translate the theoretical results and apply them to the concrete problem of experiment selection. For a variety of settings we give explicit constructions of the optimal set of experiments and adapt some of the general combinatorics results to answer questions relating to the problem of experiment selection. Keywords: causality, randomized experiments, experiment selection, separating systems, completely separating systems, cut-coverings
4 0.29576904 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
Author: Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson
Abstract: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine. Keywords: causation, counterfactual reasoning, computational advertising
5 0.19177586 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library
Author: Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta, Alexander G. Gray
Abstract: MLPACK is a state-of-the-art, scalable, multi-platform C++ machine learning library released in late 2011 offering both a simple, consistent API accessible to novice users and high performance and flexibility to expert users by leveraging modern features of C++. MLPACK provides cutting-edge algorithms whose benchmarks exhibit far better performance than other leading machine learning libraries. MLPACK version 1.0.3, licensed under the LGPL, is available at http://www.mlpack.org. Keywords: C++, dual-tree algorithms, machine learning software, open source software, largescale learning
6 0.18720731 106 jmlr-2013-Stationary-Sparse Causality Network Learning
7 0.1775398 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
8 0.16888791 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
9 0.16624486 82 jmlr-2013-Optimally Fuzzy Temporal Memory
10 0.16531166 15 jmlr-2013-Bayesian Canonical Correlation Analysis
11 0.14910033 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
12 0.14767987 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
13 0.14368357 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
14 0.14082772 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
15 0.12951532 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
16 0.1258439 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
17 0.12476891 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
18 0.12282299 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
19 0.11936657 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
20 0.10547532 104 jmlr-2013-Sparse Single-Index Model
topicId topicWeight
[(0, 0.023), (5, 0.093), (6, 0.037), (10, 0.037), (20, 0.014), (23, 0.021), (44, 0.552), (68, 0.017), (70, 0.019), (75, 0.045), (87, 0.02), (89, 0.014)]
simIndex simValue paperId paperTitle
1 0.95488882 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library
Author: Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta, Alexander G. Gray
Abstract: MLPACK is a state-of-the-art, scalable, multi-platform C++ machine learning library released in late 2011 offering both a simple, consistent API accessible to novice users and high performance and flexibility to expert users by leveraging modern features of C++. MLPACK provides cutting-edge algorithms whose benchmarks exhibit far better performance than other leading machine learning libraries. MLPACK version 1.0.3, licensed under the LGPL, is available at http://www.mlpack.org. Keywords: C++, dual-tree algorithms, machine learning software, open source software, largescale learning
same-paper 2 0.75580198 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models
Author: Aapo Hyvärinen, Stephen M. Smith
Abstract: We present new measures of the causal direction, or direction of effect, between two non-Gaussian random variables. They are based on the likelihood ratio under the linear non-Gaussian acyclic model (LiNGAM). We also develop simple first-order approximations of the likelihood ratio and analyze them based on related cumulant-based measures, which can be shown to find the correct causal directions. We show how to apply these measures to estimate LiNGAM for more than two variables, and even in the case of more variables than observations. We further extend the method to cyclic and nonlinear models. The proposed framework is statistically at least as good as existing ones in the cases of few data points or noisy data, and it is computationally and conceptually very simple. Results on simulated fMRI data indicate that the method may be useful in neuroimaging where the number of time points is typically quite small. Keywords: structural equation model, Bayesian network, non-Gaussianity, causality, independent component analysis
3 0.65151399 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
Author: Vinayak Rao, Yee Whye Teh
Abstract: Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this paper, we tackle the problem of simulating from the posterior distribution over paths in these models, given partial and noisy observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. The first step involves simulating a piecewise-constant inhomogeneous Poisson process, while for the second, we use a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time-discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-ofthe-art MCMC samplers for these models. Keywords: Markov jump process, MCMC, Gibbs sampler, uniformization, Markov-modulated Poisson process, continuous-time Bayesian network
4 0.2665807 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
Author: Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco
Abstract: We present GURLS, a least squares, modular, easy-to-extend software library for efficient supervised learning. GURLS is targeted to machine learning practitioners, as well as non-specialists. It offers a number state-of-the-art training strategies for medium and large-scale learning, and routines for efficient model selection. The library is particularly well suited for multi-output problems (multi-category/multi-label). GURLS is currently available in two independent implementations: Matlab and C++. It takes advantage of the favorable properties of regularized least squares algorithm to exploit advanced tools in linear algebra. Routines to handle computations with very large matrices by means of memory-mapped storage and distributed task execution are available. The package is distributed under the BSD license and is available for download at https://github.com/LCSL/GURLS. Keywords: regularized least squares, big data, linear algebra
5 0.24262416 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang
Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox
6 0.23515636 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
7 0.22526167 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
8 0.21908484 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
10 0.20527866 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
11 0.20523024 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
12 0.20314452 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
13 0.20202802 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
14 0.20190591 120 jmlr-2013-Variational Algorithms for Marginal MAP
15 0.20169726 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
16 0.20129493 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
18 0.20086482 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
19 0.20058188 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
20 0.20053419 59 jmlr-2013-Large-scale SVD and Manifold Learning