jmlr jmlr2009 jmlr2009-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, Pascal Lamblin
Abstract: Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBM) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. This was followed by the proposal of another greedy layer-wise procedure, relying on the usage of autoassociator networks. In the context of the above optimization problem, we study these algorithms empirically to better understand their success. Our experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input. We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and practical aspects of their topology, for example, demonstrating cases where the addition of more depth helps. Finally, we empirically explore simple variants of these training algorithms, such as the use of different RBM input unit distributions, a simple way of combining gradient estimators to improve performance, as well as on-line versions of those algorithms. Keywords: artificial neural networks, deep belief networks, restricted Boltzmann machines, autoassociators, unsupervised learning
Reference: text
sentIndex sentText sentNum sentScore
1 recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBM) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. [sent-12, score-1.374]
2 We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and practical aspects of their topology, for example, demonstrating cases where the addition of more depth helps. [sent-16, score-0.53]
3 Keywords: artificial neural networks, deep belief networks, restricted Boltzmann machines, autoassociators, unsupervised learning 1. [sent-18, score-0.671]
4 However, complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements. [sent-23, score-0.53]
5 Each layer in a multi-layer neural network can be seen as a representation of the input obtained through a learned transformation. [sent-29, score-0.561]
6 The idea of using unsupervised learning at each stage of a deep network was recently put forward by Hinton et al. [sent-44, score-0.712]
7 (2006), as part of a training procedure for the deep belief network (DBN), a generative model with many layers of hidden stochastic variables. [sent-45, score-1.107]
8 Upper layers of a DBN are supposed to represent more “abstract” concepts that explain the input observation x, whereas lower layers extract “low-level features” from x. [sent-46, score-0.539]
9 Hinton (2006) showed that stacking restricted Boltzmann machines (RBMs)— that is, training upper RBMs on the distribution of activities computed by lower RBMs—provides a good initialization strategy for the weights of a deep artificial neural network. [sent-49, score-0.538]
10 , 2007), and in the deep convolutional neural network (Ranzato et al. [sent-52, score-0.538]
11 In this paper, we discuss in detail three principles for training deep neural networks and present experimental evidence that highlights the role of each in successfully training deep networks: 1. [sent-63, score-0.998]
12 using unsupervised learning at each layer in a way that preserves information from the input and disentangles factors of variation; 3. [sent-65, score-0.595]
13 We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and aspects of their topology such as depth and the size of the layers. [sent-68, score-0.53]
14 A deep neural network contains an input layer and an output layer, separated by l layers of hidden units. [sent-73, score-1.372]
15 Given an input sample clamped to the input layer, the other units of the network compute their values according to the activity of the units that they are connected to in the layers below. [sent-74, score-0.697]
16 We will consider a particular sort of topology here, where the input layer is fully connected to the first hidden layer, which is fully connected to the second layer and so on up to the output layer. [sent-75, score-0.934]
17 Given an input x, the value of the j-th unit in the i-th layer is denoted ĥ^i_j(x), with i = 0 referring to the input layer and i = l + 1 referring to the output layer (the use of “ˆ” will become clearer in Section 4). [sent-76, score-0.833]
18 The set of weights W^i_{jk} between ĥ^{i−1}_k(x) in layer i − 1 and unit ĥ^i_j(x) in layer i determines the activation of unit ĥ^i_j(x) as follows: ĥ^i_j(x) = sigm(a^i_j(x)), where a^i_j(x) = b^i_j + ∑_k W^i_{jk} ĥ^{i−1}_k(x), ∀i ∈ {1, . . . , l}. (1) [sent-79, score-1.053]
19 Given the last hidden layer, the output layer is computed similarly by o(x) = ĥ^{l+1}(x) = f(a^{l+1}(x)), where a^{l+1}(x) = b^{l+1} + W^{l+1} ĥ^l(x) (2) and the activation function f(·) depends on the (supervised) task the network must achieve. [sent-83, score-0.745]
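To make Equations 1 and 2 concrete, here is a minimal sketch of the forward pass in NumPy. The softmax output stands in for the task-dependent activation f, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, weights, biases, f=softmax):
    """weights[i], biases[i] parameterize layer i+1; the last pair is the output layer."""
    h = x                                    # h^0(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigm(b + W @ h)                  # Equation 1: h^i(x) = sigm(b^i + W^i h^{i-1}(x))
    return f(biases[-1] + weights[-1] @ h)   # Equation 2: o(x) = f(b^{l+1} + W^{l+1} h^l(x))
```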
20 [Figure 1: Illustration of a deep network and its parameters (input x, hidden layers ĥ^1(x) to ĥ^3(x), output o(x), weights W^1 to W^4 and biases b^0 to b^4).] [sent-85, score-0.724]
21 When an input sample x is presented to the network, the application of Equation 1 at each layer will generate a pattern of activity in the different layers of the neural network. [sent-86, score-0.666]
22 Deep Neural Networks. It has been shown that a “shallow” neural network with only one arbitrarily large hidden layer could approximate a function to any level of precision (Hornik et al. [sent-93, score-0.679]
23 Unfortunately, deep networks trained in that manner have generally been found to perform worse than neural networks with one or two hidden layers. [sent-112, score-0.802]
24 Many semi-supervised learning algorithms also involve a combination of unsupervised and supervised learning, where the unsupervised component can be applied to additional unlabelled data. [sent-140, score-0.517]
25 The greedy layerwise unsupervised strategy provides an initialization procedure, after which the neural network is fine-tuned to the global supervised objective. [sent-151, score-0.527]
26 In the first phase, greedily train subsets of the parameters of the network using a layerwise and unsupervised learning criterion, by repeating the following steps for each layer (i ∈ {1, . [sent-153, score-0.682]
27 , l}) Until a stopping criterion is met, iterate through the training database by (a) mapping input training sample x_t to representation ĥ^{i−1}(x_t) (if i > 1) and hidden representation ĥ^i(x_t), (b) updating parameters b^{i−1}, b^i and W^i of layer i using some unsupervised learning algorithm. [sent-156, score-1.16]
28 , 0 for weight decay), the first phase here will ensure that the parameter solution for each layer found by fine-tuning will not be far from the solution found by the unsupervised learning algorithm. [sent-170, score-0.609]
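The two-phase procedure above can be sketched as follows; layer_maps[i] and layer_updates[i] are hypothetical callables standing in for the layer-i representation ĥ^i and the chosen unsupervised update (e.g., a contrastive divergence step for an RBM), so this is a schematic outline rather than the authors' implementation.

```python
def greedy_pretrain(layer_maps, layer_updates, data, n_epochs):
    """Phase 1: train one layer at a time on representations from frozen lower layers."""
    for i in range(len(layer_maps)):
        for _ in range(n_epochs):          # stand-in for "until a stopping criterion is met"
            for x in data:
                h = x
                for j in range(i):         # map x_t through the already-trained layers below
                    h = layer_maps[j](h)
                layer_updates[i](h)        # update b^{i-1}, b^i and W^i of layer i
    # Phase 2 (not shown): fine-tune all parameters on the global supervised objective.
```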
29 Stacked Restricted Boltzmann Machine Network. Intuitively, a successful learning algorithm for deep networks should be one that discovers a meaningful and possibly complex hidden representation of the data at its top hidden layer. [sent-174, score-0.886]
30 Figure 3 for an illustration), given an input x, it is easy to obtain a hidden representation for that input by computing the posterior ĥ(x) over the layer of binary hidden variables h (we use the “ˆ” symbol to emphasize that ĥ(x) is not a random variable but a deterministic representation of x). [sent-179, score-0.889]
31 Hinton (2006) argues that this representation can be improved by giving it as input to another RBM, whose posterior over its hidden layer will then provide a more complex representation of the input. [sent-180, score-0.645]
32 Finally, the parameters of the RBMs that compute these representations can be used to initialize the parameters of a deep network, which can then be fine-tuned to a particular supervised task. [sent-182, score-0.512]
33 We will refer to deep networks trained using this algorithm as stacked restricted Boltzmann machine (SRBM) networks. [sent-185, score-0.579]
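A short sketch, assuming binary RBMs each parameterized by a weight matrix W and hidden biases c, of how stacked posteriors are composed into the deterministic representation that initializes an SRBM network; the (W, c) interface is an assumption for illustration.

```python
import numpy as np

def rbm_posterior(W, c, x):
    # p(h_j = 1 | x) for the binary hidden units of an RBM
    return 1.0 / (1.0 + np.exp(-(c + W @ x)))

def srbm_representation(rbms, x):
    """rbms: list of (W, c) pairs, each trained on the outputs of the ones below."""
    h = x
    for W, c in rbms:
        h = rbm_posterior(W, c, h)   # deterministic representation fed to the next RBM
    return h                         # these parameters then initialize the deep network
```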
34 An additional hypothesis to explain why this process provides a good initialization for the network is that it makes each hidden layer compute a different, possibly more abstract representation of the input. [sent-190, score-0.712]
35 This is done implicitly, by asking that each layer captures features of the input that help characterize the distribution of values at the layer below. [sent-191, score-0.739]
36 However, there may be other non-linear, unsupervised learning models that, when stacked, are able to improve the learned representation at the last layer added. [sent-195, score-0.574]
37 In this paper, we will consider autoassociator networks of only one hidden layer, meaning that the hidden [. . . ] [Figure 4: Illustration of an autoassociator and its parameters (input x, reconstruction x̂, weights W and W*, biases b_j and c_k).] [sent-199, score-0.864]
38 By noticing the similarity between Equations 3 and 1, we are then able to use the training algorithm for autoassociators as the unsupervised learning algorithm for the greedy layer-wise initialization phase of deep networks. [sent-206, score-0.91]
39 In this paper, stacked autoassociators (SAA) networks will refer to deep networks trained using the procedure of Section 3. [sent-207, score-0.745]
40 However, in that particular case, some care must be taken so that the network does not learn a trivial identity function, that is, finds weights that simply “copy” the whole input vector in the hidden layer and then copy it again at the output. [sent-211, score-0.729]
41 For example, a network with small weights W_{jk} between the input and hidden layers (maintaining activations in the linear regime of the activation function f) and large weights W*_{jk} between the hidden and output layers could encode such an uninteresting identity function. [sent-212, score-1.2]
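A hedged sketch of one training step for such an autoassociator, with separate encoding weights W and reconstruction weights W* as in Figure 4; the cross-entropy reconstruction loss is our choice for illustration.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoassociator_step(W, Wstar, b, c, x, lr=0.1):
    """One SGD step; W: (n_hid, n_vis), Wstar: (n_vis, n_hid), b/c: hidden/output biases."""
    h = sigm(b + W @ x)                 # hidden code, same form as Equation 1
    xr = sigm(c + Wstar @ h)            # reconstruction of the input
    d_out = xr - x                      # cross-entropy gradient wrt output pre-activation
    d_hid = (Wstar.T @ d_out) * h * (1.0 - h)
    Wstar -= lr * np.outer(d_out, h)
    W -= lr * np.outer(d_hid, x)
    b -= lr * d_hid
    c -= lr * d_out
```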
42 Experiments. In this section, we present several experiments set up to evaluate the deep network learning algorithms that fall in the paradigm presented in Section 3. [sent-240, score-0.511]
43 (2007), we performed experiments on two regression data sets, with non-image continuous inputs (UCI Abalone, and a financial data set), demonstrating the use of unsupervised (or partially supervised) pre-training of deep networks on these tasks. [sent-256, score-0.7]
44 On these tasks, deep networks compared favorably to shallow architectures. [sent-259, score-0.552]
45 In other words, this variant simply omits the pre-training phase of the other deep network learning algorithms. [sent-299, score-0.574]
46 We greedily pre-train the layers using a supervised criterion (instead of the unsupervised one), before performing, as before, a final supervised fine-tuning phase. [sent-303, score-0.624]
47 When the training of a layer is finished, we can simply discard the parameters V^i and c^i and move to pre-training the next hidden layer, having initialized W^i and b^i. [sent-307, score-0.64]
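A minimal NumPy sketch of one step of this supervised layer-wise variant: the throwaway output layer (V^i, c^i) is trained jointly with the layer's (W^i, b^i) on the target and later discarded; the softmax/cross-entropy pairing is assumed for illustration.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def supervised_layer_step(W, b, V, c, h_below, y, lr=0.1):
    """h_below: representation from the frozen layers below; y: one-hot target."""
    h = sigm(b + W @ h_below)               # candidate hidden layer
    a = c + V @ h
    p = np.exp(a - a.max()); p /= p.sum()   # temporary softmax output layer
    d_out = p - y
    d_hid = (V.T @ d_out) * h * (1.0 - h)
    V -= lr * np.outer(d_out, h); c -= lr * d_out        # discarded after pre-training
    W -= lr * np.outer(d_hid, h_below); b -= lr * d_hid  # kept to initialize the layer
```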
48 This model can be used to initialize the weights W^i and biases b^i of the i-th hidden layer of a deep network. [sent-314, score-1.052]
49 However, because W in Equation 5 is square, the deep network will need to have hidden layers with the same size as the input layer. [sent-315, score-1.0]
50 The stacked logistic autoregression network will refer to deep networks using this unsupervised layer-wise learning algorithm. [sent-317, score-0.849]
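A hedged sketch of our reading of such a logistic autoregressive model: each input unit is predicted from the preceding ones through a square weight matrix of which only the strictly lower-triangular part is active; the exact parameterization of Equation 5 is an assumption here.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_autoregression_loglik(x, W, b):
    """x: binary input vector; W: square weight matrix; b: biases."""
    a = b + np.tril(W, -1) @ x           # unit k only sees units j < k
    p = sigm(a)                          # p(x_k = 1 | x_{<k})
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
```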
51 We also give results for a “shallow”, one hidden layer neural network, to validate the utility of deep architectures. [sent-321, score-0.966]
52 Instead of the sigmoid, this network uses hyperbolic tangent squashing functions, which are usually found to work better for one hidden layer neural networks. [sent-322, score-0.679]
53 The SRBM and SAA networks had 500, 500 and 2000 hidden units in the first, second and third layers respectively, as in Hinton et al. [sent-326, score-0.63]
54 For the deep networks with supervised or no pre-training, different sizes of hidden layers were compared, including sizes similar to the stacked logistic autoregression network, and to the SRBM and SAA networks. [sent-329, score-1.065]
55 the deep network with supervised layer-wise pre-training particularly highlights the importance of unsupervised learning. [sent-353, score-0.801]
56 Indeed, even though supervised layer-wise pre-training explicitly trains the hidden layers to capture non-linear information about the input, the overall procedure seems to be too greedy with respect to the supervised task to be learned. [sent-354, score-0.684]
57 The explanatory hypothesis we evaluate here is that, without pre-training, the lower layers are initialized poorly, but still allow the top two layers to learn the training set almost perfectly because the output layer and the last hidden layer form a standard shallow but fat neural network. [sent-364, score-1.538]
58 Consider the top two layers of the deep network with pre-training: it presumably takes as input a better representation, one that allows for better generalization. [sent-365, score-0.805]
59 To test this hypothesis, we performed a second series of experiments in which we constrain the top hidden layer to be small (20 hidden units). [sent-367, score-0.735]
60 With no pre-training, training error degrades significantly when there are only 20 hidden units in the top hidden layer. [sent-371, score-0.563]
61 Figures 6 and 7 show the sorts of first hidden layer features (weights going into different hidden neurons) that are learned by the first (bottom) RBM and autoassociator respectively, before fine-tuning. [sent-391, score-0.89]
62 Even if the supervised fine-tuning gradient at the first hidden layer is weak, we can see that the first hidden layer appears to learn a relevant representation. [sent-397, score-1.236]
63 Exploring the Space of Network Architectures. An important practical aspect in using deep networks is the choice of the architecture or topology of the network. [sent-399, score-0.511]
64 First, we would like to know how deep a neural network can be made while still obtaining generalization gains, given a strategy for initializing its parameters (randomly or with unsupervised greedy pre-training). [sent-401, score-0.805]
65 The activation of units of the first hidden layer is obtained by a dot product of such a weight “image” with the input image. [sent-435, score-0.735]
66 Network Depth. One can wonder whether a neural network can be made too deep, that is, whether having too many hidden layers can worsen the generalization performance. [sent-479, score-0.579]
67 On the other hand, it is less clear to what extent the performance can worsen, since a neuron added at the top layer of a neural network does not increase the capacity in the same way as a neuron added “in parallel” in a given hidden layer. [sent-482, score-0.679]
68 Also, in the case of an SRBM network, we can imagine that as we stack RBMs, the representation at a hidden layer contains units that correspond to more and more disentangled concepts of the input. [sent-483, score-0.689]
69 Now, consider a hypothetical deep network where the top-level stacked RBM has learned a representation made of units that are mostly independent. [sent-484, score-0.728]
70 Table 3 presents the classification performance obtained by the different deep networks with up to 4 hidden layers on MNIST-small and MNIST-rotation. [sent-542, score-0.908]
71 As for MNIST-rotation, the size of each hidden layer had to be validated separately for each layer, and we tested values among 500, 1000, 2000 and 4000. [sent-548, score-0.54]
72 Table 3 shows that there is indeed an optimal number of hidden layers for the deep networks, and that this optimum tends to be larger when unsupervised greedy layer-wise learning is used. [sent-549, score-1.106]
73 For the MNIST-small data set (Table 3), the gain in performance between 2 and 3 hidden layers for SRBM and SAA networks is not statistically significant. [sent-550, score-0.509]
74 The improvement remains significant when fixing the network’s hidden layers to the same size as in the experiments on MNIST-small, as shown in the results of Table 4, where the number of units per hidden layer was set to 1000. [sent-553, score-1.101]
75 We also compared the performance of shallow and deep SRBM networks with roughly the same number of parameters. [sent-554, score-0.552]
76 Every time one wants to train a 4-hidden-layer network, networks with 1, 2 and 3 hidden layers effectively have to be trained as well, in order to determine appropriate hyperparameters for the lower hidden layers. [sent-575, score-1.361]
77 Moreover, the optimal hidden layer size for a 1-hidden-layer network could be much bigger than necessary for a 4-hidden-layer network, since a shallow network cannot rely on upper layers to increase its capacity. [sent-577, score-1.978]
78 Let us consider the situation where the number of hidden layers of a deep network has already been chosen and good sizes of the different layers must be found. [sent-578, score-1.196]
79 It might be that the size of a hidden layer has a significant influence on the optimum value for these hyperparameters, and that tying them for all hidden layers induces a bias towards networks with equally-sized hidden layers. [sent-584, score-1.244]
80 (2005) also show how to derive RBMs from arbitrary choices of exponential distributions for the visible and hidden layers of an RBM. [sent-611, score-0.524]
81 3, such extensions have a very significant impact on the nature of the solution learned for the RBM’s weights and hence on the initialization of a deep network and its performance. [sent-614, score-0.571]
82 Linear Energy: Exponential or Truncated Exponential. Consider a unit with value x_k in an RBM, connected to units h of the layer above. [sent-616, score-0.506]
83 In this case the variance is unconditional, whereas the mean depends on the inputs of the unit: for a visible unit x_k with hidden layer h and inverse variance d_k^2, E[x_k | h] = a(h)/(2 d_k^2). [sent-672, score-0.67]
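Under this reading, sampling a Gaussian visible unit reduces to the sketch below; the mean a(h)/(2 d_k^2) and variance 1/(2 d_k^2) follow from the quadratic energy parameterization with inverse-variance d_k^2, and are our reconstruction rather than a quotation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_unit(a_h, d_k):
    """a_h: total input a(h) from the hidden layer; d_k: the unit's d parameter."""
    mean = a_h / (2.0 * d_k**2)          # input-dependent mean (assumed parameterization)
    std = np.sqrt(1.0 / (2.0 * d_k**2))  # unconditional variance
    return rng.normal(mean, std)
```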
84 Impact on Classification Performance. In order to assess the impact of the choice of visible layer distribution on the ultimate performance of an SRBM network, we trained and compared different deep networks whose first-level RBM had binary, truncated exponential or Gaussian input units. [sent-681, score-1.015]
85 These networks all had 3 hidden layers, with 2000 hidden units for each of these layers. [sent-682, score-0.58]
86 is having to decide the number of unsupervised training iterations for each layer before starting the fine-tuning. [sent-812, score-0.598]
87 One possibility is then to execute all phases simultaneously, that is, train all layers based on both their greedy unsupervised and global supervised gradients. [sent-813, score-0.625]
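The combined variant amounts to summing the two gradient estimators at each update, as in this sketch; names and the shared learning rate are illustrative.

```python
def combined_update(params, grads_unsup, grads_sup, lr=0.1):
    """params, grads_unsup, grads_sup: parallel lists of per-layer parameter arrays."""
    for theta, gu, gs in zip(params, grads_unsup, grads_sup):
        theta -= lr * (gu + gs)   # one step on the combined (unsupervised + supervised) gradient
```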
88 In the first phase, all layers of networks were simultaneously trained according to their unsupervised criterion without fine-tuning. [sent-823, score-0.558]
89 As a baseline for the second phase, we also give the performance of the networks when unsupervised learning is stopped and only the parameters of the output layer are trained. [sent-832, score-0.615]
90 During the first half of training, all hidden layers are trained according to CD and the output layer is trained according to the supervised objective, for all curves. [sent-842, score-0.96]
91 In the second phase, all combinations of two possibilities are displayed: CD training is performed at all hidden layers (“CD”) or not (“No CD”), and all hidden layers are fine-tuned according to the supervised objective (“hidden supervised fine-tuning”) or not (“no hidden supervised fine-tuning”). [sent-843, score-1.394]
92 These experiments show that one can eliminate the multiple unsupervised phases: each layer can be pre-trained in a way that simply ignores what the layers above are doing. [sent-867, score-0.891]
93 Finally, better model selection techniques that would permit reducing the number of hyperparameters would be beneficial and will need to be developed for deep network learning algorithms to become easier to use. [sent-883, score-0.561]
94 Restricted Boltzmann Machine. A restricted Boltzmann machine is an energy-based generative model defined over a visible layer v (sometimes called input) and a hidden layer h (sometimes called hidden factors or representation). [sent-913, score-1.199]
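For concreteness, a minimal binary-RBM sketch with the standard energy E(v, h) = −b⊤v − c⊤h − h⊤Wv and a CD-1 update, the kind of unsupervised step used for each layer of an SRBM network; the learning rate and the use of probabilities in the sufficient statistics are conventional choices, not necessarily the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def energy(v, h, W, b, c):
    return -(b @ v + c @ h + h @ W @ v)

def cd1_step(W, b, c, v0, lr=0.1):
    ph0 = sigm(c + W @ v0)                       # p(h = 1 | v0), positive phase
    h0 = (rng.random(ph0.shape) < ph0) * 1.0     # sampled hidden state
    pv1 = sigm(b + W.T @ h0)                     # reconstruction p(v = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1) * 1.0
    ph1 = sigm(c + W @ v1)                       # negative phase
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
```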
95 Deep Belief Network. We wish to make a quick remark on the distinction between the SRBM network and the more widely known deep belief network (DBN) (Hinton et al. [sent-963, score-0.667]
96 It actually corresponds to a sigmoid belief network (Neal, 1992) of l − 1 hidden layers, where the prior over its top hidden layer h^{l−1} (second factor of Equation 12) is an RBM, which itself has a hidden layer h^l. [sent-968, score-1.459]
97 More precisely, it defines a distribution over an input layer x and l layers of binary stochastic units h^i as follows: [sent-969, score-0.76]
98 p(x, h^1, . . . , h^l) = (∏_{i=1}^{l−1} p(h^{i−1} | h^i)) p(h^{l−1}, h^l) (12), where hidden units are conditionally independent given the units in the layer above: p(h^{i−1} | h^i) = ∏_k p(h^{i−1}_k | h^i). [sent-972, score-0.918]
99 We emphasize the distinction between h^i and ĥ^i(x), where the former is a random variable and the latter is the representation of an input x at the i-th hidden layer of the network obtained from the repeated application of Equation 1. [sent-976, score-0.883]
100 Training MLPs layer by layer using an objective function for internal representations. [sent-1215, score-0.69]
wordName wordTfidf (topN-words)
[('deep', 0.399), ('layer', 0.345), ('layers', 0.245), ('srbm', 0.226), ('rbm', 0.22), ('unsupervised', 0.201), ('hidden', 0.195), ('autoassociator', 0.155), ('hinton', 0.138), ('eep', 0.136), ('amblin', 0.129), ('arochelle', 0.129), ('ouradour', 0.129), ('trategies', 0.123), ('xploring', 0.123), ('units', 0.121), ('saa', 0.115), ('network', 0.112), ('engio', 0.11), ('eural', 0.104), ('rbms', 0.103), ('autoassociators', 0.097), ('sigm', 0.097), ('etworks', 0.093), ('larochelle', 0.09), ('supervised', 0.089), ('mnist', 0.086), ('salakhutdinov', 0.086), ('xt', 0.085), ('shallow', 0.084), ('raining', 0.083), ('contrastive', 0.078), ('hi', 0.077), ('bj', 0.071), ('networks', 0.069), ('stacked', 0.068), ('hl', 0.068), ('gradient', 0.067), ('greedy', 0.066), ('boltzmann', 0.063), ('phase', 0.063), ('reconstruction', 0.06), ('generative', 0.06), ('geoffrey', 0.059), ('visible', 0.059), ('bengio', 0.057), ('training', 0.052), ('hyperparameters', 0.05), ('input', 0.049), ('bi', 0.048), ('architectures', 0.047), ('energy', 0.046), ('yoshua', 0.046), ('hij', 0.045), ('oyt', 0.045), ('dbn', 0.044), ('belief', 0.044), ('ranzato', 0.044), ('pixel', 0.043), ('trained', 0.043), ('al', 0.04), ('xk', 0.04), ('vk', 0.039), ('ruslan', 0.039), ('jk', 0.039), ('biases', 0.037), ('cd', 0.036), ('depth', 0.035), ('iro', 0.033), ('yann', 0.033), ('wi', 0.032), ('initialization', 0.032), ('circuit', 0.032), ('inputs', 0.031), ('gurations', 0.029), ('sigmoid', 0.028), ('divergence', 0.028), ('weights', 0.028), ('representation', 0.028), ('delalleau', 0.027), ('softmax', 0.027), ('bl', 0.027), ('neural', 0.027), ('hugo', 0.027), ('encoder', 0.026), ('fahlman', 0.026), ('lengell', 0.026), ('unlabelled', 0.026), ('wl', 0.026), ('truncated', 0.026), ('images', 0.026), ('exponential', 0.025), ('validation', 0.025), ('activation', 0.025), ('etwork', 0.025), ('neurons', 0.025), ('discriminative', 0.024), ('representations', 0.024), ('ck', 0.024), ('train', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000018 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
2 0.086745948 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks
Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia
Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of Lipschitz1 functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints. Keywords: neural networks, universal approximation, monotonicity, convexity, call options
3 0.084732331 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
Author: Pradip Ghanty, Samrat Paul, Nikhil R. Pal
Abstract: In this paper we propose a new multilayer classifier architecture. The proposed hybrid architecture has two cascaded modules: feature extraction module and classification module. In the feature extraction module we use the multilayered perceptron (MLP) neural networks, although other tools such as radial basis function (RBF) networks can be used. In the classification module we use support vector machines (SVMs)—here also other tools such as MLP or RBF can be used. The feature extraction module has several sub-modules each of which is expected to extract features capturing the discriminating characteristics of different areas of the input space. The classification module classifies the data based on the extracted features. The resultant architecture with MLP in feature extraction module and SVM in classification module is called NEUROSVM. The NEUROSVM is tested on twelve benchmark data sets and the performance of the NEUROSVM is found to be better than both MLP and SVM. We also compare the performance of the proposed architecture with that of two ensemble methods: majority voting and averaging. Here also the NEUROSVM is found to perform better than these two ensemble methods. Further we explore the use of MLP and RBF in the classification module of the proposed architecture. The most attractive feature of NEUROSVM is that it practically eliminates the severe dependency of SVM on the choice of kernel. This has been verified with respect to both linear and non-linear kernels. We have also demonstrated that for the feature extraction module, the full training of MLPs is not needed. Keywords: feature extraction, neural networks (NNs), support vector machines (SVMs), hybrid system, majority voting, averaging
4 0.078026474 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
Author: Roberto Esposito, Daniele P. Radicioni
Abstract: The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity. We provide theoretical grounding together with solid empirical results supporting two chief facts. CarpeDiem always finds the optimal solution requiring, in most cases, only a small fraction of the time taken by the Viterbi algorithm; meantime, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement. Keywords: Viterbi algorithm, sequence labeling, conditional models, classifiers optimization, exact inference
5 0.062154662 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
Author: Kristian Woodsend, Jacek Gondzio
Abstract: Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI. Keywords: linear SVM training, hybrid parallelism, largescale learning, interior point method
6 0.060409553 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
7 0.053789843 90 jmlr-2009-Structure Spaces
8 0.048217945 23 jmlr-2009-Discriminative Learning Under Covariate Shift
9 0.044932354 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
10 0.044722021 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
11 0.042290013 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
12 0.041615374 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
13 0.039144937 44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths
14 0.038545139 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
15 0.035465784 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
16 0.034473069 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
17 0.034340989 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
18 0.034261011 13 jmlr-2009-Bounded Kernel-Based Online Learning
19 0.033587545 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
20 0.03237972 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
topicId topicWeight
[(0, 0.183), (1, 0.008), (2, 0.031), (3, 0.01), (4, 0.085), (5, -0.09), (6, -0.009), (7, 0.039), (8, 0.002), (9, -0.025), (10, 0.058), (11, 0.124), (12, 0.071), (13, -0.131), (14, -0.165), (15, 0.039), (16, -0.333), (17, 0.195), (18, 0.022), (19, -0.304), (20, -0.046), (21, -0.213), (22, 0.037), (23, -0.086), (24, 0.02), (25, -0.085), (26, 0.048), (27, 0.025), (28, 0.016), (29, 0.025), (30, 0.163), (31, -0.047), (32, 0.002), (33, 0.052), (34, 0.081), (35, 0.048), (36, -0.063), (37, 0.007), (38, 0.12), (39, 0.001), (40, -0.078), (41, 0.02), (42, 0.034), (43, -0.054), (44, 0.045), (45, -0.031), (46, -0.051), (47, -0.003), (48, 0.005), (49, -0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.95727044 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
2 0.59919953 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks
4 0.41392446 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
5 0.30449128 44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths
Author: Dana Angluin, James Aspnes, Jiang Chen, David Eisenstat, Lev Reyzin
Abstract: We define a model of learning probabilistic acyclic circuits using value injection queries, in which fixed values are assigned to an arbitrary subset of the wires and the value on the single output wire is observed. We adapt the approach of using test paths from the Circuit Builder algorithm (Angluin et al., 2009) to show that there is a polynomial time algorithm that uses value injection queries to learn acyclic Boolean probabilistic circuits of constant fan-in and log depth. We establish upper and lower bounds on the attenuation factor for general and transitively reduced Boolean probabilistic circuits of test paths versus general experiments. We give computational evidence that a polynomial time learning algorithm using general value injection experiments may not do much better than one using test paths. For probabilistic circuits with alphabets of size three or greater, we show that the test path lemmas (Angluin et al., 2009, 2008b) fail utterly. To overcome this obstacle, we introduce function injection queries, in which the values on a wire may be mapped to other values rather than just to themselves or constants, and prove a generalized test path lemma for this case. Keywords: nonadaptive learning algorithms, probabilistic circuits, causal Bayesian networks, value injection queries, test paths
6 0.29159951 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
7 0.24340399 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
8 0.24168214 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
9 0.23299678 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
10 0.22844031 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
11 0.22701888 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
12 0.2037414 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
13 0.19318846 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
14 0.18742868 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
15 0.17934114 90 jmlr-2009-Structure Spaces
16 0.17315972 23 jmlr-2009-Discriminative Learning Under Covariate Shift
17 0.17224398 13 jmlr-2009-Bounded Kernel-Based Online Learning
18 0.16639476 11 jmlr-2009-Bayesian Network Structure Learning by Recursive Autonomy Identification
19 0.16388878 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
20 0.16223986 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models (Machine Learning Open Source Software Paper)
topicId topicWeight
[(8, 0.02), (11, 0.018), (19, 0.017), (26, 0.014), (38, 0.046), (47, 0.012), (52, 0.05), (55, 0.038), (58, 0.022), (66, 0.085), (68, 0.046), (87, 0.462), (90, 0.053), (96, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.80021584 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
Author: Facundo Bromberg, Dimitris Margaritis
Abstract: We address the problem of improving the reliability of independence-based causal discovery algorithms that results from the execution of statistical independence tests on small data sets, which typically have low reliability. We model the problem as a knowledge base containing a set of independence facts that are related through Pearl’s well-known axioms. Statistical tests on finite data sets may result in errors in these tests and inconsistencies in the knowledge base. We resolve these inconsistencies through the use of an instance of the class of defeasible logics called argumentation, augmented with a preference function, that is used to reason about and possibly correct errors in these tests. This results in a more robust conditional independence test, called an argumentative independence test. Our experimental evaluation shows clear positive improvements in the accuracy of argumentative over purely statistical tests. We also demonstrate significant improvements on the accuracy of causal structure discovery from the outcomes of independence tests both on sampled data from randomly generated causal models and on real-world data sets. Keywords: independence-based causal discovery, causal Bayesian networks, structure learning, argumentation, reliability improvement 1. Introduction and Motivation Directed graphical models, also called Bayesian networks, can be used to represent the probability distribution of a domain. This makes them a useful and important tool for machine learning where a common task is inference, that is, predicting the probability distribution of a variable of interest given some other knowledge, usually in the form of values of other variables in the domain. An additional use of Bayesian networks comes by augmenting them with causal semantics that represent cause and effect relationships in the domain. The resulting networks are called causal. An important problem is inferring the structure of these networks, a process that is sometimes called causal discovery, which can provide insights into the underlying data generation process. Two major classes of algorithms exist for learning the structure of Bayesian networks. One class contains so-called score-based methods, which learn the structure by conducting a search in the space of all structures in an attempt to find the structure of maximum score. This score is usually penalized log-likelihood, for example, the Bayesian Information Criterion (BIC) or the (equivalent) Minimum Description Length (MDL). A second class of algorithms works by exploiting the fact that a causal Bayesian network implies the existence of a set of conditional independence statements between sets of domain variables. Algorithms in this class use the outcomes of a number of condic 2009 Facundo Bromberg and Dimitris Margaritis. B ROMBERG AND M ARGARITIS tional independences to constrain the set of possible structures consistent with these to a singleton (if possible) and infer that structure as the only possible one. As such they are called constraintbased or independence-based algorithms. In this paper we address open problems related to the latter class of algorithms. It is well-known that independence-based algorithms have several shortcomings. A major one has to do with the effect that unreliable independence information has on the their output. 
In general such independence information comes from two sources: (a) a domain expert that can provide his or her opinion on the validity of certain conditional independences among some of the variables, sometimes with a degree of confidence attached to them, and/or (b) statistical tests of independence, conducted on data gathered from the domain. As expert information is often costly and difficult to obtain, (b) is the most commonly used option in practice. A problem that occurs frequently however is that the data set available may be small. This may happen for various reasons: lack of subjects to observe (e.g., in medical domains), an expensive data-gathering process, privacy concerns and others. Unfortunately, the reliability of statistical tests significantly diminishes on small data sets. For example, Cochran (1954) recommends that Pearson’s χ 2 independence test be deemed unreliable if more than 20% of the cells of the test’s contingency table have an expected count of less than 5 data points. Unreliable tests, besides producing errors in the resulting causal model structure, may also produce cascading errors due to the way that independence-based algorithms work: their operation, including which test to evaluate next, typically depends on the outcomes of previous ones. Thus a single error in a statistical test can be propagated by the subsequent choices of tests to be performed by the algorithm, and finally when the edges are oriented. Therefore, an error in a previous test may have large (negative) consequences in the resulting structure, a property that is called instability in Spirtes et al. (2000). One possible method for addressing the effect of multiple errors in the construction of a causal model through multiple independence tests is the Bonferroni correction (Hochberg, 1988; Abdi, 2007), which works by dividing the type I error probability α of each test by the number of such tests evaluated during the entire execution of the causal learning algorithm. As a result, the collective type I error probability (of all tests evaluated) is α, that is, 0.05 typically. However, this may make the detection of true dependences harder, as now larger data sets would be required to reach the adjusted confidence threshold of each test. The types of adjustments that may be appropriate for each case to tests that may be dependent is an open problem and the subject of current research in statistics (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001; Storey, 2002). In this paper we present and evaluate a number of methods for increasing the reliability of independence tests for small data sets. A result of this is the improvement in reliability of independencebased causal discovery algorithms that use these data sets, as we demonstrate in our experiments. We model this setting as a knowledge base whose contents are propositions representing conditional independences that may contain errors. Our main insight is to recognize that the outcomes of independence tests are not themselves independent but are constrained by the outcomes of other tests through Pearl’s well-known properties of the conditional independence relation (Pearl, 1988; Dawid, 1979). These can therefore be seen as integrity constraints that can correct certain inconsistent test outcomes, choosing instead the outcome that can be inferred by tests that do not result in contradictions. We illustrate this by an example. 302 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Example 1. 
Consider an independence-based knowledge base that contains the following propositions, obtained through statistical tests on data. ({0}⊥ ⊥{1} | {2}) (1) ({0}⊥ {3} | {2}) ⊥ (2) ({0}⊥ ⊥{3} | {1, 2}) (3) where (X⊥ | Z) denotes conditional independence of the set of variables X with Y conditional ⊥Y on set Z, and (X⊥ Y | Z) denotes conditional dependence. Suppose that (3) is in fact wrong. Such ⊥ an error can be avoided if there exists a constraint involving these independence propositions. For example, suppose that we also know that the following rule holds in the domain (this is an instance of an application of the Contraction and Decomposition axioms, described later in Section 2): ({0}⊥ ⊥{1} | {2}) ∧ ({0}⊥ {3} | {2}) =⇒ ({0}⊥ {3} | {1, 2}). ⊥ ⊥ (4) Rule (4), together with independence proposition (1) and dependence proposition (2), contradict independence proposition (3), resulting in an inconsistent knowledge base. If Rule (4) and propositions (1) and (2) are accepted, then proposition (3) must be rejected (and its value reversed), correcting the error in this case. The framework presented in the rest of the paper provides a principled approach for resolving such inconsistencies. The situation described in the previous example, while simple, illustrates the general idea that we will use in the rest of the paper: the set of independences and dependences used in a causal discovery algorithm form a potentially inconsistent knowledge base, and making use of general rules, derived from axioms and theorems that we know hold in the domain, helps us correct certain outcomes of statistical tests. In this way we will be able to improve the reliability of causal discovery algorithms that use them to derive causal models. To accomplish this we use the framework of argumentation, which provides a sound and elegant way of resolving inconsistencies in such knowledge bases, including ones that contain independences. The rest of the paper is organized as follows. The next section introduces our notation and definitions. Section 3 presents the argumentation framework and its extension with preferences, and describes our approach for applying it to represent and reason in knowledge bases containing independence facts that may be inconsistent. Section 4 introduces the argumentative independence test, implemented by the top-down algorithm introduced in Section 5. We then present an approximation for the top-down algorithm in Section 6 that reduces its time complexity to polynomial. We experimentally evaluate our approach in Section 7, and conclude with a summary and possible directions of future research in Section 8. Most of the proofs are presented in detail in Appendices A and B, which contain proofs for the computability (termination) and the validity (no AIT test can return a dependence and an independence result at the same time) of AIT, respectively. Note that, as our main goal in this paper is to address the problem of robust causal learning and not necessarily to advance the theory of argumentation itself, our exposition in the rest of the paper is geared toward causality theorists and practitioners. As this community may be unfamiliar with the theory and methods of the argumentation framework, we have included a self-contained discussion that covers the basic definitions and theorems of argumentation theory in some detail. 303 B ROMBERG AND M ARGARITIS 2. 
Notation and Preliminaries In this work we denote random variables with capitals (e.g., X,Y, Z) and sets of variables with bold capitals (e.g., X, Y, Z). In particular, we denote by V = {1, . . . , n} the set of all n variables in the domain, naming the variables by their indices in V; for instance, we refer to the third variable in V simply by 3. We assume that all variables in the domain are discrete following a multinomial distribution or are continuous following a Gaussian distribution. We denote the data set by D and its size (number of data points) by N. We use the notation (X⊥ | Z) to denote that the variables in set ⊥Y X are (jointly) independent of those in Y conditional on the values of the variables in Z, for disjoint sets of variables X, Y, and Z, while (X⊥ Y | Z) denotes conditional dependence. For the sake of ⊥ readability, we slightly abuse this notation and use (X⊥ | Z) as shorthand for ({X}⊥ } | {Z}). ⊥Y ⊥{Y A Bayesian network (BN) is a directed graphical model which represents the joint probability distribution over V. Each node in the graph represents one of the random variables in the domain. The structure of the network implicitly represents a set of conditional independences on the domain variables. Given the structure of a BN, the set of independences implied by it can be identified by a process called d-separation (Pearl, 1988); the latter follows from the local Markov property that states that each node in the network is conditionally independent of all its non-descendants in the graph given its parents. All independences identified by d-separation are implied by the model structure. If, in addition, all remaining triplets (X, Y, Z) correspond to dependencies, we say that the BN is directed graph-isomorph (abbreviated DAG-isomorph) or simply causal (as defined by Pearl, 1988). The concept of DAG-isomorphism is equivalent to a property called Faithfulness in Spirtes et al. (2000). A graph G is said to be faithful to some distribution if exactly those independences that exist in the distribution and no others are returned by the process of d-separation on G. In this paper we assume Faithfulness. For learning the structure of the Bayesian network of a domain we make use of the PC algorithm (Spirtes et al., 2000), which is only able to correctly identify the structure under the assumption of causal sufficiency. We therefore also assume causal sufficiency. A domain is causally sufficient if it does not contain any hidden or latent variables. As mentioned above, independence-based algorithms operate by conducting a series of conditional independence queries. For these we assume that an independence-query oracle exists that is able to provide such information. This approach can be viewed as an instance of the statistical query oracle theory of Kearns and Vazirani (1994). In practice such an oracle does not exist, but can be implemented approximately by a statistical test evaluated on the data set (for example, this can be Pearson’s conditional independence χ2 (chi-square) test (Agresti, 2002), Wilk’s G2 test, a mutual information test etc.). In this work we used Wilk’s G2 test (Agresti, 2002). 
To determine conditional independence between two variables X and Y given a set Z from data, the statistical test G² (and many other independence tests based on hypothesis testing, for example, the χ² test) uses the values in the contingency table (a table containing the data point counts for each possible combination of the variables that participate in the test) to compute a test statistic. For a given value of the test statistic, the test then computes the likelihood of obtaining that or a more extreme value by chance under the null hypothesis, which in our case is that the two variables are conditionally independent. This likelihood, called the p-value of the test, is then returned. The p-value of a test equals the probability of falsely rejecting the null hypothesis (independence). Assuming that the p-value of a test is p(X,Y | Z), the statistical test concludes independence if and only if p(X,Y | Z) is greater than a threshold α, that is,

(X ⊥⊥ Y | Z) ⇐⇒ p(X,Y | Z) > α.

Common values in statistics for α are 0.05 and 0.01, corresponding to confidence thresholds (1 − α) of 0.95 and 0.99 respectively. The value 0.10 for α is also sometimes used, depending on the application, while values as low as 0.005 and 0.001 are sometimes used for structure learning.

The conditional independences and dependences of a domain are connected through a set of general rules, introduced in Pearl (1988) and shown boxed in Eq. (5):

(Symmetry)        (X ⊥⊥ Y | Z) ⇐⇒ (Y ⊥⊥ X | Z)
(Decomposition)   (X ⊥⊥ Y ∪ W | Z) =⇒ (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Z)
(Weak Union)      (X ⊥⊥ Y ∪ W | Z) =⇒ (X ⊥⊥ Y | Z ∪ W)
(Contraction)     (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Z ∪ Y) =⇒ (X ⊥⊥ Y ∪ W | Z)
(Intersection)    (X ⊥⊥ Y | Z ∪ W) ∧ (X ⊥⊥ W | Z ∪ Y) =⇒ (X ⊥⊥ Y ∪ W | Z)    (5)

These can be seen as constraints in a meta-space representing all possible independences in the domain. More specifically, let us imagine a meta-space of binary variables, each corresponding to the truth value of the independence of a triplet (X, Y | Z) (e.g., true for independence and false for dependence). Each point in this space corresponds to a conditional independence assignment to all possible triplets in the domain. In this conceptual space not all points are tenable; in particular, the set of rules of Eq. (5) constrains the truth values of the independences of related triplets. For domains for which there exists a faithful Bayesian network, a more relaxed set of properties holds, shown boxed in Eq. (6), where α, β, γ and δ correspond to single variables:

(Symmetry)           (X ⊥⊥ Y | Z) ⇐⇒ (Y ⊥⊥ X | Z)
(Composition)        (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Z) =⇒ (X ⊥⊥ Y ∪ W | Z)
(Decomposition)      (X ⊥⊥ Y ∪ W | Z) =⇒ (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Z)
(Intersection)       (X ⊥⊥ Y | Z ∪ W) ∧ (X ⊥⊥ W | Z ∪ Y) =⇒ (X ⊥⊥ Y ∪ W | Z)
(Weak Union)         (X ⊥⊥ Y ∪ W | Z) =⇒ (X ⊥⊥ Y | Z ∪ W)
(Contraction)        (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Z ∪ Y) =⇒ (X ⊥⊥ Y ∪ W | Z)
(Weak Transitivity)  (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ Y | Z ∪ γ) =⇒ (X ⊥⊥ γ | Z) ∨ (γ ⊥⊥ Y | Z)
(Chordality)         (α ⊥⊥ β | γ ∪ δ) ∧ (γ ⊥⊥ δ | α ∪ β) =⇒ (α ⊥⊥ β | γ) ∨ (α ⊥⊥ β | δ)    (6)

In both sets of axioms, the property of Intersection holds if the probability distribution of the domain is positive, meaning that every assignment to all variables in the domain has a non-zero probability. The properties of Eq. (6) were first introduced by Dawid (1979) in a slightly different form and independently re-discovered by Pearl and Paz (1985).
Note that the axioms of Eq. (5) are necessarily incomplete; Studený (1991) showed that there is no finite axiomatization of the conditional independence relation in general. The implication of this is that there may be some inconsistencies involving some set of independences and dependences that no method can detect and resolve. In the next section we describe the argumentation framework, which allows one to make beneficial use of these constraints. This is followed by its application to our problem of answering independence queries from knowledge bases that contain sets of potentially inconsistent independence propositions.

3. The Argumentation Framework

There exist two major approaches for reasoning with the information contained in inconsistent knowledge bases such as those containing independence statements that were described in the previous section. These two distinct approaches correspond to two different attitudes: one is to resolve the inconsistencies by removing a subset of propositions such that the resulting KB becomes consistent; this is called belief revision in the literature (Gärdenfors, 1992; Gärdenfors and Rott, 1995; Shapiro, 1998; Martins, 1992). A potential shortcoming (Shapiro, 1998) of belief revision stems from the fact that it removes propositions, which discards potentially valuable information. In addition, an erroneous modification of the KB (such as the removal of a proposition) may have unintended negative consequences if more propositions are later inserted in the KB. A second approach to inconsistent KBs is to allow inconsistencies but to use rules that may be contained in the KB to deduce which truth value of a queried proposition is "preferred" in some way. One instance of this approach is argumentation (Dung, 1995; Loui, 1987; Prakken, 1997; Prakken and Vreeswijk, 2002), which is a sound approach that allows inconsistencies but uses a proof procedure that is able to deduce (if possible) that one of the truth values of certain propositions is preferred over its negation. Argumentation is a reasoning model that belongs to the broader class of defeasible logics (Pollock, 1992; Prakken, 1997). Our approach uses the argumentation framework of Amgoud and Cayrol (2002), which considers preferences over arguments, extending Dung's more fundamental framework (Dung, 1995). Preference relations give an extra level of specificity for comparing arguments, allowing a more refined form of selection between conflicting propositions. Preference-based argumentation is presented in more detail in Section 3.2. We proceed now to describe the argumentation framework.

Definition 1. An argumentation framework is a pair ⟨A, R⟩, where A is a set of arguments and R is a binary relation representing a defeasibility relationship between arguments, that is, R ⊆ A × A. (a, b) ∈ R, or equivalently "a R b", reads: argument a defeats argument b. We also say that a and b are in conflict.

An example of the defeat relation R is logical defeat, which occurs when an argument contradicts another logically. The elements of the argumentation framework are not propositions but arguments. Given a potentially inconsistent knowledge base K = ⟨Σ, Ψ⟩ with a set of propositions Σ and a set of inference rules Ψ, arguments are defined formally as follows.

Definition 2. An argument over knowledge base ⟨Σ, Ψ⟩ is a pair (H, h) where H ⊆ Σ such that:
• H is consistent,
• H ⊢Ψ h,
• H is minimal (with respect to set inclusion).
H is called the support and h the conclusion or head of the argument.
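As an illustration of Definitions 1 and 2, a minimal encoding of propositions and arguments in code might look as follows (a sketch; the class names and fields are our own):

from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Prop:
    x: frozenset            # set X of variable indices
    y: frozenset            # set Y
    z: frozenset            # conditioning set Z
    independent: bool       # True for (X || Y | Z), False for dependence
    def negation(self):
        return Prop(self.x, self.y, self.z, not self.independent)

@dataclass(frozen=True)
class Argument:
    support: FrozenSet[Prop]  # H: consistent, minimal, entails the head
    head: Prop                # h: the conclusion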
In Definition 2, ⊢Ψ stands for classical logical inference over the set of inference rules Ψ. Intuitively, an argument (H, h) can be thought of as an "if-then" rule, that is, "if H then h." In inconsistent knowledge bases two arguments may contradict or defeat each other. The defeat relation is defined through the rebut and undercut relations, defined as follows.

Definition 3. Let (H1, h1), (H2, h2) be two arguments.
• (H1, h1) rebuts (H2, h2) iff h1 ≡ ¬h2.
• (H1, h1) undercuts (H2, h2) iff ∃h ∈ H2 such that h ≡ ¬h1.
If (H1, h1) rebuts or undercuts (H2, h2) we say that (H1, h1) defeats (H2, h2). (The symbol "≡" stands for logical equivalence.) In other words, (H1, h1) R (H2, h2) if and only if (H1, h1) rebuts or undercuts (H2, h2).

The objective of argumentation is to decide on the acceptability of a given argument. There are three possibilities: an argument can be accepted, rejected, or neither. This partitions the space of arguments A into three classes:
• The class Acc_R of acceptable arguments. Intuitively, these are the "good" arguments. In the case of an inconsistent knowledge base, these will be inferred from the knowledge base.
• The class Rej_R of rejected arguments. These are the arguments defeated by acceptable arguments. When applied to an inconsistent knowledge base, these will not be inferred from it.
• The class Ab_R of arguments in abeyance. These arguments are neither accepted nor rejected.

The semantics of acceptability proposed by Dung (1995) dictates that an argument should be accepted if it is not defeated, or if it is defended by acceptable arguments, that is, each of its defeaters is itself defeated by an acceptable argument. This is formalized in the following definitions.

Definition 4. Let ⟨A, R⟩ be an argumentation framework, and S ⊆ A. An argument a is defended by S if and only if ∀b, if (b R a) then ∃c ∈ S such that (c R b).

Dung characterizes the set of acceptable arguments by a monotonic function F, that is, F(S) ⊆ F(S ∪ T) for any sets of arguments S and T. Given a set of arguments S ⊆ A as input, F returns the set of all arguments defended by S:

Definition 5. Let S ⊆ A. Then F(S) = {a ∈ A | a is defended by S}.

Slightly overloading our notation, we define F(∅) to contain the set of arguments that are not defeated by any argument in the framework.

Definition 6. F(∅) = {a ∈ A | a is not defeated by any argument in A}.

Dung proved that the set of acceptable arguments is the least fix-point of F, that is, the smallest set S such that F(S) = S.

Theorem 7 (Dung 1995). Let ⟨A, R⟩ be an argumentation framework. The set of acceptable arguments Acc_R is the least fix-point of the function F.

Dung also showed that if the argumentation framework ⟨A, R⟩ is finitary, that is, for each argument a there are finitely many arguments that defeat a, the least fix-point of function F can be obtained by iterative application of F to the empty set. We can understand this intuitively: from our semantics of acceptability it follows that all arguments in F(∅) are accepted. Also, every argument in F(F(∅)) must be acceptable as well, since each of them is defended by acceptable arguments. This reasoning can be applied recursively until a fix-point is reached.
This happens when the arguments in S cannot be used to defend any other argument not in S, that is, no additional argument is accepted. This suggests a simple algorithm for computing the set of acceptable arguments. Algorithm 1 shows a recursive procedure for this, based on the above definition. The algorithm takes as input an argumentation framework ⟨A, R⟩ and the set S of arguments found acceptable so far, that is, S = ∅ initially.

Algorithm 1 Recursive computation of acceptable arguments: Acc_R = F(⟨A, R⟩, S)
1: S′ ← S ∪ {a ∈ A | a is defended by S}
2: if S′ = S then
3:   return S
4: else
5:   return F(⟨A, R⟩, S′)

Let us illustrate these ideas with an example.

Example 2. Let ⟨A, R⟩ be an argumentation framework defined by A = {a, b, c} and R = {(a, b), (b, c)}. The only argument that is not defeated is a, and therefore F(∅) = {a}. Argument b is defeated by the acceptable argument a, so b cannot be defended and is therefore rejected, that is, b ∈ Rej_R. Argument c, though defeated by b, is defended by (acceptable argument) a, which defeats b, so c is acceptable. The set of acceptable arguments is therefore Acc_R = {a, c} and the set of rejected arguments is Rej_R = {b}.

The bottom-up approach of Algorithm 1 has the disadvantage that it requires the computation of all acceptable arguments to determine the acceptability status of a single one. In practice, and in particular in the application of argumentation to independence tests, the entire set of acceptable arguments is rarely needed. An alternative is to take a top-down approach (Amgoud and Cayrol, 2002; Dung, 1995; Toni and Kakas, 1995; Kakas and Toni, 1999) that evaluates the acceptability of some input argument by evaluating (recursively) the acceptability of its attackers. Below we present an alternative algorithm, called the top-down algorithm, for deciding the acceptability of an input argument. This algorithm is a version of the dialog tree algorithm of Amgoud and Cayrol (2002), where details unnecessary for the current exposition are not shown. This algorithm is provably equivalent to Algorithm 1 (whenever it is given the same input it is guaranteed to produce the same output), but it is considerably more efficient (as shown later in Section 5.2). We sketch the algorithm here and show a concrete version using the preference-based argumentation framework in Section 3.2.

Given an input argument a, the top-down algorithm employs a goal-driven approach for answering whether a is accepted or not. Its operation is guided by the same acceptability semantics as those used for Algorithm 1. Let us denote the predicates A(a) ≡ (a ∈ Acc_R), R(a) ≡ (a ∈ Rej_R), and Ab(a) ≡ (a ∈ Ab_R). Then, the acceptability semantics are as follows:

(Acceptance) A node is accepted iff it has no defeaters or all its defeaters are rejected: A(a) ⇐⇒ ∀b ∈ defeaters(a), R(b).
(Rejection) A node is rejected iff at least one of its defeaters is accepted: R(a) ⇐⇒ ∃b ∈ defeaters(a), A(b).    (7)
(Abeyance) A node is in abeyance iff it is neither accepted nor rejected: Ab(a) ⇐⇒ ¬A(a) ∧ ¬R(a).

The logic of these equations can be easily implemented with a recursive algorithm, shown in Algorithm 2.

Algorithm 2 Top-down computation of acceptable arguments: top-down(⟨A, R⟩, a)
1: defeaters ← set of arguments in A that defeat a according to R
2: for d ∈ defeaters do
3:   if top-down(⟨A, R⟩, d) = accepted then
4:     return rejected
5: return accepted

The algorithm, given some input argument a, loops over all defeaters of a and responds rejected if any of its defeaters is accepted (line 4). If execution reaches the end of the loop at line 5, that means that none of its defeaters was accepted, and thus the algorithm accepts the input argument a. We can represent the execution of the top-down algorithm graphically by a tree that contains a at the root node, and all the defeaters of a node as its children. A leaf is reached when a node has no defeaters; in that case the loop contains no iterations and line 5 is reached trivially.
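For concreteness, both procedures can be rendered in a few lines of Python over the Prop/Argument encoding sketched earlier (our own rendering, not the authors' code; defeats(b, a) is assumed to implement the rebut/undercut relation of Definition 3):

def bottom_up(args, defeats, s=frozenset()):
    """Algorithm 1: least fix-point of F; returns all acceptable arguments."""
    defended = {a for a in args
                if all(any(defeats(c, b) for c in s)
                       for b in args if defeats(b, a))}
    s2 = s | defended
    return s if s2 == s else bottom_up(args, defeats, s2)

def top_down(args, defeats, a):
    """Algorithm 2: a is accepted iff none of its defeaters is accepted.
    May fail to terminate if `a` is in abeyance (see Lemma 8 below)."""
    for d in (b for b in args if defeats(b, a)):
        if top_down(args, defeats, d) == "accepted":
            return "rejected"
    return "accepted"

Called as bottom_up(A, defeats), the first function returns Acc_R in full; top_down answers a single membership query without materializing the entire set.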
Unfortunately, the top-down algorithm as shown in Algorithm 2 will fail to terminate when a node is in abeyance. This is clear from the following lemma (proved formally in Appendix A but reproduced here to aid our intuition).

Lemma 8. For every argument a, Ab(a) =⇒ ∃b ∈ attackers(a), Ab(b).

(An attacker is a type of defeater; it is explained in detail in the next section. For the following discussion the reader can substitute "attacker" with "defeater" in the lemma above.) From this lemma we can see that, if an argument is in abeyance, its set of defeaters must contain an argument in abeyance, and thus the recursive call of the top-down algorithm will never terminate, as there will always be another defeater in abeyance during each call. While there are ways to overcome this difficulty in the general case, we can prove that using the preference-based argumentation framework (presented later in the paper) and the particular preference relation introduced for deciding on independence tests (cf. Section 3.3), no argument can be in abeyance, and thus the top-down algorithm always terminates. A formal proof of this is presented later in Section 5.

We conclude the section by proving that the top-down algorithm is equivalent to the bottom-up algorithm of Algorithm 1; that is, given the same input as Algorithm 1 it is guaranteed to produce the same output. The proof assumes no argument is in abeyance. This assumption is satisfied for argumentation in independence knowledge bases (cf. Theorem 20, Section 5).

Theorem 9. Let a be an argument in the argumentation framework ⟨A, R⟩, and let F be the set of acceptable arguments output by Algorithm 1. Assuming a is not in abeyance,

top-down(⟨A, R⟩, a) = accepted ⇐⇒ a ∈ F    (8)
top-down(⟨A, R⟩, a) = rejected ⇐⇒ a ∉ F.    (9)

Proof. According to Theorem 7, the fix-point of function F returned by Algorithm 1 contains the set of arguments considered acceptable by the acceptability semantics of Dung. As the top-down algorithm is a straightforward implementation of Dung's acceptability semantics expressed by Eq. (7), the double implication of Eq. (8) must follow. To prove Eq. (9) we can prove the equivalent expression with both sides negated, that is, top-down(⟨A, R⟩, a) ≠ rejected ⇐⇒ a ∈ F. Since a is not in abeyance, if the top-down algorithm does not return rejected it must return accepted. The double implication is thus equivalent to Eq. (8), which was proved true.

3.1 Argumentation in Independence Knowledge Bases

We can now apply the argumentation framework to our problem of answering queries from knowledge bases that contain a number of potentially inconsistent independences and dependences and a set of rules that express relations among them.

Definition 10. An independence knowledge base (IKB) is a knowledge base ⟨Σ, Ψ⟩ such that its set of propositions Σ contains independence propositions of the form (X ⊥⊥ Y | Z) or (X ⊥̸⊥ Y | Z), for X, Y and Z disjoint subsets of V, and its set of inference rules Ψ is either the general set of axioms shown in Eq. (5) or the specific set of axioms shown in Eq. (6).
For IKBs, the set of arguments A is obtained in two steps. First, for each proposition σ ∈ Σ (independence or dependence) we add to A the argument ({σ}, σ). This is a valid argument according to Definition 2 since its support {σ} is (trivially) consistent, it (trivially) implies the head σ, and it is minimal (the pair (∅, σ) is not a valid argument since ∅ is equivalent to the proposition true, which does not entail σ in general). We call arguments of the form ({σ}, σ) propositional arguments, since they correspond to single propositions. The second step in the construction of the set of arguments A concerns rules. Based on the chosen set of axioms (general or directed) we construct an alternative, logically equivalent set of rules Ψ′, each member of which is single-headed, that is, contains a single proposition as the consequent, and decomposed, that is, each of its propositions is an independence statement over single variables (the last step is justified by the fact that typical algorithms for causal learning never produce nor require the evaluation of independence between sets). To construct the set of single-headed rules we consider, for each axiom, all possible contrapositive versions of it that have a single head. To illustrate, consider the Weak Transitivity axiom

(X ⊥⊥ Y | Z) ∧ (X ⊥⊥ Y | Z ∪ γ) =⇒ (X ⊥⊥ γ | Z) ∨ (γ ⊥⊥ Y | Z)

from which we obtain the following set of single-headed rules:

(X ⊥⊥ Y | Z) ∧ (X ⊥⊥ Y | Z ∪ γ) ∧ (X ⊥̸⊥ γ | Z) =⇒ (γ ⊥⊥ Y | Z)
(X ⊥⊥ Y | Z) ∧ (X ⊥⊥ Y | Z ∪ γ) ∧ (γ ⊥̸⊥ Y | Z) =⇒ (X ⊥⊥ γ | Z)
(X ⊥⊥ Y | Z ∪ γ) ∧ (γ ⊥̸⊥ Y | Z) ∧ (X ⊥̸⊥ γ | Z) =⇒ (X ⊥̸⊥ Y | Z)
(X ⊥⊥ Y | Z) ∧ (γ ⊥̸⊥ Y | Z) ∧ (X ⊥̸⊥ γ | Z) =⇒ (X ⊥̸⊥ Y | Z ∪ γ).

To obtain decomposed rules we apply the Decomposition axiom to every single-headed rule to produce only propositions over singletons. To illustrate, consider the Intersection axiom:

(X ⊥⊥ Y | Z ∪ W) ∧ (X ⊥⊥ W | Z ∪ Y) =⇒ (X ⊥⊥ Y ∪ W | Z).

In the above, the consequent coincides with the antecedent of the Decomposition axiom, and we thus replace the Intersection axiom with a decomposed version:

(X ⊥⊥ Y | Z ∪ W) ∧ (X ⊥⊥ W | Z ∪ Y) =⇒ (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Z).

Finally, note that it is easy to show that this rule is equivalent to two single-headed rules, one implying (X ⊥⊥ Y | Z) and the other implying (X ⊥⊥ W | Z).

The result of the application of the above procedures is a set of single-headed, decomposed rules Ψ′. We construct, for each such rule (Φ1 ∧ Φ2 ∧ . . . ∧ Φn =⇒ ϕ) ∈ Ψ′ and for each subset of Σ that matches exactly the set of antecedents, that is, each subset {ϕ1, ϕ2, . . . , ϕn} of Σ such that Φ1 ≡ ϕ1, Φ2 ≡ ϕ2, . . . , Φn ≡ ϕn, the argument ({ϕ1, ϕ2, . . . , ϕn}, ϕ), and add it to A. (This is equivalent to propositionalizing the set of rules, which are first-order: the rules of Eqs. (5) and (6) are universally quantified over all sets of variables, and thus so are the rules in Ψ′. As this may be expensive, exponential in the number of propositions, in practice it is not implemented in this way; instead, appropriate rules are matched on the fly during the argumentation inference process.)
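The contraposition step that produces single-headed rules can be written generically. The sketch below (names are ours; negate is assumed to flip a proposition between independence and dependence, e.g., Prop.negation above) reproduces, for instance, the four Weak Transitivity rules just listed:

def single_headed(antecedents, disjuncts, negate):
    """All single-headed contrapositives of an axiom
    A1 ^ ... ^ An => C1 v ... v Cm. In clause form the axiom is
    not A1 v ... v not An v C1 v ... v Cm; keeping one literal as the
    head and negating the rest into the body gives each rule."""
    literals = [negate(a) for a in antecedents] + list(disjuncts)
    rules = []
    for i, head in enumerate(literals):
        body = tuple(negate(l) for j, l in enumerate(literals) if j != i)
        rules.append((body, head))
    return rules

For example, single_headed([A, B], [C, D], negate) with A = (X ⊥⊥ Y | Z), B = (X ⊥⊥ Y | Z ∪ γ), C = (X ⊥⊥ γ | Z) and D = (γ ⊥⊥ Y | Z) returns exactly the four rules above; applying Decomposition afterwards, as described in the text, would then yield the decomposed set Ψ′.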
IKBs can be augmented with a set of preferences that allow one to take into account the reliability of each test when deciding on the truth value of independence queries. This is described in the next section.

3.2 Preference-based Argumentation Framework

Following Amgoud and Cayrol (2002), we now refine the argumentation framework of Dung (1995) for cases where it is possible to define a preference order Π over arguments.

Definition 11. A preference-based argumentation framework (PAF) is a triplet ⟨A, R, Π⟩ where A is a set of arguments, R ⊆ A × A is a binary relation representing a defeat relationship between pairs of arguments, and Π is a (partial or total) order over A.

For the case of inconsistent knowledge bases, the preference Π over arguments follows the preference π over their supports, that is, stronger support implies a stronger argument, where π is given as a partial or total order over sets of propositions. Formally:

Definition 12. Let K = ⟨Σ, Ψ⟩ be a knowledge base, π be a (partial or total) order on subsets of Σ, and (H, h), (H′, h′) two arguments over K. Argument (H, h) is π-preferred to (H′, h′) (denoted (H, h) ≻π (H′, h′)) if and only if H is preferred to H′ with respect to π.

In what follows we overload our notation by using ≻π to denote either the ordering over arguments or the ordering over their supports. An important sub-class of preference relations is the class of strict and transitive preference relations, defined as follows.

Definition 13. We say that preference relation ≻π over arguments is strict if the order of arguments induced by it is strict and total, that is, for every pair of arguments a and b, a ≻π b ⇐⇒ ¬(b ≻π a).

Definition 14. We say that preference relation ≻π over arguments is transitive if, for every three arguments a, b and c, (a ≻π b) ∧ (b ≻π c) =⇒ (a ≻π c).

The importance of the properties of strictness and transitivity will become clear later when we talk about the correctness of the argumentative independence test (defined later in Section 4). We now introduce the concept of the attack relation, a combination of the concepts of defeat and preference.

Definition 15. Let ⟨A, R, π⟩ be a PAF, and a, b ∈ A be two arguments. We say b attacks a if and only if b R a and ¬(a ≻π b).

We can see that the attack relation is a special case of the defeat relation, and therefore the same conclusions apply; in particular Theorem 7, which allows us to compute the set of acceptable arguments of a PAF using Algorithm 1 or Algorithm 2. In Sections 3.3 and 4 below, we apply these ideas to construct an approximation to the independence-query oracle that is more reliable than a statistical independence test.

3.3 Preference-based Argumentation in Independence Knowledge Bases

We now describe how to apply the preference-based argumentation framework of Section 3.2 to improve the reliability of conditional independence tests conducted on a (possibly small) data set. A preference-based argumentation framework has three components. The first two, namely A and R, are identical to the general argumentation framework. We now describe how to construct the third component, namely the preference order π over subsets H of Σ, in IKBs. We define it using a belief estimate ν(H) that all propositions in H are correct:

H ≻π H′ ⇐⇒ ν(H) > ν(H′) ∨ (ν(H) = ν(H′) ∧ f(H, H′)).    (10)

That is, H is preferred over H′ if and only if its belief of correctness is higher than that of H′ or, in the case that these beliefs are equal, we break the tie using predicate f. For that we require that

∀H, H′ ⊆ Σ such that H ≠ H′, f(H, H′) = ¬f(H′, H).    (11)

In addition, we require that f be transitive, that is, f(H, H′) ∧ f(H′, H″) =⇒ f(H, H″).
This implies that the preference relation ≻π is transitive, which is a necessary condition for proving a number of important theorems in Appendix A. In our implementation we resolved ties by assuming an arbitrary order of the variables in the domain, determined at the beginning of the algorithm and kept fixed during its entire execution. Based on this ordering, f(H, H′) resolved ties by the lexicographic order of the variables in H and H′. By this definition, our f is both non-commutative and transitive. Before we define ν(H) we first show that ≻π, as defined by Eqs. (10) and (11) and for any definition of ν(H), satisfies two important properties, namely strictness (Definition 13) and transitivity (Definition 14). We do this in the following two lemmas.

Lemma 16. The preference relation for independence knowledge bases defined by Equations (10) and (11) is strict.

Proof.
H ≻π H′
⇐⇒ ν(H) > ν(H′) ∨ (ν(H) = ν(H′) ∧ f(H, H′))    [by Eq. (10)]
⇐⇒ (ν(H) ≥ ν(H′)) ∧ (ν(H) > ν(H′) ∨ f(H, H′))    [distributivity of ∨ over ∧]
⇐⇒ ¬(ν(H′) > ν(H)) ∧ ¬(ν(H′) ≥ ν(H) ∧ f(H′, H))    [double negation and Eq. (11)]
⇐⇒ ¬(ν(H′) > ν(H) ∨ (ν(H′) ≥ ν(H) ∧ f(H′, H)))
⇐⇒ ¬((ν(H′) > ν(H) ∨ ν(H′) = ν(H)) ∧ (ν(H′) > ν(H) ∨ f(H′, H)))
⇐⇒ ¬(ν(H′) > ν(H) ∨ (ν(H′) = ν(H) ∧ f(H′, H)))    [common factor ν(H′) > ν(H)]
⇐⇒ ¬(H′ ≻π H)    [again by Eq. (10)]

Lemma 17. The preference relation defined by Equations (10) and (11) is transitive.

Proof.
(H ≻π J) ∧ (J ≻π K)
⇐⇒ (ν(H) > ν(J) ∨ (ν(H) = ν(J) ∧ f(H, J))) ∧ (ν(J) > ν(K) ∨ (ν(J) = ν(K) ∧ f(J, K)))    [by Eq. (10)]
⇐⇒ (ν(H) > ν(J) ∧ ν(J) > ν(K))    [Case A]
 ∨ (ν(H) > ν(J) ∧ ν(J) = ν(K) ∧ f(J, K))    [Case B]
 ∨ (ν(H) = ν(J) ∧ f(H, J) ∧ ν(J) > ν(K))    [Case C]
 ∨ (ν(H) = ν(J) ∧ f(H, J) ∧ ν(J) = ν(K) ∧ f(J, K)).    [Case D]
To complete the proof we show that each of the cases A, B, C and D implies H ≻π K.
(Case A) ν(H) > ν(J) ∧ ν(J) > ν(K) =⇒ ν(H) > ν(K) =⇒ H ≻π K.
(Case B) ν(H) > ν(J) ∧ ν(J) = ν(K) ∧ f(J, K) =⇒ ν(H) > ν(K) =⇒ H ≻π K.
(Case C) ν(H) = ν(J) ∧ f(H, J) ∧ ν(J) > ν(K) =⇒ ν(H) > ν(K) =⇒ H ≻π K.
(Case D) ν(H) = ν(J) ∧ f(H, J) ∧ ν(J) = ν(K) ∧ f(J, K) =⇒ ν(H) = ν(K) ∧ f(H, K) =⇒ H ≻π K, due to the transitivity of predicate f.

We now return to the computation of ν(H). We estimate the belief ν(H) that a set of propositions H is correct by assuming independence among these propositions. (The assumption of independence is a heuristic, made mainly due to the difficulty of determining the dependence between two or more statistical tests evaluated on the same data set. Other ways of defining the preference of a set of propositions are possible. The problem of dealing with multiple tests is an open problem and an area of active research in statistics; see Section 1 for a discussion.) Overloading notation and denoting by ν(h) the probability of an individual proposition h being correct, the probability of all elements in H being correct under this assumption of independence is

ν(H) = ∏_{h ∈ H} ν(h).    (12)

The belief that a proposition stating independence is correct can be computed in different ways, depending on the particular independence oracle chosen. In this paper we use Wilk's G² test, but the resulting belief can be easily adapted to any other independence oracle that produces p-values. We hope that the following discussion serves as a starting point for others to adapt it to other types of independence oracles. As discussed in Section 2, the p-value p(X,Y | Z) computed by this test is the probability of error in rejecting the null hypothesis (conditional independence in our case) and assuming that X and Y are dependent. Therefore, the probability that a test returning dependence is correct is

ν_D(X ⊥̸⊥ Y | Z) = 1 − p(X,Y | Z)

where the subscript D indicates that this expression is valid only for dependences.
Formally, the error of falsely rejecting the null hypothesis is called a type I error. To determine the preference of a test returning independence we can, in principle, use this procedure symmetrically: use the probability of error in falsely accepting the null hypothesis (again, conditional independence), called a type II error, which we denote by β(X,Y | Z). In this case we can define the preference of independence (X ⊥⊥ Y | Z) as the probability of correctly assuming independence,

ν_I(X ⊥⊥ Y | Z) = 1 − β(X,Y | Z)

where again the subscript I indicates that it is valid only for independences. Unfortunately, the value of β cannot be obtained without assumptions, because it requires the computation of the probability of the test statistic under the hypothesis of dependence, and there are typically an infinite number of dependent models. In statistical applications, the value of β is commonly approximated by assuming one particular dependence model, if prior knowledge about it is available. In the absence of such information, in this paper we estimate it using a heuristic function of the p-value, assuming the following heuristic constraints on β:

β(X,Y | Z) = 1                   if p(X,Y | Z) = 0
β(X,Y | Z) = α − α/(2 + |Z|)     if p(X,Y | Z) = 1
β(X,Y | Z) = α                   if p(X,Y | Z) = α.

The first constraint (for p(X,Y | Z) = 0) corresponds to the intuition that when the p-value of the test is close to 0, the test statistic is very far from its value under the model that assumes independence, and thus we would give more preference to the "dependence" decision. The intuition for the second case (p(X,Y | Z) = 1) is reversed: when the value of the statistic is very close to the expected one under independence, then independence is preferred. The value of the second case is tempered by the number of variables in the conditioning set. This reflects the practical consideration that, as the number 2 + |Z| of variables involved in the test increases for a fixed data set, the discriminatory power of the test diminishes. The third case causes the two functions ν_I and ν_D to intersect at p-value α. This is due to fairness: in the absence of non-propositional arguments (i.e., in the absence of inference rules in our knowledge base), the independence decisions of the argumentation framework should match those of the purely statistical test, that is, "dependence" if and only if (p-value ≤ α). If we instead chose a different intersection point, then the resulting change in the outcome of tests might have been due simply to a bias in the independence decision that favors dependence or independence, that is, equivalent to an arbitrary change of the threshold of the statistical test, and the comparison of the statistical test and the new test based on argumentation would not be a fair one. The remaining values of β are approximated by linear interpolation among the above points. The result is summarized in Fig. 1, which depicts the preference functions ν_D and ν_I with respect to the p-value of the corresponding statistical test.

[Figure 1: Preference functions ν_I(h) and ν_D(h) for statements of independence and dependence respectively, as functions of the p-value of test h.]
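Concretely, the two preference curves and the combination rule of Eq. (12) might look as follows (a sketch; the function names are ours, and the interpolation knots implement the three constraints on β above):

def nu_d(p_value):
    """Belief that a test returning 'dependence' is correct (= 1 - p)."""
    return 1.0 - p_value

def beta(p_value, alpha, z_size):
    """Heuristic type-II error: piecewise linear through the points
    (p=0, 1), (p=alpha, alpha) and (p=1, alpha - alpha/(2 + |Z|))."""
    if p_value <= alpha:
        return 1.0 + (alpha - 1.0) * (p_value / alpha)
    end = alpha - alpha / (2.0 + z_size)
    return alpha + (end - alpha) * (p_value - alpha) / (1.0 - alpha)

def nu_i(p_value, alpha, z_size):
    """Belief that a test returning 'independence' is correct (= 1 - beta)."""
    return 1.0 - beta(p_value, alpha, z_size)

def nu(beliefs):
    """Eq. (12): belief in a support set, assuming test independence."""
    result = 1.0
    for b in beliefs:
        result *= b
    return result

Note that nu_i(alpha, alpha, z) = nu_d(alpha) = 1 − α for any z, so in the absence of rules the argumentative decision coincides with the purely statistical one, as the fairness argument above requires.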
Let us illustrate how preference-based argumentation can be used to resolve the inconsistencies of Example 1.

Example 3. In Example 1 we considered an IKB with the following propositions and rule:

(0 ⊥⊥ 1 | 2)    (13)
(0 ⊥̸⊥ 3 | 2)    (14)
(0 ⊥⊥ 3 | {1, 2})    (15)
(0 ⊥⊥ 1 | 2) ∧ (0 ⊥̸⊥ 3 | 2) =⇒ (0 ⊥̸⊥ 3 | {1, 2}).    (16)

Following the IKB construction procedure described in the previous section, propositions (13), (14) and (15) correspond to the following arguments, respectively:

({(0 ⊥⊥ 1 | 2)}, (0 ⊥⊥ 1 | 2))
({(0 ⊥̸⊥ 3 | 2)}, (0 ⊥̸⊥ 3 | 2))
({(0 ⊥⊥ 3 | {1, 2})}, (0 ⊥⊥ 3 | {1, 2}))    (17)

while rule (16) corresponds to the argument

({(0 ⊥⊥ 1 | 2), (0 ⊥̸⊥ 3 | 2)}, (0 ⊥̸⊥ 3 | {1, 2})).    (18)

Let us extend this IKB with the following preference values for its propositions:

Pref[(0 ⊥⊥ 1 | 2)] = 0.8
Pref[(0 ⊥̸⊥ 3 | 2)] = 0.7
Pref[(0 ⊥⊥ 3 | {1, 2})] = 0.5.

According to Definition 12, the preference of each argument ({σ}, σ) is equal to the preference value of {σ}, which is equal to the preference of σ, as it contains only a single proposition. Thus,

Pref[({(0 ⊥⊥ 1 | 2)}, (0 ⊥⊥ 1 | 2))] = 0.8
Pref[({(0 ⊥̸⊥ 3 | 2)}, (0 ⊥̸⊥ 3 | 2))] = 0.7
Pref[({(0 ⊥⊥ 3 | {1, 2})}, (0 ⊥⊥ 3 | {1, 2}))] = 0.5.

The preference of argument (18) equals the preference of the set of its antecedents which, according to Eq. (12), is equal to the product of their individual preferences, that is,

Pref[({(0 ⊥⊥ 1 | 2), (0 ⊥̸⊥ 3 | 2)}, (0 ⊥̸⊥ 3 | {1, 2}))] = 0.8 × 0.7 = 0.56.

Proposition (15) and rule (16) contradict each other logically, that is, their corresponding arguments (17) and (18) defeat each other. However, argument (18) is not attacked, as its preference is 0.56, which is larger than 0.5, the preference of its defeater, argument (17). Since no other argument defeats (18), it is acceptable, and (17), being attacked by an acceptable argument, must be rejected. We therefore see that using preferences the inconsistency of Example 1 has been resolved in favor of rule (16).
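This resolution can be checked mechanically in a few lines (a sketch; the argument names and encoding are ours, and accepted is the recursion of Algorithm 2 restricted to this two-argument framework, over the attack relation of Definition 15):

# arg17: ({(0 ind 3 | {1,2})}, (0 ind 3 | {1,2})); arg18: rule (16)
# applied to propositions (13) and (14). The two heads are negations
# of each other, so the arguments mutually rebut.
prefs = {"arg17": 0.5,
         "arg18": 0.8 * 0.7}                       # = 0.56, per Eq. (12)
defeat = {("arg17", "arg18"), ("arg18", "arg17")}  # mutual rebuttal

def attacks(b, a):   # Definition 15: defeat not overridden by preference
    return (b, a) in defeat and not prefs[a] > prefs[b]

def accepted(a, seen=()):
    return all(not accepted(b, seen + (a,))
               for b in prefs if attacks(b, a) and b not in seen)

print(accepted("arg18"), accepted("arg17"))  # prints: True False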
Let us now illustrate the defend relation, that is, how an argument can be defended by some other argument. The example also illustrates an alternative resolution of the inconsistency of Example 1, this time in favor of the independence proposition (15).

Example 4. Let us extend the IKB of Example 3 with two additional independence propositions and an additional rule. The new propositions and their preferences are:

Pref[(0 ⊥⊥ 4 | {2, 3})] = 0.9
Pref[(0 ⊥⊥ 3 | {2, 4})] = 0.8

and the new rule is:

(0 ⊥⊥ 4 | {2, 3}) ∧ (0 ⊥⊥ 3 | {2, 4}) =⇒ (0 ⊥⊥ 3 | 2).

This rule is an instance of the Intersection axiom followed by Decomposition. The corresponding arguments and preferences are:

Pref[({(0 ⊥⊥ 4 | {2, 3})}, (0 ⊥⊥ 4 | {2, 3}))] = 0.9
Pref[({(0 ⊥⊥ 3 | {2, 4})}, (0 ⊥⊥ 3 | {2, 4}))] = 0.8

corresponding to the two propositions, and

Pref[({(0 ⊥⊥ 4 | {2, 3}), (0 ⊥⊥ 3 | {2, 4})}, (0 ⊥⊥ 3 | 2))] = 0.9 × 0.8 = 0.72    (19)

corresponding to the rule. As in Example 3, argument (17) is attacked by argument (18). Let us represent this graphically using an arrow from argument a to argument b to denote that a attacks b, that is,

Argument (18) −→ Argument (17).

If the IKB were as in Example 3, (18) would have been accepted and (17) would have been rejected. However, the additional argument (19) now defeats (undercuts) (18) by logically contradicting its antecedent (0 ⊥̸⊥ 3 | 2). Since the preference of (19), namely 0.72, is larger than that of (18), namely 0.56, (19) attacks (18). Therefore, (19) defends all arguments that are attacked by argument (18), and in particular (17). Graphically,

Argument (19) −→ Argument (18) −→ Argument (17).

Note that this is not sufficient for accepting (17), as it has not been proved that its defender (19) is itself acceptable. We leave the proof of this as an exercise for the reader.

4. The Argumentative Independence Test (AIT)

The preference-based argumentation framework described in the previous section provides a semantics for the acceptance of arguments consisting of independence propositions. However, what we need is a procedure for a test of independence that, given as input a triplet σ = (X, Y | Z), responds whether X is independent of or dependent on Y given Z. In other words, we need a semantics for the acceptance of propositions, not arguments. Let us consider the two propositions related to the input triplet σ = (X, Y | Z): proposition (σ = true), abbreviated σt, and proposition (σ = false), abbreviated σf, which correspond to independence (X ⊥⊥ Y | Z) and dependence (X ⊥̸⊥ Y | Z) of σ, respectively. The basic idea for deciding on the independence or dependence of input triplet σ is to define a semantics for the acceptance or rejection of propositions σt and σf based on the acceptance or rejection of their respective propositional arguments ({σt}, σt) and ({σf}, σf). Formally,

(X ⊥̸⊥ Y | Z) is accepted iff ({(X ⊥̸⊥ Y | Z)}, (X ⊥̸⊥ Y | Z)) is accepted, and
(X ⊥⊥ Y | Z) is accepted iff ({(X ⊥⊥ Y | Z)}, (X ⊥⊥ Y | Z)) is accepted.    (20)

Based on this semantics over propositions, we decide on the dependence or independence of triplet σ as follows:

σt = (X ⊥⊥ Y | Z) is accepted =⇒ (X ⊥⊥ Y | Z)
σf = (X ⊥̸⊥ Y | Z) is accepted =⇒ (X ⊥̸⊥ Y | Z).    (21)

We call the test that determines independence in this manner the Argumentative Independence Test or AIT. For the above semantics to be well defined, a triplet σ must be either independent or dependent, that is, not both or neither. For that, exactly one of the antecedents of the above implications of Eq. (21) must be true. Formally,

Theorem 18. For any input triplet σ = (X, Y | Z), the argumentative independence test (AIT) defined by Eqs. (20) and (21) produces a non-ambiguous decision, that is, it decides that σ evaluates to either independence or dependence, but not both or neither.

For that to happen, one and only one of its corresponding propositions σt or σf must be accepted. A necessary condition for this is given by the following theorem.

Theorem 19. Given a PAF ⟨A, R, π⟩ with a strict and transitive preference relation π, every propositional argument ({σt}, σt) ∈ A and its negation ({σf}, σf) satisfy

({σt}, σt) is accepted iff ({σf}, σf) is rejected.

The above theorem is not sufficient because the propositions may still be in abeyance, but this possibility is ruled out for strict preference relations by Theorem 20, presented in the next section. The formal proofs of Theorems 18, 19 and 20 are presented in Appendix B. We now illustrate the use of AIT with an example.

Example 5. We consider an extension of Example 3 to illustrate the use of the AIT to decide on the independence or dependence of input triplet (0, 3 | {1, 2}). According to Eq. (20) the decision depends on the status of the two propositional arguments:

({(0 ⊥̸⊥ 3 | {1, 2})}, (0 ⊥̸⊥ 3 | {1, 2})), and    (22)
({(0 ⊥⊥ 3 | {1, 2})}, (0 ⊥⊥ 3 | {1, 2})).    (23)
Argument (23) is equal to argument (17) of Example 3, which was proved to be rejected in that example. Therefore, according to Theorem 19, its negated propositional argument (22) must be accepted, and we can conclude that triplet (0, 3 | {1, 2}) corresponds to a dependence, that is, we conclude that (0 ⊥̸⊥ 3 | {1, 2}).

5. The Top-down AIT Algorithm

We now discuss in more detail the top-down algorithm that is used to implement the argumentative independence test, introduced in Section 3. We start by simplifying the recursion of Eq. (7) that determines the state (accepted, rejected, or in abeyance) of an argument a. We then explain the algorithm and analyze its computability (i.e., prove that its recursive execution is always finite) and its time complexity. To simplify the recursion of Eq. (7) we use the following theorem (proved in Appendix B).

Theorem 20. Let ⟨A, R, π⟩ be a PAF with a strict preference relation π. Then no argument a ∈ A is in abeyance.

This theorem reduces the number of states of each argument to two, that is, an argument can be either accepted or not accepted (rejected). We will use the name of the argument a to denote the predicate "a is accepted" and its negation ¬a to denote the predicate "a is rejected." With this notation, the above theorem, and the fact that we have extended the semantics of acceptability from the defeat to the attack relation (using preferences), the recursion of Eq. (7) can be expressed as follows:

a ⇐⇒ ∀b ∈ attackers(a), ¬b
¬a ⇐⇒ ∃b ∈ attackers(a), b

or, equivalently,

a ⇐⇒ ⋀_{b ∈ attackers(a)} ¬b
¬a ⇐⇒ ⋁_{b ∈ attackers(a)} b.

Finally, we notice that the second formula is logically equivalent to the first (simply negating both sides of the double implication recovers the first). Therefore, the Boolean value of the dialog tree for a can be computed by the simple expression

a ⇐⇒ ⋀_{b ∈ attackers(a)} ¬b.    (24)

To illustrate, consider an attacker b of a. If b is rejected, that is, ¬b, the conjunction on the right cannot be determined without examining the other attackers of a. Only when all attackers of a are known to be rejected can the value of a be determined, that is, accepted. Instead, if b is accepted, that is, b, the state of ¬b is false and the conjunction can be immediately evaluated to false, that is, a is rejected regardless of the acceptability of any other attackers.
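Eq. (24), together with the short-circuiting just described, is essentially a one-line recursion (a sketch; attackers_of is assumed to enumerate the attackers of an argument, and termination relies on attack sequences being finite, per Theorem 23 below):

def is_accepted(a, attackers_of):
    # `a` is accepted iff every attacker is rejected; Python's all()
    # stops at the first accepted attacker, exactly as in the text.
    return all(not is_accepted(b, attackers_of) for b in attackers_of(a))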
An iterative version of the top-down algorithm is shown in Algorithm 3. We assume that the algorithm can access a global PAF ⟨A, R, π⟩, with arguments in A defined over a knowledge base K = ⟨Σ, Ψ⟩. Given as input a triplet t = (X, Y | Z), if the algorithm returns true (false) then we conclude that t is independent (dependent). It starts by creating a root node u_true for the propositional argument U_true of proposition (t = true) (lines 1–6). According to Eqs. (20) and (21), the algorithm then decides true if U_true is accepted (line 22). Otherwise, the algorithm returns false (line 23). This is because in this case, according to Theorem 19, the negation of propositional argument U_true must be accepted.

Algorithm 3 independent(triplet t)
1: f_true ← proposition (t = true) /* Creates independence proposition (t = true). */
2: U_true ← ({f_true}, f_true)
3: u_true ← node for argument U_true
4: u_true.parent ← nil
5: u_true.STATE ← nil
6: fringe ← [u_true] /* Initialize with u_true (root). */
7: /* Create global rejected node, denoted by ρ. */
8: ρ ← node with no argument and state rejected
9: while fringe ≠ ∅ do
10:   u ← dequeue(fringe)
11:   attackers ← getAttackers(u.argument)
12:   if (attackers = ∅) then
13:     u.STATE ← accepted
14:     if sendMsg(ρ, u) = terminate then break
15:   attackers ← sort attackers in decreasing order of preference
16:   /* Enqueue attackers after decomposing them. */
17:   for each A ∈ attackers do
18:     a ← node for argument A
19:     a.parent ← u
20:     a.STATE ← nil
21:     enqueue a in fringe /* See details in text. */
22: if (u_true.STATE = accepted) then return true
23: if (u_true.STATE = rejected) then return false

Algorithm 4 sendMsg(Node c, Node p)
1: /* Try to evaluate node p given the new information in c.STATE. */
2: if p ≠ nil then
3:   if c.STATE = accepted then p.STATE ← rejected
4:   else if (∀ children q of p, q.STATE = rejected) then p.STATE ← accepted
5:   /* If p was successfully evaluated, try to evaluate its parent by sending a message upward. */
6:   if p.STATE ≠ nil then
7:     return sendMsg(p, p.parent)
8:   else
9:     return continue
10: else
11:   return terminate /* The root node has been evaluated. */

Algorithm 3 is an iterative version of a tree-traversal algorithm. It maintains a queue of the nodes that have not been expanded yet. A node is expanded when its children are added to the tree. In the algorithm, this is done in the loop of lines 17 to 21, which uses subroutine getAttackers of Algorithm 5 to obtain all attackers of an argument. This subroutine finds all attackers of the input argument a in a backward-chaining fashion; that is, given an argument a = (H, h), it searches for all rules in the knowledge base K whose consequent matches the negation of some proposition in the support H (undercutters), or the negation of its head h (rebutters).

Algorithm 5 Finds all attackers of input argument a in knowledge base K = ⟨Σ, Ψ⟩: getAttackers(a = (H, h))
1: attackers ← ∅
2: /* Get all undercutters or rebutters of a. */
3: for all propositions ϕ ∈ H ∪ {h} do
4:   /* Get all defeaters of proposition ϕ. */
5:   for all rules (Φ1 ∧ Φ2 ∧ . . . ∧ Φn =⇒ ¬ϕ) ∈ Ψ do
6:     /* Find all propositionalizations of the rule whose consequent matches ¬ϕ. */
7:     for all subsets {ϕ1, ϕ2, . . . , ϕn} of Σ s.t. Φ1 ≡ ϕ1, Φ2 ≡ ϕ2, . . . , Φn ≡ ϕn do
8:       d ← ({ϕ1, ϕ2, . . . , ϕn}, ¬ϕ) /* Create defeater. */
9:       /* Is the defeater an attacker? */
10:      if ¬(a ≻π d) then
11:        attackers ← attackers ∪ {d}
12: return attackers

Every node maintains a three-valued state variable STATE ∈ {nil, accepted, rejected}. The nil state denotes that the value of the node is not yet known, and a node is initialized to this state when it is added to the tree. The algorithm recurses down the dialog tree until a node is found that has no attackers (line 12). Such a node is accepted in line 13, that is, the conjunction of Eq. (24) is trivially true, and its STATE is propagated upward toward the root, to its parent, using subroutine sendMsg (Algorithm 4). Every time a node receives a message from a child, if the message is accepted, the node is rejected (line 3 of Algorithm 4); otherwise the node is accepted if all its children have been evaluated to rejected (line 4 of Algorithm 4). The subroutine sendMsg then proceeds recursively by forwarding a message to the parent whenever a node has been evaluated (line 7). If the root is reached and evaluated, the message is sent to its parent, which is nil.
In this case, the subroutine returns the special keyword terminate back to the caller, indicating that the root has been evaluated and thus the main algorithm (Algorithm 3) can terminate. The caller can be either the subroutine sendMsg, in which case it pushes the returned message up the method-calling stack, or the top-down algorithm at line 14, in which case its "while" loop is terminated.

An important part of the algorithm remains unspecified, namely the order in which the attackers of a node are explored in the tree (i.e., the priority with which nodes are enqueued in line 21). This affects the order of expansion of the nodes in the dialog tree. Possible orderings are depth-first, breadth-first and iterative deepening, as well as informed searches such as best-first when a heuristic is available. In our experiments we used iterative deepening because it combines the benefits of depth-first and breadth-first search, that is, small memory requirements on the same order as depth-first search (i.e., on the order of the maximum number of children a node can have) but also the advantage of finding the shallowest solution, like breadth-first search. We also used a heuristic for enqueuing the children of a node. According to iterative deepening, the position in the queue of the children of a node is specified relative to other nodes, but not relative to each other. We therefore specified the relative order of the children according to the value of the preference function: children with higher preference are enqueued first (line 15 of the top-down algorithm), and thus, according to iterative deepening, are dequeued first.

5.1 Computability of the Top-Down Algorithm

An open question is whether the top-down algorithm is computable, that is, whether it always terminates. In this section we prove that it is. To prove this we need to show that under certain general conditions the acceptability of an argument a can always be determined, that is, that the algorithm always terminates. This is proved by the following theorem.

Theorem 21. Given an arbitrary triplet t = (X, Y | Z), and a PAF ⟨A, R, π⟩ with a strict preference relation π, Algorithm 3 with input t over ⟨A, R, π⟩ terminates.

The proof consists of showing that the path from the root a to any leaf is always finite. For that, the concept of an attack sequence is needed.

Definition 22. An attack sequence is a sequence a1, a2, . . . , an of n arguments such that for every 2 ≤ i ≤ n, ai attacks ai−1.

By the manner in which the top-down algorithm constructs the dialog tree it is clear that any path from the root to a leaf is an attack sequence. It therefore suffices to show that any such sequence is finite. This is done by the following theorem.

Theorem 23. Every attack sequence a1, a2, . . . , an in a PAF ⟨A, R, π⟩ with strict π and finite A is finite.

Intuitively, if the preference relation is strict then an element can attack its predecessor in the sequence but not vice versa. Since the set of arguments A is finite, the only way for an attack sequence to be infinite is to contain a cycle. In that case, an argument would be attacking at least one of its predecessors, which cannot happen in a PAF with a strict preference relation. We present formal proofs of Theorems 21 and 23 in Appendix A. We thus arrive at the important conclusion that, under a strict preference function and a finite argument set, the state of any argument is computable.
As we showed in Section 3.3, the preference function for independence knowledge bases is strict, and thus the computability of the top-down algorithm is guaranteed.

5.2 Computational Complexity of the Top-Down Algorithm

Since Algorithm 3 is a tree-traversal algorithm, its time complexity can be obtained by techniques contained in standard algorithmic texts, for example, Cormen et al. (2001). The actual performance depends on the tree-exploration procedure. In our case we used iterative deepening due to its linear memory requirements in d, where d is the smallest depth at which the algorithm terminates. Iterative deepening has a worst-case time complexity of O(b^d), where b is an upper bound on the branching factor of the dialog tree. Therefore, for a constant b > 1 the execution time is exponential in d in the worst case. Furthermore, for the case of independence tests, b itself may also be exponential in n (the number of variables in the domain). This is because the inference rules of Eqs. (5) and (6) are universally quantified, and therefore their propositionalization (lines 7–11 of Algorithm 5) may result in an exponential number of rules with the same consequent (attackers). A tighter bound may be possible but, lacking such a bound, we introduce in the next section an approximation to the top-down algorithm, which reduces the running time to polynomial. As we show in our experiments, the use of this approximation does not appreciably affect the accuracy improvement due to argumentation.

6. The Approximate Top-Down AIT Algorithm

As the top-down algorithm has a theoretically exponential running time in the worst case, we hereby present a practical polynomial-time approximation of it. We make use of two approximations: (a) to address the exponential behavior due to the depth of the dialog tree, we apply a cutoff depth d to the process of iterative deepening; and (b) to address the potentially exponential size of the branching factor b (which equals the maximum number of defeaters of any argument appearing in the dialog tree), we limit the number of defeaters of each node, thus bounding the number of its attackers/children, to a polynomial function of n (the domain size) during the propositionalization process of Algorithm 5 (lines 7–11). Let (H, h) be an argument and let ϕ ∈ H ∪ {h} be one of its propositions, as in line 3 of Algorithm 5. The set of attackers Σϕ of (H, h) consists of all rules {ϕ1 ∧ ϕ2 ∧ . . . ∧ ϕk =⇒ ¬ϕ} over Σ, for some constant upper bound k on the size of their support. If ϕ = (X, Y | Z) and ϕi = (Xi, Yi | Zi) for all 1 ≤ i ≤ k, then our approximation generates and uses a subset of Σϕ in the dialog tree such that

|X| − c ≤ |Xi| ≤ |X| + c
|Y| − c ≤ |Yi| ≤ |Y| + c
|Z| − c ≤ |Zi| ≤ |Z| + c    (25)

where | · | denotes set cardinality, and c is a user-specified integer parameter that defines the approximation. We call this the approximate top-down algorithm.
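The filter of Eq. (25) is straightforward to state in code (a sketch; propositions are encoded as (X, Y, Z) triples of sets, and the name within_bounds is ours):

def within_bounds(prop, cand, c=1):
    """Keep a candidate attacker proposition (Xi, Yi | Zi) only if each
    of its three sets is within +/- c of the corresponding set of
    (X, Y | Z), per Eq. (25)."""
    for s, si in zip(prop, cand):   # pairs (X, Xi), (Y, Yi), (Z, Zi)
        if not (len(s) - c <= len(si) <= len(s) + c):
            return False
    return True

# For example, with c = 1:
assert within_bounds(({0}, {3}, {1, 2}), ({0, 4}, {3}, {1}))
assert not within_bounds(({0}, {3}, {1, 2}), ({0, 4, 5, 6}, {3}, {1}))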
The computational complexity of the approximate top-down algorithm is polynomial in n, as shown in the next section.

6.1 Test Complexity of the Approximate Top-Down Algorithm

In this section we prove that the number of statistical tests required by the approximate top-down algorithm is polynomial in n. As described in the previous section, the approximate algorithm generates a bounded number of attackers for each proposition in the argument corresponding to some node in the dialog tree. A bound on the number of possible attackers follows from the approximation of Eq. (25). These equations dictate that the size of each possible set Xi in some proposition (Xi, Yi | Zi) of some attacker of proposition (X, Y | Z) is between |X| − c and |X| + c (inclusive). As the number of elements that can be members of Xi is bounded by n (the domain size), this produces at most n^(2c+1) possible instantiations for set Xi. Similarly, the number of possible instantiations for Yi and Zi is also n^(2c+1). Therefore, an upper bound on the number of matches to some proposition in the antecedent of an attacking rule is O(n^(6c+3)) for some constant c. As there are r rules in the rule set and up to k propositions in each rule, for some constants r and k (for example, r = 5 and k = 3 for Eq. (5), and r = 8 and k = 4 for Eq. (6)), an upper bound on the number of children of a node in the dialog tree is O(rkn^(6c+3)), and thus an upper bound on the number of nodes in a dialog tree of depth d is O((rk)^d n^(d(6c+3))). As we demonstrate in our experiments, this is a rather loose upper bound and the performance of the approximate top-down algorithm is reasonable in practice, but it does serve to show that the theoretical worst-case performance is polynomial in n. In the experiments shown in the next section we used c = 1 and d = 3.

7. Experimental Results

We conducted experiments on sampled and real-world data sets for the purpose of (a) evaluating the accuracy improvement of the argumentative test (both the exact and approximate versions) over its statistical counterpart; (b) demonstrating the performance improvements that can be achieved by the approximate version compared to its exact counterpart, without significant reduction in the accuracy improvement; and (c) evaluating the improvements that result from the use of the argumentative framework for causal discovery. We address these issues below.

7.1 Comparative Evaluation of Bottom-Up, Exact Top-Down, and Approximate Top-Down Argumentative Tests

In this section we demonstrate that the argumentation approach, implemented either by the (exact) bottom-up algorithm or the exact top-down algorithm (Algorithm 3), improves the accuracy of independence tests on small data sets. We also show that the approximate top-down algorithm (see Section 6) has accuracy improvements similar to those of its exact counterpart but significantly better execution times (by orders of magnitude), which make it more practical and usable for larger domains. As the output of the bottom-up algorithm is guaranteed to be equal to that of the exact top-down algorithm by Theorem 9 of Section 3, we omit accuracy results for the bottom-up algorithm here. As the exact algorithm is impractical for large domains, for the present comparison we sampled data sets from two randomly generated Bayesian networks with n = 8 nodes. The networks were generated using BNGenerator (Ide et al., 2002), a publicly available Java package, with maximum degree per node τ equal to 3 and 7, to evaluate the performance in sparsely as well as densely connected domains. A large data set D was sampled from each network and our experiments were conducted on subsets of it containing an increasing number of data points N. This was done in order to assess the accuracy under varying conditions of reliability, as the reliability of a test varies (typically increases) with the amount of data available.
To reduce variance, each experiment was repeated for ten data subsets of equal size, obtained by permuting the data points of D randomly and using the first N of them as input to our algorithms. We first compare the accuracy of argumentative tests versus their purely statistical counterparts (i.e., the G² test) on several data sets sampled from randomly generated Bayesian networks. Sampled-data experiments have the advantage of a more precise estimation of the accuracy, since the underlying model is known. We present experiments for two versions of the exact top-down argumentative test, one using Pearl's general axioms of Eq. (5), denoted AITt-G, and another that uses Pearl's "directed" axioms of Eq. (6), denoted AITt-D, as well as the two corresponding versions of the approximate top-down argumentative test, denoted AITt-G (approx.) and AITt-D (approx.) respectively. We abbreviate purely statistical independence tests as SIT. We report the estimated accuracy, which, for each data set, is calculated by comparing the result of a number of conditional independence tests (SITs or AITs) on data with the true value of independence, computed by querying the underlying model for the conditional independence of the same test using d-separation. Since the number of possible tests is exponential, we estimated the independence accuracy by randomly sampling a set T of 1,000 triplets (X, Y, Z), evenly distributed among all possible conditioning set sizes m ∈ {0, . . . , n − 2}, that is, 1000/(n − 1) tests for each m. The independence or dependence values of the triplets (in the true, underlying model) were not controlled, but left to be decided randomly. This resulted in a non-uniform distribution of dependences and independences. For instance, in the experiments shown next (n = 8, τ = 3, 7), the average proportion of independences vs. dependences was 36.6% to 63.4% for τ = 3, and 11.4% to 88.6% for τ = 7. Denoting a triplet in T by t, by I_true(t) the result of a test on t performed on the underlying model, and by I_data-Y(t) the result of performing a test of type Y on t on data, for Y equal to SIT, AITt-G, AITt-D, AITt-G (approx.), or AITt-D (approx.), the estimated accuracy of test type Y is defined as

acc_Y^data = (1/|T|) · |{t ∈ T | I_data-Y(t) = I_true(t)}|.
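In code, this estimate is simply the agreement rate between a test and d-separation on the sampled triplets (a sketch with names of our choosing):

def estimated_accuracy(triplets, test_on_data, true_independence):
    """acc_Y: fraction of sampled triplets on which test Y, run on
    data, agrees with the true model (d-separation)."""
    agree = sum(1 for t in triplets
                if test_on_data(t) == true_independence(t))
    return agree / len(triplets)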
Figure 2: Accuracy comparison of statistical tests (SIT) vs. exact and approximate argumentative tests for domain size n = 8 and maximum degree per node τ = 3, 7. The histograms show the absolute value of the accuracy while the line curves show the difference between SIT and the argumentative tests; 95% confidence intervals are also shown for the line graphs. Top row: general axioms. Bottom row: directed axioms.

Figure 2 (top row) shows a comparison of the SIT with the exact and approximate top-down argumentative tests over the general axioms for data sets with an increasing number of data points. The figure shows two plots, for τ = 3 and τ = 7, of the mean values (over runs for ten different data sets) of the accuracies acc^data of SIT, AIT_t-G, and ÃIT_t-G (histograms) and the difference between the accuracies of the AIT tests and the statistical one (line graphs) for various data set sizes N. A positive value of the difference corresponds to an improvement of the argumentative test over SIT. The plots also show the statistical significance of this difference with 95% confidence intervals (error bars), that is, the interval around the mean value that has a 0.95 probability of containing the true difference. The figure demonstrates that there exist modest but consistent and statistically significant improvements in the accuracy of both the exact and approximate argumentative tests over the statistical test. We can observe improvements over the entire range of data set sizes in both cases, with maximum improvements of up to 9% and 6% for the exact and approximate cases respectively (at τ = 3 and N = 600).

In certain situations where the experimenter knows that the underlying distribution belongs to the class of Bayesian networks, it is appropriate to use the specific axioms of Eq. (6) instead of the general axioms of Eq. (5). The bottom row of Figure 2 presents the same comparison as the top row but for the exact and approximate argumentative tests AIT_t-D and ÃIT_t-D that use the directed axioms instead of the general ones. As in the case of AIT using the general axioms, we can observe statistically significant improvements over the entire range of data set sizes in both cases. In this case, however, the improvements are larger, with maximum increases in the accuracy of the exact and approximate tests of up to 13% and 9% respectively (again for τ = 3 and N = 600).
Figure 3: Accuracy comparison of SIT vs. exact (AIT_t-G) and approximate (ÃIT_t-G) argumentative tests over the general axioms for increasing conditioning set sizes. The six plots correspond to maximum degrees per node τ = 3, 7, and data set sizes N = 160, 900 and 5000.
Figure 4: Same as Figure 3 but for AIT using the directed axioms instead of the general ones.

We also evaluated the accuracy of these tests for increasing conditioning set sizes. Figures 3 and 4 show a comparison of the SIT with the exact and approximate top-down argumentative tests using the general and directed axioms respectively, with accuracies over increasing conditioning set sizes for data set sizes N = 160, 900, and 5000 (different rows). We can observe statistically significant improvements over the entire range of conditioning set sizes in all twelve graphs. Except in one case (directed axioms, N = 5000, τ = 3), all graphs show an upward trend in accuracy for increasing conditioning set size, representing a positive impact of the argumentative approach that increases with a decrease in test reliability, that is, with increasing conditioning set size.

We also compared the execution times of the bottom-up, exact top-down, and approximate top-down algorithms on the same data sets. To run the bottom-up algorithm we generated the set of all possible propositional arguments, that is, arguments of the form ({σ}, σ), by iterating over all possible triplets (X, Y | Z), and inserted them in the knowledge base together with their preference, as described in Section 3.1. Similarly, for the set of axioms that we used in each case, that is, either the general (Eq. (5)) or the specific ones (Eq. (6)), we iterated over all possible matches of each rule, inserting the corresponding (single-headed and decomposed) instantiated rule in the knowledge base, again together with its preference. The reason for including all propositional and rule-based arguments in our IKB is to allow the argumentation framework to consider all possible arguments in favor of or against an independence query (see the sketch below).

We compared the bottom-up algorithm AIT_b, the exact top-down algorithm AIT_t, and the approximate top-down algorithm ÃIT_t. For this, we measured the time it takes to discover the structure of a Bayesian network using three versions of the PC algorithm (Spirtes et al., 2000), each using one of the three argumentative tests AIT_b, AIT_t, or ÃIT_t to conduct the independence tests. As usual, we consider two versions of each test, one that uses the general axioms of Eq. (5), that is, AIT_b-G, AIT_t-G, and ÃIT_t-G, respectively, and one that uses the specific axioms of Eq. (6) (applicable to Bayesian networks), that is, AIT_b-D, AIT_t-D, and ÃIT_t-D, respectively.
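The exhaustive IKB construction described above can be sketched as follows; the encoding is ours (restricted here to singleton X and Y for brevity), and `preference` is an assumed callable (e.g., derived from the statistical test's p-value), not the paper's API.

```python
from itertools import combinations

def all_triplets(variables):
    """Enumerate triplets (X, Y | Z) with singleton X and Y, as used
    for the propositional arguments ({sigma}, sigma)."""
    for x, y in combinations(variables, 2):
        rest = [v for v in variables if v not in (x, y)]
        for m in range(len(rest) + 1):
            for z in combinations(rest, m):
                yield x, y, frozenset(z)

def build_ikb(variables, preference):
    """Sketch of the bottom-up IKB construction: one argument for each
    proposition sigma and one for its negation, each inserted with its
    preference.  Rule-based arguments would be added analogously by
    iterating over all matches of each axiom."""
    ikb = []
    for x, y, z in all_triplets(variables):
        for value in (True, False):  # sigma and its negation
            sigma = (x, y, z, value)
            ikb.append((({sigma}, sigma), preference(sigma)))
    return ikb
```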
The data sets used are the same as the ones used in the accuracy comparisons above. Figure 5 plots the execution times of argumentative tests AIT_b-G vs. AIT_t-G vs. ÃIT_t-G (top row) and AIT_b-D vs. AIT_t-D vs. ÃIT_t-D (bottom row) for tests that were conducted by the PC algorithm while learning the structure. Note that both the x- and y-axes are plotted in log-scale. We can observe improvements in the execution time of the exact top-down algorithm over that of the bottom-up algorithm of an order of magnitude over the entire range of data set sizes in all four plots. We can also see an improvement of a similar order between the exact and approximate top-down argumentative algorithms. For instance, for the general axioms and τ = 3 (top-left plot), the execution time for N = 5000 is 2749 seconds for the bottom-up against 107 seconds for the exact top-down and 15 seconds for the approximate top-down algorithm. We see even more pronounced execution-time improvements when using the directed axioms (bottom row of Fig. 5).

The execution-time results demonstrate that the exact top-down algorithm performs significantly better than the bottom-up algorithm, while producing the exact same output (according to Theorem 9 of Section 3). This implies a clear advantage of using the top-down over the bottom-up algorithm. Furthermore, we also saw that the approximate top-down algorithm performs similarly in terms of accuracy improvement while having polynomial worst-case execution time and, in practice, being several orders of magnitude faster than the exact top-down algorithm, which is exponential in the worst case. As the next two sections continue our evaluation on domains significantly larger than the n = 8 variables examined here, it would be difficult or impractical to employ the exact algorithms. For these reasons, in the following experiments we use the more practical approximate algorithm, which can be applied to larger domains.

7.2 Causal Discovery in Larger Domains

We also conducted experiments that demonstrate the performance of the approximate top-down algorithm by (a) showing its applicability to large domains, and (b) demonstrating positive improvements in the accuracy of argumentative tests on learning the structure of Bayesian networks, the main problem faced by causal discovery algorithms.
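The structure-learning experiments plug each independence test into the PC algorithm. The following sketch is our simplification of PC's skeleton phase only (Spirtes et al., 2000); edge orientation and the usual optimizations are omitted, and `indep_test` is whichever test (SIT or an argumentative variant) is being evaluated.

```python
from itertools import combinations

def pc_skeleton(variables, indep_test):
    """Simplified sketch of the skeleton phase of the PC algorithm with
    a pluggable conditional independence test; indep_test(x, y, z)
    returns True when X is independent of Y given Z."""
    adj = {v: set(variables) - {v} for v in variables}
    m = 0
    while any(len(adj[v]) - 1 >= m for v in variables):
        for x in variables:
            for y in list(adj[x]):
                # look for a separating set of size m among X's other neighbors
                for z in combinations(adj[x] - {y}, m):
                    if indep_test(x, y, frozenset(z)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
        m += 1
    return adj
```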
In the following experiments we used the PC algorithm.

Figure 5: Execution time comparison for the PC algorithm when it uses the bottom-up, exact top-down, and approximate top-down argumentative tests to learn the structure of a Bayesian network from data sampled from Bayesian models with domain size n = 8 and maximum degrees τ = 3, 7. The bars show the absolute value of the running time using a logarithmic scale. Top row: general axioms. Bottom row: directed axioms.

We compared the true structure of the underlying model to the structure output by the PC algorithm when it uses SITs as independence tests, denoted PC-SIT, and to its output when it uses argumentative independence tests with the directed axioms, denoted PC-ÃIT_t-D. We evaluated the resulting networks by their ability to accurately represent the true independences in the domain, calculated by comparing the results (true or false) of a number of conditional tests conducted using d-separation on the output networks (PC-SIT or PC-ÃIT_t-D). Denoting by T this set of 2,000 triplets, by t ∈ T a triplet, by I_true(t) the result of a test performed on the underlying model, and by I_PC-Y(t) the result of performing a d-separation test on the network output by the PC algorithm using the Y test, for Y equal to SIT or ÃIT_t-D, the estimated accuracy is defined as

$$\mathrm{acc}^{\mathrm{PC}}_{Y} = \frac{1}{|T|}\,\Big|\big\{\, t \in T \;:\; I_{\mathrm{PC}\text{-}Y}(t) = I_{\mathrm{true}}(t) \,\big\}\Big|. \tag{26}$$

We considered data sampled from randomly generated Bayesian networks of size n = 24 and maximum degrees τ = 3, 7. For each network we sampled ten data sets and, for each data set, we conducted experiments on subsets of D containing an increasing number of data points.
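Eq. (26) is a direct agreement count; a minimal sketch, assuming d-separation oracles for the learned and true networks (callables of our own naming):

```python
def structure_accuracy(dsep_learned, dsep_true, triplets):
    """Eq. (26): fraction of sampled triplets on which d-separation in
    the PC output network agrees with d-separation in the true model.
    Both arguments are assumed callables (x, y, z) -> bool."""
    agree = sum(dsep_learned(x, y, z) == dsep_true(x, y, z)
                for (x, y, z) in triplets)
    return agree / len(triplets)
```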
We report the average over the ten data sets of the estimated accuracy calculated using Eq. (26), for Y = SIT or ÃIT_t-D, as well as the difference between the average accuracies, including the 95% confidence interval for the difference.

Figure 6: Comparison of statistical tests (SIT) vs. approximate argumentative tests on the directed axioms (ÃIT_t-D) for data sets sampled from Bayesian models for domain size n = 24 and maximum degrees τ = 3, 7.

Figure 7: Execution times for the PC algorithm using the approximate argumentative test on the directed axioms (ÃIT_t-D) on data sets sampled from Bayesian models for domain size n = 24 and maximum degrees τ = 3, 7. For the approximate AIT test we limited the depth of the dialog tree to 3 and its branching factor as described in Section 6.

Figure 6 shows a comparison of the argumentative test ÃIT_t-D using the directed axioms with the corresponding SIT. The figure shows two plots, for different values of τ, of the mean values (over runs for ten different data sets) of acc^PC for SIT and ÃIT_t-D (histograms), the difference between these averages (line graph), and the 95% confidence intervals for the difference (error bars), for different data set sizes N. As usual, a positive value of the difference corresponds to an improvement of ÃIT_t-D over SIT. As in practically all experiments so far, we observe statistically significant improvements over the entire range of data set sizes, with maximum improvements of up to 20% (for τ = 3, N = 25000, and for τ = 7, N = 900).

The corresponding execution times for the entire PC algorithm are shown in Fig. 7. We can make two observations from this graph. One, the cost is significantly lower for sparse domains, which benefits real-world application domains that are sparse. The second observation is that the execution time scales linearly with the number of data points; this exhibits the same behavior as the use of a SIT test in PC, as each test needs to scan the data set once to compute the contingency table and relevant test statistics.

In summary, these results demonstrate that the approximate argumentative test is practical for larger domains and can result in positive, statistically significant accuracy improvements when used for causal discovery. However, the cost of AIT for large data sets, although not prohibitive, can be non-negligible. The accuracy benefits of AIT vs. a SIT must therefore be carefully weighed against the user's ability to expend the extra computation. Note that the practicality of the approximate algorithm also depends on the parameters used (the cutoff depth of iterative deepening and the branching-factor limit; see Section 6); different parameter values or alternative ways of limiting the size of the dialog tree may be needed for even larger domains.

7.3 Real-world and Benchmark Data Experiments

While the sampled data set studies of the previous section have the advantage of a more controlled and systematic study of the performance of the algorithms, experiments on real-world data are necessary for a more realistic assessment. In this section we present experiments on a number of real-world and benchmark data sets obtained from the UCI machine learning repository (Newman et al., 1998) and the UCI KDD archive (Hettich and Bay, 1999).
As in the sampled-data case of the previous section, for each data set D we conducted experiments on subsets of D containing an increasing number of data points N, to assess the performance of the independence tests under varying conditions of reliability. Again, to reduce variance we repeated each experiment ten times, each time choosing a different randomly selected data subset of equal size. Because for real-world data sets the underlying model is unknown, we could only be sure that the general axioms of Eq. (5) apply; we therefore used only these axioms in this section. Also, because some of the data sets have much larger domains (e.g., the alarm data set contains 37 variables), and given the exponential nature of the exact algorithms, we could only perform experiments for the approximate version of the argumentative test. For these reasons, in the following experiments we only report the accuracy of ÃIT_t-G, the approximate argumentative independence test defined over the general axioms.

Unfortunately, for real-world data the underlying model is typically unknown and therefore it is impossible to know the true value of any independence. We therefore approximate it by a statistical test on the entire data set, and limit the size of the data subsets that we use to at most a third of the size of the entire data set. This corresponds to the hypothetical scenario in which a much smaller data set is available to the researcher, allowing us to evaluate the improvement of argumentation under these more challenging situations. Again, as in the previous two sections, for comparison we sampled 2,000 triplets and calculated the accuracy as the fraction of tests that were correct, where for the true values of independences and dependences we used the method just described.
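The proxy ground truth is simply a SIT run on the full data set. For concreteness, here is a sketch of a standard G² conditional independence test (a common textbook formulation; the paper's implementation details may differ):

```python
import numpy as np
from scipy.stats import chi2

def g2_independent(data, x, y, z, alpha=0.05):
    """Standard G^2 conditional independence test on integer-coded
    data (rows = data points, columns = variables).  Returns True
    when the test does not reject independence of X and Y given Z."""
    sub = data[:, [x, y, *z]]
    strata = {}
    for row in sub:
        strata.setdefault(tuple(row[2:]), []).append((row[0], row[1]))
    g2, dof = 0.0, 0
    for pairs in strata.values():           # one contingency table per Z-config
        xs = sorted({p[0] for p in pairs})
        ys = sorted({p[1] for p in pairs})
        n = np.zeros((len(xs), len(ys)))
        for a, b in pairs:
            n[xs.index(a), ys.index(b)] += 1
        expected = np.outer(n.sum(1), n.sum(0)) / n.sum()
        mask = n > 0                         # 0 * log(0) contributes nothing
        g2 += 2.0 * (n[mask] * np.log(n[mask] / expected[mask])).sum()
        dof += (len(xs) - 1) * (len(ys) - 1)
    return chi2.sf(g2, max(dof, 1)) > alpha  # True => independent
```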
Figure 8: Difference between the mean accuracy of ÃIT_t-G and the mean accuracy of SIT for a number of real-world data sets. The error bars denote the 95% confidence interval of the difference.

Figure 8 and Table 1 show the results of our comparison between the argumentative test ÃIT_t-G and the statistical test SIT for real-world data sets. In the table, the best-performing method is shown in bold. The figure contains 6 plots, one for each data set, depicting the difference between the mean value of the accuracy of ÃIT_t-G and that of SIT, where as usual a positive value denotes an improvement of ÃIT_t-G over SIT. While in a few cases the average difference is negative (e.g., data set cmc, N = 40), in each case the negative value is not statistically significant, as the confidence interval contains a portion of the positive half-plane.

                              car       cmc       flare2    letterRecognition  nursery   alarm
Domain size                   7         10        13        17                 9         37
Data set size                 1730      1475      1067      20002              12962     20003

N = 40
  SIT                         80.1      77.8      77.0      47.9               83.3      76.7
  ÃIT_t-G                     80.1      77.5      77.1      47.8               83.8      76.7
  ÃIT_t-G − SIT               0.0±0.7   -0.3±0.6  0.1±0.3   -0.1±0.2           0.4±0.1   0.0±0.4
  Runtime of ÃIT_t-G (ms)     0.56      1.07      2.61      4.19               0.88      52.06

N = 240
  SIT                         86.7      84.1      85.5      50.7               86.1      84.3
  ÃIT_t-G                     88.6      84.7      86.9      51.0               87.2      85.1
  ÃIT_t-G − SIT               1.9±0.6   0.5±1.2   1.3±0.8   0.2±0.4            1.1±0.4   0.8±0.3
  Runtime of ÃIT_t-G (ms)     1.37      5.19      8.73      90.50              1.84      202.05

N = 600
  SIT                         —         —         —         55.8               88.5      88.6
  ÃIT_t-G                     —         —         —         57.3               89.3      89.8
  ÃIT_t-G − SIT               —         —         —         1.5±0.5            0.8±0.1   1.2±0.4
  Runtime of ÃIT_t-G (ms)     —         —         —         575.53             4.37      547.77

N = 1200
  SIT                         —         —         —         63.3               89.7      90.8
  ÃIT_t-G                     —         —         —         64.3               91.2      92.0
  ÃIT_t-G − SIT               —         —         —         1.0±0.3            1.5±0.3   1.2±0.4
  Runtime of ÃIT_t-G (ms)     —         —         —         2008.76            14.05     1151.05

N = 3500
  SIT                         —         —         —         73.8               94.1      95.2
  ÃIT_t-G                     —         —         —         76.5               95.4      96.3
  ÃIT_t-G − SIT               —         —         —         2.6±0.7            1.3±0.3   1.1±0.3
  Runtime of ÃIT_t-G (ms)     —         —         —         24540.51           76.48     3895.2

Table 1: Average accuracies (in percentage) of SIT and ÃIT_t-G, their differences (denoted ÃIT_t-G − SIT in the table), the 95% confidence interval for the difference, and the average runtime per test (in ms) for ÃIT_t-G for several real-world and benchmark data sets. For each data set the table shows these quantities for number of data points N = 40, 240, 600, 1200, 3500. The best-performing algorithm (ÃIT_t-G or SIT, with respect to accuracy) is indicated in bold. Empty cells correspond to cases where one third of the data set was smaller than the value of N in that column.

The figure demonstrates a clear advantage of the argumentative approach, with all data sets reaching statistically significant positive improvements in accuracy of up to 3% and all confidence intervals covering positive values either partially or completely. The table also shows the average execution time (in ms) for the ÃIT_t-G tests evaluated.

8. Conclusion

We presented a framework for addressing one of the most important problems of independence-based structure discovery algorithms, namely the problem of unreliability of statistical independence tests. Our main idea was to recognize the existence of interdependences among the outcomes of conditional independence tests—in the form of Pearl's axiomatic characterization of the conditional independence relation—that can be seen as integrity constraints and exploited to correct unreliable statistical tests. We modeled this setting as a knowledge base containing conditional independences that are potentially inconsistent, and used the preference-based argumentation framework to reason with and resolve these inconsistencies. We presented in detail how to apply the argumentation framework to independence knowledge bases and how to compute the preference among the independence propositions. We also presented a number of algorithms, both exact and approximate, for implementing statistical testing using this framework.
We analyzed the approximate algorithm and proved that it has polynomial worst-case execution time. We also experimentally verified that its accuracy improvement is close to that of the exact algorithm while providing orders-of-magnitude faster execution, making its use possible for causal discovery in large domains. Overall, our experimental evaluation demonstrated statistically significant improvements in the accuracy of causal discovery for the overwhelming majority of sampled, benchmark, and real-world data sets.

Appendix A. Computability of the Argumentative Independence Test

In this appendix we prove that the argumentative test terminates, a property that we call its computability. Some of the theorems and lemmas presented are not original work but adaptations of well-known properties of relations. We include them to allow a complete exposition of the proof of computability, given by Theorem 21. We first introduce some notation. We denote independence propositions (e.g., (X ⊥ Y | Z)) by σ and their negation (e.g., (X ⊥̸ Y | Z)) by ¬σ. We abbreviate their corresponding propositional arguments ({σ}, σ) and ({¬σ}, ¬σ) by a_σ and a_¬σ, respectively, and we will refer to a_¬σ as the negation of a_σ (and vice versa). Also, we use the predicates A(a), R(a), and Ab(a) to denote the fact that the argument a is accepted, rejected, or in abeyance, respectively. For completeness we repeat here the definitions of a strict and a transitive preference relation; we write a ≻π b to denote that argument a is preferred to argument b under π.

Definition 13. We say a preference relation π over arguments is strict if the ordering of arguments induced by it is strict and total, that is, for every pair of arguments a and b,

$$a \succ_\pi b \iff \neg(b \succ_\pi a). \tag{27}$$

Definition 14. We say that a preference relation π over arguments is transitive if, for every three arguments a, b and c,

$$a \succ_\pi b \,\wedge\, b \succ_\pi c \implies a \succ_\pi c.$$

Lemma 24. A strict preference relation π satisfies the condition that for every pair of arguments such that a defeats b and b defeats a, it is the case that a attacks b or b attacks a; that is, at least one of a and b attacks the other.

Proof. We prove by contradiction: let us assume that a defeats b and b defeats a, but neither a attacks b nor b attacks a. By the definition of the attack relation (Definition 15),

¬(a attacks b) ⟹ ¬¬(b ≻π a) ⟹ b ≻π a

and

¬(b attacks a) ⟹ ¬¬(a ≻π b) ⟹ a ≻π b.

However, this is a contradiction since, by assumption, the preference ordering is strict, and therefore it cannot be true that both a ≻π b and b ≻π a hold at the same time.

Lemma 25. A strict preference relation π satisfies the condition that for every pair a and b of arguments, it is not the case that both a attacks b and b attacks a; that is, there can be no mutual attack.

Proof. We prove by contradiction. Let us consider two mutually attacking arguments a and b. By the definition of the attack relation, and because π is a total order, we have that

a attacks b ⟹ ¬(b ≻π a) ⟹ a ≻π b ∨ a ≡π b

and

b attacks a ⟹ ¬(a ≻π b) ⟹ b ≻π a ∨ b ≡π a,

where a ≡π b means a is equally preferable to b. However, equality of preference is not possible in a strict preference relation. Therefore it must be the case that a ≻π b and b ≻π a, which is a contradiction of Eq. (27), again due to strictness.

We next prove that no argument is in abeyance if the preference relation over arguments is strict. For that, we first prove that an argument in abeyance is always attacked by at least one other argument in abeyance.

Lemma 8. For every argument a, Ab(a) ⟹ ∃b ∈ attackers(a), Ab(b).
Proof. By definition, an argument a is in abeyance if it is neither accepted nor rejected. Applying the definitions of acceptance and rejection and manipulating the Boolean formulae, we obtain

$$\begin{aligned}
Ab(a) &\iff \neg A(a) \wedge \neg R(a)\\
&\iff \neg\big(\forall b \in \mathrm{attackers}(a),\, R(b)\big) \;\wedge\; \neg\big(\exists b \in \mathrm{attackers}(a),\, A(b)\big)\\
&\iff \big(\exists b \in \mathrm{attackers}(a),\, \neg R(b)\big) \;\wedge\; \big(\forall b \in \mathrm{attackers}(a),\, \neg A(b)\big)\\
&\iff \big(\exists b \in \mathrm{attackers}(a),\, A(b) \vee Ab(b)\big) \;\wedge\; \big(\forall b \in \mathrm{attackers}(a),\, \neg A(b)\big)\\
&\iff \big(\exists b \in \mathrm{attackers}(a),\, Ab(b)\big) \;\wedge\; \big(\forall b \in \mathrm{attackers}(a),\, \neg A(b)\big)\\
&\implies \exists b \in \mathrm{attackers}(a),\, Ab(b).
\end{aligned}$$

Definition 22. An attack sequence is a sequence a_1, a_2, ..., a_n of n arguments such that for every 2 ≤ i ≤ n, a_i attacks a_{i−1}.

Lemma 26. Let ⟨A, R, π⟩ be a PAF with a strict and transitive preference relation π. Then no argument can appear more than once in any attack sequence; that is, for every attack sequence a_1, a_2, ..., a_n and every pair of integers i, j ∈ [1, n] such that i ≠ j, a_i ≠ a_j.

Proof. We first note that, by the definition of the attack relation, it must be the case that for any two consecutive arguments a_i, a_{i+1}, it is true that ¬(a_i ≻π a_{i+1}). Since π is strict, this is equivalent to a_{i+1} ≻π a_i (cf. Eq. (27)). That is,

$$a_n \succ_\pi a_{n-1} \succ_\pi \cdots \succ_\pi a_2 \succ_\pi a_1. \tag{28}$$

We now assume, for contradiction, that there exists an argument a′ that appears twice in the attack sequence, at indexes i′ and j′; that is, ∃ i′, j′ ∈ [1, n], i′ ≠ j′, such that a_{i′} = a_{j′} = a′. Since no argument defeats itself, it cannot attack itself, and thus the smallest possible attack sequence with a repeated argument must have length at least 3. From this fact, Eq. (28), and transitivity, there must exist an argument b ≠ a′ such that a′ ≻π b ≻π a′. This last fact implies that both a′ ≻π b and b ≻π a′ must hold, which contradicts strictness (Eq. (27)).

A corollary of this lemma is the following theorem.

Theorem 23. Every attack sequence a_1, a_2, ..., a_n in a PAF ⟨A, R, π⟩ with strict and transitive π and finite A is finite.

Proof. Follows directly from Lemma 26 and the fact that A is finite.

We can now prove the main result of this section in the following theorem.

Theorem 21. Given an arbitrary triplet t = (X, Y | Z) and a PAF ⟨A, R, π⟩ with a strict and transitive preference relation π and a finite argument set A, the top-down algorithm of Algorithm 3, run for input t over ⟨A, R, π⟩, terminates.

Proof. In the tree traversed by the top-down algorithm, any path from the root to a leaf is an attack sequence. Since for strict and transitive π and finite A each such sequence is finite, the algorithm always terminates.

Appendix B. Validity of the Argumentative Independence Test

In this section we prove the property of the argumentative independence test of deciding that an input triplet (X, Y | Z) evaluates to either independence or dependence, but not both or neither. We call this property the validity of the test. We start by proving that, under the assumption of a strict and transitive preference relation, no argument is in abeyance.

Theorem 20. Let ⟨A, R, π⟩ be a PAF with a strict and transitive preference relation π. Then no argument a ∈ A is in abeyance.

Proof. Let us assume, for contradiction, that there is an argument a in abeyance. From Lemma 8, not only does a have an attacker in abeyance, say argument b, but b also has an attacker in abeyance, and so on. That is, we can construct an attack sequence starting at a that contains only arguments in abeyance. Moreover, this sequence must be infinite, since the lemma assures us that we always have at least one attacker in abeyance. This is in direct contradiction with Theorem 23.
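Theorems 20 and 21 together justify a very small recursive reading of the top-down test; the following sketch is ours (not the paper's Algorithm 3) and assumes `attackers` encodes the attack relation of a PAF with a strict, transitive preference.

```python
def accepted(a, attackers):
    """Top-down acceptance: a is accepted iff every attacker of a is
    rejected.  With no argument in abeyance (Theorem 20), rejection
    coincides with non-acceptance, so the single rule below suffices.
    The recursion terminates because every root-to-leaf path is an
    attack sequence, which cannot repeat arguments (Lemma 26), and
    the argument set is finite (Theorem 23)."""
    return all(not accepted(b, attackers) for b in attackers(a))
```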
Corollary 27. For every argument a in a PAF ⟨A, R, π⟩ with strict and transitive π, A(a) ⟺ ¬R(a).

We now prove a number of lemmas that hold only for the sub-class of propositional arguments (arguments whose support contains only one proposition, equal to the head of that argument). We start with a lemma that demonstrates that it cannot be the case that an attacker of a propositional argument a_σ and an attacker of its negation a_¬σ do not attack each other; the former must attack the latter or vice versa.

Lemma 28. Let ⟨A, R, π⟩ be a PAF with a strict preference relation π, let a_σ ∈ A be a propositional argument, and let a_¬σ be its negation. For every pair of arguments b and c that attack a_σ and a_¬σ respectively,

(b attacks c) ∨ (c attacks b).

Proof. Since a_σ and a_¬σ are propositional arguments, their support contains the head and only the head, and thus any defeater (i.e., rebutter or undercutter) must have as head ¬σ and σ, respectively; that is, the head of b must be ¬σ and the head of c must be σ. Thus, b rebuts (and thus defeats) c and vice versa. The lemma then follows directly from Lemma 24.

Lemma 29. Let ⟨A, R, π⟩ be a PAF with a strict preference relation π, and let a_σ and a_¬σ be a propositional argument and its negation. Then R(a_σ) ⟹ ¬R(a_¬σ).

Proof. By assumption, R(a_σ). We assume, for contradiction, that R(a_¬σ). Then, by the definition of rejection, ∃b ∈ attackers(a_σ) such that A(b), and ∃c ∈ attackers(a_¬σ) such that A(c). By Lemma 28, b attacks c or c attacks b. In either case, an accepted argument is attacking an accepted argument, which contradicts the definition of acceptance.

Lemma 30. Given a PAF ⟨A, R, π⟩ with a strict preference relation π, every propositional argument a_σ ∈ A satisfies A(a_σ) ⟹ ¬A(a_¬σ).

Proof. We prove by contradiction. Let us assume that both a_σ and a_¬σ are accepted. Since a_σ and a_¬σ are propositional arguments, they defeat each other. Then, by Lemma 24, a_σ attacks a_¬σ or vice versa. In either case an accepted argument has an accepted attacker, which is a contradiction.

We now prove Theorem 19, which was introduced in Section 4 and is reproduced here for convenience.

Theorem 19. Given a PAF ⟨A, R, π⟩ with a strict and transitive preference relation π, every propositional argument a_σ ∈ A and its negation a_¬σ satisfy A(a_σ) ⟺ R(a_¬σ).

Proof. The (⟹) direction follows from Lemma 30 and Theorem 20. The (⟸) direction follows from Lemma 29 and Theorem 20.

In Section 4 we defined the following semantics for deciding on the dependence or independence of an input triplet (X, Y | Z):

$$\begin{aligned}
\big(\{(X \not\perp Y \mid Z)\},\ (X \not\perp Y \mid Z)\big)\ \text{is accepted} &\iff (X \not\perp Y \mid Z)\ \text{is accepted} \implies (X \not\perp Y \mid Z)\\
\big(\{(X \perp Y \mid Z)\},\ (X \perp Y \mid Z)\big)\ \text{is accepted} &\iff (X \perp Y \mid Z)\ \text{is accepted} \implies (X \perp Y \mid Z)
\end{aligned} \tag{29}$$

where acceptance is defined over an independence-based PAF as defined in Section 3.3. For this argumentative test of independence to be valid, its decision must be non-ambiguous; that is, it must decide either independence or dependence, but not both or neither. For that, exactly one of the antecedents of the above implications must be true. This is formalized by Theorem 18, stated and proved below.
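The decision semantics of Eq. (29) reduces to checking which propositional argument is accepted; a minimal sketch follows, where the `("not", sigma)` encoding of the negation is ours, purely for illustration.

```python
def argumentative_independence_test(sigma, accepted):
    """Decision rule of Eq. (29): declare independence iff the
    propositional argument for sigma is accepted.  Under a strict and
    transitive preference exactly one of the two propositional
    arguments is accepted (Theorems 19 and 20), so the decision is
    never ambiguous."""
    ind = accepted(({sigma}, sigma))
    dep = accepted(({("not", sigma)}, ("not", sigma)))
    assert ind != dep, "exactly one must be accepted (Theorem 18)"
    return ind  # True: independence; False: dependence
```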
Theorem 18. For any input triplet σ = (X, Y | Z), the argumentative independence test defined by Eq. (29) produces a non-ambiguous decision; that is, it decides that σ evaluates to either independence or dependence, but not both or neither.

Proof. Let us denote (X ⊥ Y | Z) by σ_t and (X ⊥̸ Y | Z) by σ_f. Since strictness and transitivity of the independence preference relation hold (proved in Section 3.3, Lemmas 16 and 17 respectively), Theorems 19 and 20 hold as well. From Theorem 20 we know that neither of the propositional arguments is in abeyance. Thus, since a_{σ_t} corresponds to the negation of a_{σ_f}, it follows from Theorem 19 that exactly one of them is accepted.

References

H. Abdi. The Bonferroni and Šidák corrections for multiple comparisons. In Neil Salkind, editor, Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage, 2007.

A. Agresti. Categorical Data Analysis. Wiley, 2nd edition, 2002.

L. Amgoud and C. Cayrol. A reasoning model based on the production of acceptable arguments. Annals of Mathematics and Artificial Intelligence, 34:197–215, 2002.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1):289–300, 1995.

Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188, 2001.

W. G. Cochran. Some methods of strengthening the common χ² tests. Biometrics, 10:417–451, 1954.

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2nd edition, 2001.

D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science, 1998.

A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society, 41:1–31, 1979.

P. M. Dung. On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence, 77:321–357, 1995.

P. Gärdenfors. Belief Revision. Cambridge Computer Tracts. Cambridge University Press, Cambridge, 1992.

P. Gärdenfors and H. Rott. Belief revision. In Gabbay, D. M., Hogger, C. J. and Robinson, J. A., editors, Handbook of Logic in Artificial Intelligence and Logic Programming, volume 4. Clarendon Press, Oxford, 1995.

S. Hettich and S. D. Bay. The UCI KDD archive. Irvine, CA: University of California, Department of Information and Computer Science, 1999.

Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4):800–802, December 1988.

J. S. Ide, F. G. Cozman, and F. T. Ramos. Generating random Bayesian networks with constraints on induced width. Brazilian Symposium on Artificial Intelligence, Recife, Pernambuco, Brazil, 2002.

A. C. Kakas and F. Toni. Computing argumentation in logic programming. Journal of Logic and Computation, 9(4):515–562, 1999.

M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994.

R. P. Loui. Defeat among arguments: a system of defeasible inference. Computational Intelligence, 2:100–106, 1987.

J. P. Martins. Belief revision. In Shapiro, S. C., editor, Encyclopedia of Artificial Intelligence, pages 110–116. John Wiley & Sons, New York, second edition, 1992.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 2nd edition, 1988.

J. Pearl and A. Paz. GRAPHOIDS: A graph-based logic for reasoning about relevance relations.
Technical Report 850038 (R-53-L), Cognitive Systems Laboratory, University of California, 1985.

J. L. Pollock. How to reason defeasibly. Artificial Intelligence, 57:1–42, 1992.

H. Prakken. Logical Tools for Modelling Legal Argument. A Study of Defeasible Reasoning in Law. Kluwer Law and Philosophy Library, Dordrecht, 1997.

H. Prakken and G. Vreeswijk. Logics for Defeasible Argumentation, volume 4 of Handbook of Philosophical Logic. Kluwer Academic Publishers, Dordrecht, 2nd edition, 2002.

S. C. Shapiro. Belief revision and truth maintenance systems: An overview and a proposal. Technical Report CSE 98-10, Dept. of Computer Science and Engineering, State University of New York at Buffalo, 1998.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Adaptive Computation and Machine Learning Series. MIT Press, 2nd edition, January 2000.

J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B (Methodological), 64(3):479–498, 2002.

M. Studený. Conditional independence relations have no finite complete characterization. In Transactions of the 11th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, volume B, pages 377–396, 1991.

F. Toni and A. C. Kakas. In A. Nerode, editor, 3rd International Conference on Logic Programming and Non-monotonic Reasoning, volume 928 of Lecture Notes in Artificial Intelligence, pages 401–415. Springer Verlag, 1995.
3 0.28359503 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
Author: Pradip Ghanty, Samrat Paul, Nikhil R. Pal
Abstract: In this paper we propose a new multilayer classifier architecture. The proposed hybrid architecture has two cascaded modules: a feature extraction module and a classification module. In the feature extraction module we use multilayered perceptron (MLP) neural networks, although other tools such as radial basis function (RBF) networks can be used. In the classification module we use support vector machines (SVMs)—here also other tools such as MLP or RBF can be used. The feature extraction module has several sub-modules, each of which is expected to extract features capturing the discriminating characteristics of different areas of the input space. The classification module classifies the data based on the extracted features. The resultant architecture with MLP in the feature extraction module and SVM in the classification module is called NEUROSVM. The NEUROSVM is tested on twelve benchmark data sets and the performance of the NEUROSVM is found to be better than both MLP and SVM. We also compare the performance of the proposed architecture with that of two ensemble methods: majority voting and averaging. Here also the NEUROSVM is found to perform better than these two ensemble methods. Further we explore the use of MLP and RBF in the classification module of the proposed architecture. The most attractive feature of NEUROSVM is that it practically eliminates the severe dependency of SVM on the choice of kernel. This has been verified with respect to both linear and non-linear kernels. We have also demonstrated that for the feature extraction module, the full training of MLPs is not needed. Keywords: feature extraction, neural networks (NNs), support vector machines (SVMs), hybrid system, majority voting, averaging
4 0.28260642 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
5 0.27722335 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks
Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia
Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of Lipschitz functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints. Keywords: neural networks, universal approximation, monotonicity, convexity, call options
6 0.27319235 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
7 0.27140173 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
8 0.26645267 38 jmlr-2009-Hash Kernels for Structured Data
9 0.26478693 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
10 0.26357713 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
11 0.26256484 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
12 0.26073393 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
14 0.25974312 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
15 0.2586478 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
16 0.2579318 48 jmlr-2009-Learning Nondeterministic Classifiers
17 0.25728416 29 jmlr-2009-Estimating Labels from Label Proportions
18 0.25669667 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
19 0.2565853 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
20 0.25638825 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications