jmlr jmlr2005 jmlr2005-40 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon
Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension
Reference: text
sentIndex sentText sentNum sentScore
1 We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. [sent-12, score-0.162]
2 As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. [sent-14, score-0.248]
3 We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. [sent-17, score-0.722]
4 As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. [sent-19, score-0.276]
5 Given a Bayesian network N, we introduce Edim(N) to denote the smallest dimension d such that the decisions represented by N can be implemented as inner products in d-dimensional Euclidean space. [sent-53, score-0.222]
6 The parameters can be arbitrary, where we speak of an unconstrained network, or they may be required to satisfy certain restrictions, in which case we have a network with a reduced parameter collection. [sent-56, score-0.232]
7 For both network types, we show that the “natural” inner product space, which can be obtained from the probabilistic model by straightforward algebraic manipulations, has a dimension that is the smallest possible up to a factor of 2, and even up to an additive term of 1 in some cases. [sent-57, score-0.222]
8 The lower bounds in all these cases are obtained by analyzing the Vapnik-Chervonenkis (VC) dimension of the concept class associated with the Bayesian network. [sent-59, score-0.23]
9 Thus, the tight bounds on Edim(N) reveal that the smallest possible Euclidean dimension for a Bayesian network with an explicitly given parameter collection is closely tied to its sample complexity. [sent-62, score-0.287]
10 As a second topic, we investigate a class of probabilistic models known as logistic autoregressive Bayesian networks or sigmoid belief networks. [sent-63, score-0.469]
11 Finally, we turn to the question of whether it is possible to establish an exponential lower bound on Edim(N) for the logistic autoregressive Bayesian network. [sent-68, score-0.45]
12 We succeed in giving a positive answer for an unnatural variant of this network that we introduce and call the modified logistic autoregressive Bayesian network. [sent-71, score-0.513]
13 The proof for this lower bound is based on the idea of embedding one concept class into another. [sent-74, score-0.189]
14 The exponential lower bound for the modified logistic autoregressive network is derived in Section 5. [sent-87, score-0.564]
15 2, we define Bayesian networks and the distributions and concept classes they induce. [sent-97, score-0.17]
16 The idea of a linear arrangement for a concept class is presented in Section 2. [sent-98, score-0.199]
17 , s_m} ⊆ X is said to be shattered by C if for every binary vector b ∈ {−1, 1}^m there exists some concept f ∈ C such that f(s_i) = b_i for i = 1, . . . , m. [sent-106, score-0.265]
18 The Vapnik-Chervonenkis (VC) dimension of C is given by VCdim(C) = sup{m | there is some S ⊆ X shattered by C and |S| = m}. [sent-110, score-0.194]
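As an illustration of these definitions (a sketch of ours, not code from the paper; the function names are our own), the shattering test and the VC dimension of a small, explicitly enumerated concept class can be brute-forced directly:

    from itertools import combinations

    def is_shattered(points, concepts):
        # points: tuple of domain elements; concepts: iterable of +/-1-valued functions f.
        patterns = {tuple(f(x) for x in points) for f in concepts}
        return len(patterns) == 2 ** len(points)  # every sign vector b is realized

    def vc_dimension(domain, concepts):
        # VCdim(C) = largest m such that some m-element subset of the domain is shattered.
        concepts = list(concepts)
        d = 0
        for m in range(1, len(domain) + 1):
            if any(is_shattered(S, concepts) for S in combinations(domain, m)):
                d = m
        return d

This only works for finite, explicitly listed classes; the bounds in the paper are of course obtained analytically.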
19 We use the sign function for mapping a real-valued function g to a ±1-valued concept sign ◦ g. [sent-112, score-0.23]
20 Given a concept class C over domain X and a concept class C′ over domain X′, we write C ≤ C′ if there exist mappings C ∋ f → f′ ∈ C′ and X ∋ x → x′ ∈ X′ satisfying f(x) = f′(x′) for every f ∈ C and x ∈ X. [sent-113, score-0.231]
21 Obviously, if S ⊆ X is an m-element set that is shattered by C, then S′ = {s′ | s ∈ S} ⊆ X′ is an m-element set that is shattered by C′. [sent-115, score-0.268]
22 a collection (p_{i,α})_{i∈V, α∈{0,1}^{m_i}} of programmable parameters with values in the open interval ]0, 1[, where m_i denotes the number of predecessors of node i, that is, m_i = |{ j ∈ V | ( j, i) ∈ E}|, 3. [sent-120, score-0.315]
23 Example 1 (kth-order Markov chain) For k ≥ 0, let Nk denote the unconstrained Bayesian network with Pi = {i − 1, . [sent-137, score-0.197]
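The truncated definition above presumably gives node i the k immediately preceding nodes as parents, P_i = {i − k, . . . , i − 1} ∩ {1, . . . , n}; a minimal sketch of ours under that assumption:

    def markov_chain_parents(n, k):
        # Node i of the k-th order Markov chain N_k has the (at most k) preceding nodes as parents.
        return {i: set(range(max(1, i - k), i)) for i in range(1, n + 1)}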
24 Definition 2 Let N be a Bayesian network with nodes 1, . [sent-151, score-0.161]
25 An unconstrained network that is highly connected may have a number of parameters that grows exponentially in the number of nodes. [sent-163, score-0.197]
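For instance, assuming a fully connected network on n nodes gives node i the parents 1, . . . , i − 1, the parameter count ∑_i 2^{m_i} is 2^n − 1; the following check (our own illustration) confirms this for n = 10:

    def parameter_count(parents):
        # One parameter p_{i,alpha} per node i and per assignment alpha to its m_i parents.
        return sum(2 ** len(P) for P in parents.values())

    fully_connected = {i: set(range(1, i)) for i in range(1, 11)}
    assert parameter_count(fully_connected) == 2 ** 10 - 1  # 1 + 2 + ... + 2^9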
26 We consider two types of constraints giving rise to the definitions of networks with a reduced parameter collection and logistic autoregressive networks. [sent-165, score-0.546]
27 Definition 3 A Bayesian network with a reduced parameter collection is a Bayesian network with the following constraints: For every i ∈ {1, . [sent-166, score-0.336]
28 These networks contain a decision tree Ti (or, alternatively, a decision graph Gi ) over the parent-variables of xi for every node i. [sent-184, score-0.158]
29 We finally introduce the so-called logistic autoregressive Bayesian networks, originally proposed by McCullagh and Nelder (1983), that have been shown to perform surprisingly well on certain problems (see also Neal, 1992; Saul et al. [sent-194, score-0.399]
30 Definition 4 The logistic autoregressive Bayesian network Nσ is the fully connected Bayesian network with constraints on the parameter collection given as ∀i = 1, . [sent-196, score-0.669]
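The constraint in Definition 4 is truncated above; in the usual formulation of logistic autoregressive (sigmoid belief) networks, each conditional probability is a logistic function of a weighted sum of the parent values. A hedged sketch of that assumed form, with our own names and indexing:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def logistic_autoregressive_parameter(w, i, alpha):
        # Assumed constraint: p_{i,alpha} = sigma(w_{i,0} + sum_{j<i} w_{i,j} * alpha_j),
        # so node i is governed by i weights instead of 2^{i-1} free parameters.
        # w = {i: [w_{i,0}, w_{i,1}, ..., w_{i,i-1}]}; alpha = (alpha_1, ..., alpha_{i-1}).
        return sigmoid(w[i][0] + sum(w[i][j] * alpha[j - 1] for j in range(1, i)))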
31 Definition 5 Let N be a Bayesian network with nodes 1, . [sent-204, score-0.161]
32 Definition 6 A d-dimensional linear arrangement for a concept class C over domain X is given by collections (u_f)_{f∈C} and (v_x)_{x∈X} of vectors in R^d such that ∀ f ∈ C, x ∈ X : f(x) = sign(u_f · v_x). [sent-215, score-0.232]
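A small numerical check of Definition 6 (an illustration of ours, not code from the paper): given candidate vectors for the concepts and the points, verify that every concept value is reproduced by the sign of the corresponding inner product.

    import numpy as np

    def is_linear_arrangement(U, V, concept_values):
        # U: |C| x d matrix whose rows are the u_f; V: |X| x d matrix whose rows are the v_x;
        # concept_values: |C| x |X| matrix with entries f(x) in {-1, +1}.
        return np.array_equal(np.sign(U @ V.T), concept_values)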
33 If CN is the concept class induced by a Bayesian network N , we write Edim(N ) instead of Edim(CN ). [sent-218, score-0.214]
34 Let PARITY_n be the concept class {h_a | a ∈ {0, 1}^n} of parity functions on the Boolean domain given by h_a(x) = (−1)^{a·x}, that is, h_a(x) is the parity of those x_i where a_i = 1. [sent-225, score-0.288]
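For concreteness, the parity concepts h_a can be written down directly (a sketch of ours):

    def parity_concept(a):
        # h_a(x) = (-1)^{a . x}: -1 if an odd number of the bits x_i with a_i = 1 are set, +1 otherwise.
        return lambda x: (-1) ** sum(ai * xi for ai, xi in zip(a, x))

    h = parity_concept((1, 0, 1))
    assert h((1, 1, 0)) == -1 and h((1, 1, 1)) == 1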
35 We obtain bounds for unconstrained networks and for networks with a reduced parameter collection by providing concrete linear arrangements. [sent-230, score-0.345]
36 Theorem 9 Every unconstrained Bayesian network N satisfies Edim(N) ≤ |∪_{i=1}^n 2^{P_i ∪ {i}}| ≤ 2 · ∑_{i=1}^n 2^{m_i}. [sent-232, score-0.197]
37 Proof From the expansion of P in equation (1) and the corresponding expansion of Q (with parameters q_{i,α} in the role of p_{i,α}), we obtain log(P(x)/Q(x)) = ∑_{i=1}^n ∑_{α∈{0,1}^{m_i}} [ x_i M_{i,α}(x) log(p_{i,α}/q_{i,α}) + (1 − x_i) M_{i,α}(x) log((1 − p_{i,α})/(1 − q_{i,α})) ]. [sent-233, score-0.211]
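The expansion above is linear in the indicator features x_i M_{i,α}(x) and (1 − x_i) M_{i,α}(x), which is exactly the linear arrangement behind Theorem 9. The sketch below is our own illustration (M_{i,α}(x) is taken to be the indicator that x restricted to the parents P_i equals α; the parameter tables p[i], q[i] are assumed to be dictionaries keyed by parent assignments in sorted-parent order); it builds a feature map and weight vector so that sign of their inner product equals sign(log(P(x)/Q(x))).

    import math
    import numpy as np
    from itertools import product

    def natural_feature_map(x, parents):
        # One coordinate per (node i, parent assignment alpha, bit b in {1, 0}).
        phi = []
        for i in sorted(parents):
            P = sorted(parents[i])
            alpha_x = tuple(x[j - 1] for j in P)
            for alpha in product((0, 1), repeat=len(P)):
                match = 1.0 if alpha == alpha_x else 0.0      # M_{i,alpha}(x)
                phi.append(x[i - 1] * match)                  # coordinate (i, alpha, 1)
                phi.append((1 - x[i - 1]) * match)            # coordinate (i, alpha, 0)
        return np.array(phi)

    def natural_weight_vector(p, q, parents):
        # Weights log(p/q) and log((1-p)/(1-q)), in the same coordinate order as above.
        w = []
        for i in sorted(parents):
            for alpha in product((0, 1), repeat=len(parents[i])):
                w.append(math.log(p[i][alpha] / q[i][alpha]))
                w.append(math.log((1 - p[i][alpha]) / (1 - q[i][alpha])))
        return np.array(w)

The resulting dimension is 2 · ∑_i 2^{m_i}, matching the upper bound of Theorem 9.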
38 Theorem 11 Let N^R denote the Bayesian network that has a reduced parameter collection (p_{i,c})_{1≤i≤n, 1≤c≤d_i} in the sense of Definition 3. [sent-248, score-0.191]
39 We make use of the obvious relationship log(P(x)/Q(x)) = ∑_{i=1}^n ∑_{c=1}^{d_i} [ x_i R_{i,c}(x) log(p_{i,c}/q_{i,c}) + (1 − x_i) R_{i,c}(x) log((1 − p_{i,c})/(1 − q_{i,c})) ]. [sent-251, score-0.268]
40 The linear arrangements for unconstrained Bayesian networks or for Bayesian networks with a reduced parameter collection were easy to find. [sent-254, score-0.351]
41 The concept class arising from network N_0, which we consider first, is the well-known Naïve Bayes classifier. [sent-267, score-0.214]
42 , e_n, 1 the class C_{N_0} of concepts induced by N_0, consisting of the functions of the form sign(log(P(x)/Q(x))) = sign( ∑_{i=1}^n [ x_i log(p_i/q_i) + (1 − x_i) log((1 − p_i)/(1 − q_i)) ] ), where p_i, q_i ∈ ]0, 1[, for i ∈ {1, . [sent-285, score-1.442]
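In this form the log-odds is an affine function of x, so these concepts are linear threshold functions over (x_1, . . . , x_n, 1); a small sketch of ours:

    import math

    def naive_bayes_sign(x, p, q):
        # sign(log P(x)/Q(x)) for the network N_0, rewritten as weights on the bits plus a bias.
        weights = [math.log(pi / qi) - math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)]
        bias = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
        s = sum(wi * xi for wi, xi in zip(weights, x)) + bias
        return 1 if s > 0 else -1  # assumes s != 0; the convention at 0 is immaterial here

This rewriting exhibits a linear arrangement of dimension n + 1: one coordinate per bit plus one constant coordinate carrying the bias.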
43 A further type of Bayesian network for which we derive the exact dimension has an underlying bipartite graph, where one set of nodes serves as the set of parents for all nodes in the other set. [sent-305, score-0.268]
44 Theorem 14 For k ≥ 0, let N_k denote the unconstrained network with P_i = ∅ for i = 1, . [sent-306, score-0.197]
45 According to Theorem 12, for each dichotomy (M_j^−, M_j^+) there exist parameter values p_i^j, q_i^j, where 1 ≤ i ≤ n − k, such that N_0 with these parameter settings induces this dichotomy on M. [sent-334, score-0.513]
46 Since every dichotomy of S can be implemented in this way, S is shattered by Nk . [sent-347, score-0.238]
47 1 we shall establish lower bounds on Edim(N ) for unconstrained Bayesian networks and in Section 4. [sent-351, score-0.223]
48 These results are obtained by providing embeddings of concept classes, as introduced in Section 2. [sent-354, score-0.166]
49 , n, S_i is a set that is shattered by F_i, then ∪_{i=1}^n τ_i(S_i) is shattered by C_{N,F}. [sent-392, score-0.268]
50 The preceding definition and lemma are valid for unconstrained as well as constrained networks, as they make use only of the graph underlying the network and do not refer to the values of the parameters. [sent-394, score-0.292]
51 4.1 Lower Bounds for Unconstrained Bayesian Networks. The next theorem is the main step in deriving, for an arbitrary unconstrained network N, a lower bound on Edim(N). [sent-398, score-0.278]
52 Theorem 17 Let N be an unconstrained Bayesian network and let F_i^* denote the set of all ±1-valued functions on domain {0, 1}^{m_i}. [sent-400, score-0.197]
53 To this end, we define the parameters for the distributions P and Q as p_{i,α} = 2^{−2^{i−1} n}/2 if f_i(α) = −1 and 1/2 if f_i(α) = +1, and q_{i,α} = 1/2 if f_i(α) = −1 and 2^{−2^{i−1} n}/2 if f_i(α) = +1. [sent-407, score-0.207]
54 An easy calculation now shows that log(p_{i,α}/q_{i,α}) = f_i(α) 2^{i−1} n and log((1 − p_{i,α})/(1 − q_{i,α})) < 1. [sent-408, score-0.175]
55 Thus, L_{N,f}(x) = sign(log(P(x)/Q(x))) would follow immediately from sign(log(P(x)/Q(x))) = sign(log(p_{i∗,α∗}/q_{i∗,α∗})) = f_{i∗}(α∗). [sent-415, score-0.236]
56 Expanding P and Q as given in (3), we obtain log(P(x)/Q(x)) = log(p_{i∗,α∗}/q_{i∗,α∗}) + ∑_{i=1}^{i∗−1} ∑_{α∈{0,1}^{m_i}} x_i M_{i,α}(x) log(p_{i,α}/q_{i,α}) + ∑_{i∈I} ∑_{α∈{0,1}^{m_i}} (1 − x_i) M_{i,α}(x) log((1 − p_{i,α})/(1 − q_{i,α})), where I = {1, . [sent-419, score-0.264]
57 Corollary 18 Every unconstrained Bayesian network N satisfies ∑_{i=1}^n 2^{m_i} ≤ Edim(N) ≤ |∪_{i=1}^n 2^{P_i ∪ {i}}| ≤ 2 · ∑_{i=1}^n 2^{m_i}. [sent-426, score-0.197]
58 Theorem 20 Let N^R denote the Bayesian network that has a reduced parameter collection (p_{i,c})_{1≤i≤n, 1≤c≤d_i} in the sense of Definition 3. [sent-437, score-0.191]
59 , fn ) of the form fi (x) = gi (Ri (x)) for some function gi : {1, . [sent-449, score-0.172]
60 Theorem 17 can be viewed as a special case of Theorem 20 since every unconstrained network can be considered as a network with a reduced parameter collection where the functions Ri are 1-1. [sent-456, score-0.419]
61 Corollary 21 Let N^R denote the Bayesian network that has a reduced parameter collection (p_{i,c})_{1≤i≤n, 1≤c≤d_i} in the sense of Definition 3. [sent-459, score-0.191]
62 4.3 Lower Bounds for Logistic Autoregressive Networks. The following result is not obtained by embedding a concept class into a logistic autoregressive Bayesian network. [sent-463, score-0.537]
63 2 to derive a bound using the VC dimension by directly showing that these networks can shatter sets of the claimed size. [sent-468, score-0.186]
64 Theorem 22 Let Nσ denote the logistic autoregressive Bayesian network from Definition 4. [sent-469, score-0.513]
65 Proof We show that the following set S is shattered by the concept class CNσ . [sent-471, score-0.234]
66 The expansion of P and Q yields, for every s^{(i,c)} ∈ S, log(P(s^{(i,c)})/Q(s^{(i,c)})) = log(p_{i,α_{i,c}}/q_{i,α_{i,c}}) + ∑_{j=1}^{i−1} ∑_{α∈{0,1}^{j−1}} s_j^{(i,c)} M_{j,α}(s^{(i,c)}) log(p_{j,α}/q_{j,α}) + ∑_{j∈I} ∑_{α∈{0,1}^{j−1}} (1 − s_j^{(i,c)}) M_{j,α}(s^{(i,c)}) log((1 − p_{j,α})/(1 − q_{j,α})), where I = {1, . [sent-501, score-0.243]
67 Lower Bounds via Embeddings of Parity Functions. The lower bounds obtained in Section 4 rely on arguments based on the VC dimension of the respective concept class. [sent-510, score-0.23]
68 In particular, a quadratic lower bound for the logistic autoregressive network has been established. [sent-511, score-0.564]
69 Definition 23 The modified logistic autoregressive Bayesian network Nσ is the fully connected Bayesian network with nodes 0, 1, . [sent-514, score-0.674]
70 The crucial difference between Nσ and its modified variant is the node n + 1, whose sigmoidal function receives the outputs of the other sigmoidal functions as input. [sent-522, score-0.291]
71 To obtain the bound, we provide an embedding of the concept class of parity functions. [sent-524, score-0.192]
72 The following theorem motivates this construction by showing that it is impossible to obtain an exponential lower bound for Edim(Nσ) or for Edim of the modified network using the VC dimension argument, as these networks have VC dimensions that are polynomial in n. [sent-525, score-0.211]
73 Theorem 24 The logistic autoregressive Bayesian network Nσ from Definition 4 and the modified logistic autoregressive Bayesian network from Definition 23 have a VC dimension that is bounded by O(n^6). [sent-526, score-1.086]
74 The neural networks for the concepts in CNσ consist of sigmoidal units, product units, and units computing second-order polynomials. [sent-530, score-0.273]
75 Regarding the factors p_{i,α}^{x_i} (1 − p_{i,α})^{1−x_i}, we observe that p_{i,α}^{x_i} (1 − p_{i,α})^{1−x_i} = p_{i,α} x_i + (1 − p_{i,α})(1 − x_i) = 2 p_{i,α} x_i − x_i − p_{i,α} + 1, where the first equation is valid because x_i ∈ {0, 1}. [sent-537, score-0.212]
76 Similarly, the value of q_{i,α}^{x_i} (1 − q_{i,α})^{1−x_i} can also be determined using sigmoidal units and polynomial units of order 2. [sent-539, score-0.276]
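A quick numeric sanity check of the interpolation identity used above (our own illustration):

    # For x in {0, 1}: p**x * (1 - p)**(1 - x) equals the affine expression 2*p*x - x - p + 1.
    for p in (0.2, 0.5, 0.9):
        for x in (0, 1):
            assert abs(p ** x * (1 - p) ** (1 - x) - (2 * p * x - x - p + 1)) < 1e-12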
77 This network has O(n^2) parameters and O(n) computation nodes, each of which is a sigmoidal unit, a second-order unit, or a product unit. [sent-542, score-0.244]
78 Theorem 2 of Schmitt (2002) shows that every such network with W parameters and k computation nodes, which are sigmoidal and product units, has VC dimension O(W^2 k^2). [sent-543, score-0.335]
79 Thus, we obtain the claimed bound O(n^6) for the logistic autoregressive Bayesian network Nσ. [sent-545, score-0.569]
80 For the modified logistic autoregressive network we have only to take one additional sigmoidal unit into account. [sent-546, score-0.643]
81 Theorem 25 Let Nσ denote the modified logistic autoregressive Bayesian network with n + 2 nodes and assume that n is a multiple of 4. [sent-553, score-0.56]
82 For every a ∈ {0, 1}^{n/2}, there exists a pair (P, Q) of distributions from D_{Nσ} such that for every x ∈ {0, 1}^{n/2}, (−1)^{a·x} = sign(log(P(x′)/Q(x′))). [sent-567, score-0.18]
83 The proof of the claim makes use of the following facts: Fact 1 For every a ∈ {0, 1}^{n/2}, the function (−1)^{a·x} can be computed by a two-layer threshold circuit with n/2 threshold units at the first layer and one threshold unit as output node at the second layer. [sent-569, score-0.233]
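One standard way to realize Fact 1 (a sketch of ours, not necessarily the construction used in the paper): first-layer unit t fires iff the selected bits sum to at least t, and the output unit combines the first-layer outputs with alternating weights +1, −1, +1, . . . , which fires exactly when the selected bits have odd sum.

    def threshold(z, t):
        return 1 if z >= t else 0

    def parity_by_threshold_circuit(a, x):
        # Computes (-1)^{a . x} with len(x) first-layer threshold units and one output unit.
        s = sum(ai * xi for ai, xi in zip(a, x))                   # common weighted input
        hidden = [threshold(s, t) for t in range(1, len(x) + 1)]   # unit t fires iff s >= t
        alternating = sum((-1) ** t * h for t, h in enumerate(hidden))  # weights +1, -1, +1, ...
        odd = threshold(alternating, 1)                            # output unit fires iff s is odd
        return -1 if odd else 1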
84 Fact 2 Each two-layer threshold circuit C can be simulated by a two-layer sigmoidal circuit C′ with the same number of units and the following output convention: C(x) = 1 =⇒ C′(x) ≥ 2/3 and C(x) = 0 =⇒ C′(x) ≤ 1/3. [sent-570, score-0.399]
85 Fact 3 Network Nσ contains as a sub-network a two-layer sigmoidal circuit C with n/2 input nodes, n/2 sigmoidal units at the first layer, and one sigmoidal unit at the second layer. [sent-571, score-0.561]
86 , αn/2 ), where C denotes an arbitrary two-layer sigmoidal circuit as described in Fact 3. [sent-584, score-0.228]
87 Indeed, this is the output of a two-layer sigmoidal circuit C on the input (α1 , . [sent-590, score-0.228]
88 Let C be the sigmoidal circuit that computes (−1)a x for some fixed a ∈ {0, 1}n/2 according to Facts 1 and 2. [sent-595, score-0.228]
89 Combining Theorem 25 with Corollary 8, we obtain the exponential lower bound for the modified logistic autoregressive Bayesian network. [sent-615, score-0.45]
90 Corollary 26 Let Nσ denote the modified logistic autoregressive Bayesian network. [sent-616, score-0.399]
91 As the main results, we have established tight bounds on the Euclidean dimension of spaces in which two-label classifications of Bayesian networks with binary nodes can be implemented. [sent-629, score-0.248]
92 Bounds on the VC dimension of concept classes abound. [sent-631, score-0.16]
93 Surprisingly, our investigation of the dimensionality of embeddings led to some exact values of the VC dimension for nontrivial Bayesian networks. [sent-633, score-0.153]
94 The VC dimension can be employed to obtain tight bounds on the complexity of model selection, that is, on the amount of information required for choosing a Bayesian network that performs well on unseen data. [sent-634, score-0.245]
95 In frameworks where this amount can be expressed in terms of the VC dimension, the tight bounds for the embeddings of Bayesian networks established here show that the sizes of the training samples required for learning can also be estimated using the Euclidean dimension. [sent-635, score-0.207]
96 Another consequence of this close relationship between VC dimension and Euclidean dimension is that these networks can be replaced by linear classifiers without a significant increase in the required sample sizes. [sent-636, score-0.19]
97 Whether these conclusions can be drawn also for the logistic autoregressive network is an open issue. [sent-637, score-0.513]
98 First, since we considered only networks with binary nodes, analogous questions regarding Bayesian networks with multiple-valued nodes or even continuous-valued nodes are certainly of interest. [sent-641, score-0.234]
99 Further, with regard to logistic autoregressive Bayesian networks, we were able to obtain an exponential lower bound only for a variant of them. [sent-643, score-0.45]
100 On the smallest possible dimension and the largest possible margin of linear arrangements representing given concept classes. [sent-706, score-0.211]
wordName wordTfidf (topN-words)
[('edim', 0.476), ('autoregressive', 0.281), ('pi', 0.244), ('chmitt', 0.244), ('bayesian', 0.211), ('vcdim', 0.207), ('vc', 0.179), ('nakamura', 0.171), ('shattered', 0.134), ('sigmoidal', 0.13), ('qi', 0.123), ('imon', 0.122), ('nner', 0.122), ('paces', 0.122), ('roduct', 0.122), ('mi', 0.121), ('logistic', 0.118), ('network', 0.114), ('schmitt', 0.113), ('concept', 0.1), ('etworks', 0.099), ('arrangement', 0.099), ('circuit', 0.098), ('wi', 0.089), ('unconstrained', 0.083), ('hans', 0.082), ('ulrich', 0.082), ('forster', 0.082), ('ri', 0.077), ('units', 0.073), ('dichotomy', 0.073), ('cn', 0.073), ('nk', 0.071), ('networks', 0.07), ('fi', 0.069), ('embeddings', 0.066), ('sign', 0.065), ('bochum', 0.061), ('niels', 0.061), ('dimension', 0.06), ('di', 0.057), ('boolean', 0.057), ('parity', 0.054), ('dn', 0.054), ('log', 0.053), ('euclidean', 0.052), ('arrangements', 0.051), ('parityn', 0.049), ('inner', 0.048), ('nodes', 0.047), ('bounds', 0.045), ('tsuda', 0.045), ('berlin', 0.045), ('collection', 0.042), ('pxi', 0.041), ('rgen', 0.041), ('simon', 0.038), ('embedding', 0.038), ('gi', 0.037), ('atsuyoshi', 0.037), ('kiltz', 0.037), ('ruhr', 0.037), ('lecture', 0.036), ('reduced', 0.035), ('notes', 0.034), ('vx', 0.033), ('pn', 0.032), ('motoaki', 0.031), ('mccullagh', 0.031), ('koji', 0.031), ('lmi', 0.031), ('ower', 0.031), ('node', 0.031), ('every', 0.031), ('claimed', 0.03), ('theorem', 0.03), ('fn', 0.029), ('bin', 0.028), ('major', 0.028), ('ha', 0.027), ('nontrivial', 0.027), ('tommi', 0.027), ('corollary', 0.026), ('bound', 0.026), ('markov', 0.026), ('xi', 0.026), ('tight', 0.026), ('ji', 0.025), ('parent', 0.025), ('lemma', 0.025), ('lower', 0.025), ('alt', 0.025), ('kawanabe', 0.025), ('arriaga', 0.024), ('balcan', 0.024), ('eike', 0.024), ('frankl', 0.024), ('hajnal', 0.024), ('meir', 0.024), ('oliver', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon
Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension
2 0.082204141 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
Author: Dmitry Rusakov, Dan Geiger
Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
3 0.06744466 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
5 0.059535 54 jmlr-2005-Managing Diversity in Regression Ensembles
Author: Gavin Brown, Jeremy L. Wyatt, Peter Tiňo
Abstract: Ensembles are a widely used and effective technique in machine learning—their success is commonly attributed to the degree of disagreement, or ‘diversity’, within the ensemble. For ensembles where the individual estimators output crisp class labels, this ‘diversity’ is not well understood and remains an open research issue. For ensembles of regression estimators, the diversity can be exactly formulated in terms of the covariance between individual estimator outputs, and the optimum level is expressed in terms of a bias-variance-covariance trade-off. Despite this, most approaches to learning ensembles use heuristics to encourage the right degree of diversity. In this work we show how to explicitly control diversity through the error function. The first contribution of this paper is to show that by taking the combination mechanism for the ensemble into account we can derive an error function for each individual that balances ensemble diversity with individual accuracy. We show the relationship between this error function and an existing algorithm called negative correlation learning, which uses a heuristic penalty term added to the mean squared error function. It is demonstrated that these methods control the bias-variance-covariance trade-off systematically, and can be utilised with any estimator capable of minimising a quadratic error function, for example MLPs, or RBF networks. As a second contribution, we derive a strict upper bound on the coefficient of the penalty term, which holds for any estimator that can be cast in a generalised linear regression framework, with mild assumptions on the basis functions. Finally we present the results of an empirical study, showing significant improvements over simple ensemble learning, and finding that this technique is competitive with a variety of methods, including boosting, bagging, mixtures of experts, and Gaussian processes, on a number of tasks. Keywords: ensemble, diversity, regression estimators, neural networks, hessia
6 0.059050858 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
7 0.058311529 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
8 0.055700257 71 jmlr-2005-Variational Message Passing
9 0.053234149 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
10 0.046419229 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
12 0.043134101 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
13 0.040530927 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
14 0.040139757 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
15 0.039671559 32 jmlr-2005-Expectation Consistent Approximate Inference
16 0.039648399 44 jmlr-2005-Learning Module Networks
17 0.039376352 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
18 0.036007691 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
19 0.035230264 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
20 0.034410641 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
topicId topicWeight
[(0, 0.225), (1, -0.094), (2, -0.006), (3, 0.026), (4, -0.014), (5, -0.015), (6, -0.093), (7, -0.031), (8, 0.043), (9, -0.094), (10, 0.126), (11, 0.039), (12, 0.311), (13, 0.015), (14, 0.161), (15, 0.04), (16, -0.024), (17, -0.022), (18, -0.182), (19, -0.186), (20, 0.024), (21, 0.138), (22, 0.01), (23, -0.116), (24, 0.038), (25, -0.153), (26, 0.051), (27, -0.164), (28, 0.02), (29, -0.083), (30, 0.138), (31, -0.209), (32, 0.004), (33, -0.171), (34, -0.275), (35, 0.021), (36, 0.038), (37, -0.017), (38, -0.011), (39, 0.226), (40, 0.058), (41, -0.224), (42, 0.053), (43, 0.035), (44, -0.227), (45, 0.107), (46, 0.082), (47, -0.133), (48, 0.338), (49, -0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.96358263 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon
Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension
2 0.27391517 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
Author: Dmitry Rusakov, Dan Geiger
Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
3 0.2334722 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length √ ˜ ω( n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
4 0.2319964 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
Author: Fabio Aiolli, Alessandro Sperduti
Abstract: Winner-take-all multiclass classifiers are built on the top of a set of prototypes each representing one of the available classes. A pattern is then classified with the label associated to the most ‘similar’ prototype. Recent proposal of SVM extensions to multiclass can be considered instances of the same strategy with one prototype per class. The multi-prototype SVM proposed in this paper extends multiclass SVM to multiple prototypes per class. It allows to combine several vectors in a principled way to obtain large margin decision functions. For this problem, we give a compact constrained quadratic formulation and we propose a greedy optimization algorithm able to find locally optimal solutions for the non convex objective function. This algorithm proceeds by reducing the overall problem into a series of simpler convex problems. For the solution of these reduced problems an efficient optimization algorithm is proposed. A number of pattern selection strategies are then discussed to speed-up the optimization process. In addition, given the combinatorial nature of the overall problem, stochastic search strategies are suggested to escape from local minima which are not globally optimal. Finally, we report experiments on a number of datasets. The performance obtained using few simple linear prototypes is comparable to that obtained by state-of-the-art kernel-based methods but with a significant reduction (of one or two orders) in response time. Keywords: multiclass classification, multi-prototype support vector machines, kernel machines, stochastic search optimization, large margin classifiers
5 0.21619505 54 jmlr-2005-Managing Diversity in Regression Ensembles
Author: Gavin Brown, Jeremy L. Wyatt, Peter Tiňo
Abstract: Ensembles are a widely used and effective technique in machine learning—their success is commonly attributed to the degree of disagreement, or ‘diversity’, within the ensemble. For ensembles where the individual estimators output crisp class labels, this ‘diversity’ is not well understood and remains an open research issue. For ensembles of regression estimators, the diversity can be exactly formulated in terms of the covariance between individual estimator outputs, and the optimum level is expressed in terms of a bias-variance-covariance trade-off. Despite this, most approaches to learning ensembles use heuristics to encourage the right degree of diversity. In this work we show how to explicitly control diversity through the error function. The first contribution of this paper is to show that by taking the combination mechanism for the ensemble into account we can derive an error function for each individual that balances ensemble diversity with individual accuracy. We show the relationship between this error function and an existing algorithm called negative correlation learning, which uses a heuristic penalty term added to the mean squared error function. It is demonstrated that these methods control the bias-variance-covariance trade-off systematically, and can be utilised with any estimator capable of minimising a quadratic error function, for example MLPs, or RBF networks. As a second contribution, we derive a strict upper bound on the coefficient of the penalty term, which holds for any estimator that can be cast in a generalised linear regression framework, with mild assumptions on the basis functions. Finally we present the results of an empirical study, showing significant improvements over simple ensemble learning, and finding that this technique is competitive with a variety of methods, including boosting, bagging, mixtures of experts, and Gaussian processes, on a number of tasks. Keywords: ensemble, diversity, regression estimators, neural networks, hessia
6 0.20285755 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
7 0.17956184 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
8 0.17536424 44 jmlr-2005-Learning Module Networks
9 0.17274651 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
10 0.15888692 36 jmlr-2005-Gaussian Processes for Ordinal Regression
11 0.15074684 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
12 0.14886777 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
13 0.14633046 3 jmlr-2005-A Classification Framework for Anomaly Detection
14 0.13735257 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
15 0.13200399 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
16 0.13014793 41 jmlr-2005-Kernel Methods for Measuring Independence
18 0.12919945 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
19 0.12377986 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
20 0.12196761 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
topicId topicWeight
[(13, 0.018), (17, 0.024), (19, 0.043), (36, 0.503), (37, 0.021), (42, 0.017), (43, 0.026), (52, 0.067), (59, 0.014), (70, 0.028), (82, 0.01), (88, 0.102), (90, 0.019), (94, 0.016)]
simIndex simValue paperId paperTitle
1 0.93776888 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
Author: Neil Lawrence
Abstract: Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be nonlinearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets. Keywords: Gaussian processes, latent variable models, principal component analysis, spectral methods, unsupervised learning, visualisation
same-paper 2 0.8906197 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon
Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension
3 0.47524133 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
4 0.45317683 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
5 0.44953281 44 jmlr-2005-Learning Module Networks
Author: Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, Nir Friedman
Abstract: Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper,1 we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules’ composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. 1. A preliminary version of this paper appeared in the Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003 (UAI ’03). c 2005 Eran Segal, Dana Pe’er, Aviv Regev, Daphne Koller and Nir Friedman. S EGAL , P E ’ ER , R EGEV, KOLLER AND F RIEDMAN
6 0.44465062 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
7 0.43946952 64 jmlr-2005-Semigroup Kernels on Measures
8 0.43443555 54 jmlr-2005-Managing Diversity in Regression Ensembles
9 0.42836598 48 jmlr-2005-Learning the Kernel Function via Regularization
10 0.42641526 36 jmlr-2005-Gaussian Processes for Ordinal Regression
11 0.41791233 39 jmlr-2005-Information Bottleneck for Gaussian Variables
12 0.41354075 32 jmlr-2005-Expectation Consistent Approximate Inference
13 0.40912366 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
14 0.39744511 25 jmlr-2005-Denoising Source Separation
15 0.39741793 41 jmlr-2005-Kernel Methods for Measuring Independence
16 0.39686254 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
17 0.39040253 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
18 0.38244584 3 jmlr-2005-A Classification Framework for Anomaly Detection
19 0.38184252 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
20 0.37992439 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints