jmlr jmlr2010 jmlr2010-69 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fabian Sinz, Matthias Bethge
Abstract: In this paper, we introduce a new family of probability densities called Lp-nested symmetric distributions. The common property, shared by all members of the new class, is the same functional form ρ(x) = ρ̃(f(x)), where f is a nested cascade of Lp-norms ||x||_p = (∑_i |x_i|^p)^{1/p}. Lp-nested symmetric distributions thereby are a special case of ν-spherical distributions, for which f is only required to be positively homogeneous of degree one. While both ν-spherical and Lp-nested symmetric distributions contain many widely used families of probability models, such as the Gaussian, spherically and elliptically symmetric distributions, Lp-spherically symmetric distributions, and certain types of independent component analysis (ICA) and independent subspace analysis (ISA) models, ν-spherical distributions are usually computationally intractable. Here we demonstrate that Lp-nested symmetric distributions are still computationally feasible by deriving an analytic expression for their normalization constant, gradients for maximum likelihood estimation, analytic expressions for certain types of marginals, as well as an exact and efficient sampling algorithm. We discuss the tight links of Lp-nested symmetric distributions to well-known machine learning methods such as ICA, ISA and mixed-norm regularizers, and introduce the nested radial factorization algorithm (NRF), a form of non-linear ICA that transforms any linearly mixed, non-factorial Lp-nested symmetric source into statistically independent signals. As a corollary, we also introduce the uniform distribution on the Lp-nested unit sphere. Keywords: parametric density model, symmetric distribution, ν-spherical distributions, non-linear independent component analysis, independent subspace analysis, robust Bayesian inference, mixed norm density model, uniform distributions on mixed norm spheres, nested radial factorization
Reference: text
After centering and whitening x := C^{-1/2}(s − E[s]), a Gaussian distribution is spherically symmetric, and the squared L2-norm ||x||_2^2 = x_1^2 + · · · + x_n^2 of the samples follows a χ²-distribution (that is, the radial distribution is a χ-distribution).
Elliptically contoured distributions other than the Gaussian are obtained by using a radial distribution different from the χ-distribution (Kelker, 1970; Fang et al.).
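As a quick numerical illustration of this radial factorization (our own sketch, not code from the paper), white Gaussian samples can be split into a χ-distributed radius and a direction that is uniform on the unit sphere:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, N = 5, 100_000
x = rng.standard_normal((N, n))   # centered and whitened by construction
r = np.linalg.norm(x, axis=1)     # radial component
u = x / r[:, None]                # directional component, uniform on the sphere

# Kolmogorov-Smirnov statistic against the chi distribution with n degrees
# of freedom; a value close to zero confirms the radial law.
print(stats.kstest(r, stats.chi(df=n).cdf).statistic)
```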
Here, we present a generalization of the class of Lp-spherically symmetric distributions within the class of ν-spherical distributions that makes weaker assumptions about the symmetries in the data but is still analytically tractable.
Due to this nested structure, we call this new class of distributions Lp-nested symmetric distributions.
While Lp-nested symmetric distributions are still invariant under reflections of the coordinate axes, permutation symmetry only holds within the subspaces of the Lp-norms at the bottom of the cascade.
With this Jacobian at hand, it is possible to compute the univariate radial distribution for an arbitrary Lp-nested symmetric density and to define the uniform distribution on the Lp-nested unit sphere L_ν = {x ∈ R^n | ν(x) = 1}.
Furthermore, we compute the surface area of the Lp-nested unit sphere and, therefore, the general normalization constant for Lp-nested symmetric distributions.
By deriving these general relations for the class of Lp-nested symmetric distributions, we have determined a new class of tractable ν-spherical distributions, which is so far the only one containing the Gaussian, elliptically contoured, and Lp-spherical distributions as special cases.
Osiewalski and Steel (1993) showed that the posterior on the location of an Lp-spherically symmetric distribution together with an improper Jeffreys prior on the scale does not depend on the particular type of Lp-spherically symmetric distribution used.
Since ρ(x) ∝ exp(−||x||_p^p) is an Lp-spherically symmetric model, regularizers that can be written in terms of a norm have a tight link to Lp-spherically symmetric distributions.
In an analogous way, Lp-nested symmetric distributions exhibit a tight link to mixed-norm regularizers, which have recently gained increasing interest in the machine learning community.
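As a sketch of this link (our formulation; λ and the loss ℓ are illustrative placeholders), penalized estimation with a mixed norm is MAP inference under an Lp-nested symmetric prior; the group-Lasso penalty, an L1-norm of inner L2-norms over index groups I, is one such Lp-nested function:

```latex
\hat{x} = \arg\min_x \; \ell(x) + \lambda f(x)
        = \arg\max_x \; \log p(y \mid x) + \log \rho(x),
\qquad \rho(x) \propto e^{-\lambda f(x)},
\qquad \text{e.g.}\quad f(x) = \sum_{I} \Big( \sum_{i \in I} x_i^2 \Big)^{1/2}.
```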
Finally, the only factorial Lp-spherically symmetric distribution is the p-generalized Normal (Sinz et al., 2009a).
Interestingly, Lp-spherically symmetric distributions other than the p-generalized Normal give rise to a non-linear ICA algorithm called radial Gaussianization for p = 2 (Lyu and Simoncelli, 2009) or radial factorization for arbitrary p (Sinz and Bethge, 2009).
As discussed below, Lp-nested symmetric distributions are a natural extension of the linear Lp-spherically symmetric ICA algorithm to ISA, and give rise to a more general non-linear ICA algorithm in the spirit of radial factorization.
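The core of such a radial factorization is a one-dimensional remapping of the radii. Below is a minimal sketch for p = 2 (radial Gaussianization); the function name is ours, and `source` is assumed to be a fitted scipy-style radial law exposing a `.cdf` method:

```python
import numpy as np
from scipy import stats

def radial_gaussianize(X, source):
    """Map spherically symmetric data X (N x n) to i.i.d. Gaussian data
    by pushing each radius through r -> F_chi^{-1}(F_source(r))."""
    n = X.shape[1]
    r = np.linalg.norm(X, axis=1)
    r_new = stats.chi(df=n).ppf(source.cdf(r))  # remapped radii
    return X * (r_new / r)[:, None]             # rescale each sample
```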
Using this expression, we define the uniform distribution on the Lp-nested unit sphere and the class of Lp-nested symmetric distributions for an arbitrary Lp-nested function in Section 3.
In Section 4 we derive an analytical form of Lp-nested symmetric distributions when marginalizing out lower levels of the Lp-nested cascade, and demonstrate that marginals of Lp-nested symmetric distributions are not necessarily Lp-nested symmetric.
Additionally, we demonstrate that the only factorial Lp-nested symmetric distribution is necessarily Lp-spherically symmetric, and discuss the implications of this result for mixed-norm regularizers.
Finally, we derive a non-linear ICA algorithm for linearly mixed, non-factorial Lp-nested symmetric sources in Section 9, which we call nested radial factorization (NRF).
In general, the n leaves of the tree correspond to the n coefficients of the vector x ∈ R^n, and each inner node computes the Lp-norm of its children using its specific p.
We call the class of functions generated in this way Lp-nested, and the corresponding distributions, which are symmetric or invariant with respect to such a function, Lp-nested symmetric distributions.
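As an illustration of this construction, an Lp-nested function can be evaluated by a short recursion over such a tree. The encoding below (a leaf is an integer index into x; an inner node is a (p, children) pair) is our own sketch, not the paper's toolbox:

```python
import numpy as np

def lp_nested(node, x):
    """Evaluate an Lp-nested function: each inner node takes the
    Lp-norm of its children with its own p; leaves return |x_i|."""
    if isinstance(node, int):
        return abs(x[node])
    p, children = node
    return sum(lp_nested(c, x) ** p for c in children) ** (1.0 / p)

# f(x) = (|x_0|^2 + (|x_1|^1.5 + |x_2|^1.5)^(2/1.5))^(1/2)
tree = (2.0, [0, (1.5, [1, 2])])
print(lp_nested(tree, np.array([3.0, -1.0, 2.0])))
```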
Figure (b) shows the same tree in multi-index notation, where the multi-index of a node describes the path from the root node to that node in the tree.
I = i_1, ..., i_m: a multi-index denoting a node in the tree; the single indices describe the path from the root node to the respective node I.
In this manner, an index is added for denoting the children of a particular node in the tree, and each multi-index denotes the path to the respective node in the tree.
The polar-like coordinates are x_i = r·u_i for i = 1, ..., n − 1, and x_n = r·Δ_n·u_n, where Δ_n = sgn x_n and u_n = |x_n| / f(x).
Note that u_n is not part of the coordinate representation, since normalization by 1/f(x) decreases the degrees of freedom of u by one; that is, u_n can always be computed from all other u_i by solving f(u) = f(x/f(x)) = 1 for u_n.
Apart from that, it can be used to compute the radial distribution for a given Lp-nested symmetric distribution.
The function g_I computes the value of the node I from all other leaves that are not part of the subtree under I, by fixing the value of the root node to one.
Following Fernandez et al. (1995), every ν-spherical, and hence any Lp-nested symmetric, density has the form ρ(x) = φ(f(x)) / (S_f(1) · f(x)^{n−1}) (Equation 4), where S_f is the surface area of L_f and φ is a density on R_+.
Inserting the surface area in Equation 4, we obtain the general form of an Lp-nested symmetric distribution for any given radial density φ.
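For instance (a standard special case, written here as a sketch in our notation): choosing φ(r) ∝ r^{n−1} e^{−r^p} in Equation 4 yields a density that depends on x only through e^{−f(x)^p}; for the flat tree f(x) = ||x||_p this recovers the p-generalized Normal:

```latex
\rho(x) \;=\; \frac{\varphi(f(x))}{S_f(1)\, f(x)^{\,n-1}},
\qquad
\varphi(r) \propto r^{\,n-1} e^{-r^{p}}
\;\Longrightarrow\;
\rho(x) \propto e^{-f(x)^{p}}.
```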
The results of Fernandez et al. (1995) imply that for any ν-spherically symmetric distribution, the radial part is independent of the directional part; that is, r is independent of u.
From Equation (9), we can see that its density function must be the inverse of the surface area of L_f times the radial density when transforming (4) into the coordinates of Definition 1 and separating r and u (the factor f(x)^{n−1} = r^{n−1} cancels due to the determinant of the Jacobian).
Example 5: As a second illustrative example, we consider the uniform density on the Lp-nested unit ball, that is, the set {x ∈ R^n | f(x) ≤ 1}, and derive its radial distribution φ.
After separating out the uniform distribution on the Lp-nested unit sphere (using the product-of-beta-functions form of V_f(1)), we obtain the radial distribution φ(r) = n·r^{n−1} for 0 < r ≤ 1, which is a β-distribution with parameters n and 1.
Since the radial and the uniform component on the Lp-nested unit sphere are statistically independent, we can get a sample from the uniform distribution on the Lp-nested unit sphere by simply normalizing the sample from this simpler distribution.
Afterwards, we can multiply it by a radius drawn from the radial distribution of the Lp-nested symmetric distribution that we actually want to sample from.
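For the flat (non-nested) Lp-spherical special case, this normalize-then-rescale recipe takes a few lines. The sketch below draws factorial p-generalized Normal samples (scipy's `gennorm`), whose normalized versions lie uniformly on the Lp unit sphere; the gamma-based radial law is only an illustrative choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n, N = 1.5, 4, 10_000

z = stats.gennorm(beta=p).rvs(size=(N, n), random_state=rng)  # pdf ∝ exp(-|z|^p)
u = z / np.linalg.norm(z, ord=p, axis=1, keepdims=True)       # uniform on the Lp sphere

r = stats.gamma(a=n / p).rvs(size=N, random_state=rng) ** (1.0 / p)  # example radial law
x = r[:, None] * u   # Lp-spherically symmetric samples with this radial law
```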
Marginals: In this section we discuss two types of marginals. First, we demonstrate that, in contrast to Lp-spherically symmetric distributions, marginals of Lp-nested symmetric distributions are not necessarily Lp-nested symmetric again.
Gupta and Song (1997) show that marginals of Lp-spherically symmetric distributions are again Lp-spherically symmetric.
Since any Lp-nested symmetric distribution in two dimensions must be Lp-spherically symmetric, it cannot be Lp-nested symmetric as well.
This is not surprising, given that the general form of marginals for Lp-spherically symmetric distributions involves an integral that cannot be solved analytically in general and is therefore not very useful in practice (Gupta and Song, 1997).
For that reason, we cannot expect marginals of Lp-nested symmetric distributions to have a simple form.
Since any two-dimensional Lp-nested symmetric distribution must be Lp-spherically symmetric, the marginals should be identical.
From the form of a general Lp-nested function and the corresponding symmetric distribution, one might think that the layer marginals are Lp-nested symmetric again.
Suppose we integrate out complete subtrees from the tree associated with f; that is, we transform subtrees into radial times uniform variables and integrate out the latter.
Let J be the set of multi-indices of those nodes that have become new leaves, that is, whose subtrees have been removed, and let n_J be the number of leaves (in the original tree) in the subtree under the node J.
We can use the Corollary to prove an interesting fact about Lp-nested symmetric distributions: the only factorial Lp-nested symmetric distribution must be Lp-spherically symmetric.
We provide an online toolbox for fitting Lp-spherically symmetric and Lp-nested symmetric distributions to data.
The simplest case is to fit the parameters of the radial distribution when the tree structure, the values of the p_I, and W are fixed.
Due to the special form of Lp-nested symmetric distributions (4), it then suffices to carry out maximum likelihood estimation on the radial component only, which renders maximum likelihood estimation efficient and robust.
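A sketch of this reduction (the function name and the gamma family are our illustrative choices): compute the radius of every data point under the fixed Lp-nested function f and fit a one-dimensional density to the radii.

```python
import numpy as np
from scipy import stats

def fit_radial(X, f):
    """Fit a radial density to r_i = f(x_i); tree, p values and W fixed."""
    r = np.array([f(x) for x in X])               # radii under the nested function
    a, loc, scale = stats.gamma.fit(r, floc=0.0)  # 1-D maximum likelihood fit
    return stats.gamma(a, loc=loc, scale=scale)
```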
For example, if two leaves are separated by the root node in the original full binary tree, there is no way to prune inner nodes such that the path between those two leaves no longer contains the root node.
The computational complexity of estimating all other parameters, apart from the tree structure, is difficult to assess in general, because it depends, for example, on the particular radial distribution used.
Since every inner node in an Lp-nested tree must have at least two children, the worst case is a full binary tree, which has 2n − 1 nodes and leaves.
A Gaussian scale mixture (GSM) model does not need to estimate another orthogonal rotation Q, because it belongs to the class of spherically symmetric distributions and is therefore invariant under transformations from SO(n) (Wainwright and Simoncelli, 2000).
Sampling from Lp-Nested Symmetric Distributions: In this section, we derive a sampling scheme for arbitrary Lp-nested symmetric distributions, which can, for example, be used for solving integrals when using Lp-nested symmetric distributions for Bayesian learning.
By multiplying those uniform samples with new samples from another radial distribution, one obtains samples from another Lp-nested symmetric distribution.
Due to Proposition 9, no such factorial Lp-nested symmetric distribution exists.
Instead, we choose to sample from the uniform distribution inside the Lp-nested unit ball, for which we already computed the radial distribution in Example 5.
The key is now not to transform the whole integral into radial and uniform coordinates at once, but to do so successively, moving upwards in the tree.
To solve the integral ∫_{x : f(x) ≤ 1, x ∈ R^n_+} dx, we first transform x_2 and x_3 into radial and uniform coordinates only.
As shown in Example 5, the radial distribution of the uniform distribution on the unit ball is β(n, 1), and, as just indicated by the example above, the intermediate results can be seen as transformed variables from a Dirichlet distribution.
Since the sampling procedure is like expanding the tree node by node, starting with the root, the number of inner nodes and leaves is the total number of samples that have to be drawn from Dirichlet distributions.
Algorithm 1 (exact sampling algorithm for Lp-nested symmetric distributions). Input: the radial distribution φ of an Lp-nested symmetric distribution ρ for the Lp-nested function f.
Sample a new radius v_0 from the radial distribution φ of the target distribution and obtain the sample via x = v_0 · u.
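Our reading of this scheme, reusing the tree encoding from the earlier sketch: at each inner node I with parameter p_I, the children's shares of value^{p_I} are Dirichlet-distributed with parameters n_i / p_I (n_i = number of leaves below child i), and leaves receive random signs. Treat this as an illustration, not the paper's reference implementation:

```python
import numpy as np

def n_leaves(node):
    return 1 if isinstance(node, int) else sum(n_leaves(c) for c in node[1])

def sample_sphere(node, value, out, rng):
    """Recursively split `value` over the tree via Dirichlet draws."""
    if isinstance(node, int):
        out[node] = rng.choice([-1.0, 1.0]) * value   # random sign at a leaf
        return
    p, children = node
    w = rng.dirichlet([n_leaves(c) / p for c in children])
    for c, wi in zip(children, w):
        sample_sphere(c, value * wi ** (1.0 / p), out, rng)

rng = np.random.default_rng(0)
tree = (2.0, [0, (1.5, [1, 2])])   # example tree with n = 3 leaves
x = np.empty(3)
r = rng.beta(3, 1)                 # radial law of the uniform ball: beta(n, 1)
sample_sphere(tree, r, x, rng)     # x ~ uniform inside the Lp-nested unit ball
```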
Proof: Given any Lp-nested symmetric distribution ρ(f(x)), the transformation into the polar-like coordinates yields the relation 1 = ∫ ρ(f(x)) dx = ∫∫ ∏_L G_L(u_L) · r^{n−1} ρ(r) dr du = ∫ ∏_L G_L(u_L) du · ∫ r^{n−1} ρ(r) dr.
Lp-nested symmetric distributions are a superclass of Lp-spherically symmetric distributions.
The only distribution in the intersection between spherically symmetric distributions and ICA models is the Gaussian.
It has been shown that the p-generalized Normal is the only factorial model in the class of Lp-spherically symmetric models (Sinz et al., 2009a) and, by Proposition 9, it is also the only factorial Lp-nested symmetric distribution.
Since any choice of radial distribution φ determines a particular Lp-spherically symmetric distribution, the idea is to explore the space between factorial and non-factorial models by using a very flexible density φ on the radius.
Since Lp-nested symmetric distributions do not provide such a factorial prior, Expectation Propagation is not directly applicable.
The idea is based on the observation that the choice of the radial distribution φ already determines the type of Lp-nested symmetric distribution.
This also means that, by changing the radial distribution via a remapping of the data, the distribution could possibly be turned into a factorial one.
Exactly the same method cannot work for Lp-nested symmetric distributions, since Proposition 9 states that there is no factorial distribution into which we could map the data by merely changing the radial distribution.
Instead, we have to remap the data in an iterative fashion, beginning with changing the radial distribution at the root node into the radial distribution of the Lp-nested ISA shown in Equation (11).
Figure 6: Lp-nested non-linear ICA for the tree of Example 6. For an arbitrary Lp-nested symmetric distribution, using Equation (12), the radial distribution can be remapped such that the children of the root node become independent.
The remaining subtrees are again Lp-nested symmetric and have a particular radial distribution that can be remapped into the same one that makes their root nodes' children independent.
Example 6: Consider the function f(y) = (|y_1|^{p_0} + (|y_2|^{p_{0,2}} + (|y_3|^{p_{2,2}} + |y_4|^{p_{2,2}})^{p_{0,2}/p_{2,2}})^{p_0/p_{0,2}})^{1/p_0} for y = Wx, where W has been estimated by fitting an Lp-nested symmetric distribution with a flexible radial distribution to Wx, as described in Section 5.
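In the recursive encoding from the earlier sketches, this f corresponds to the following tree (the concrete p values are placeholders, not estimates from the paper):

```python
p0, p02, p22 = 2.0, 1.5, 1.2                 # placeholder values for the three p's
tree = (p0, [0, (p02, [1, (p22, [2, 3])])])  # leaves 0..3 correspond to y1..y4
```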
Interestingly, the radial distributions of the root node's children are also γ_p, except that the shape parameter is p_{0,i}.
Since the radial remappings do not change the likelihood, the likelihood of the non-linearly separated data is the same as the likelihood of the data under the Lp-nested symmetric distribution that was fitted to it in the first place.
Let g_I(r) = (F^{−1}_{φ⊥⊥} ∘ F_{φ_s})(r) denote the radial transform at node I in Algorithm 2.
Using these results, we introduced the uniform distribution on the Lp-nested unit sphere and the general form of an Lp-nested symmetric distribution for arbitrary Lp-nested functions and radial distributions.
We also derived an expression for the joint distribution of inner nodes of an Lp-nested tree, as well as a sampling scheme for an arbitrary Lp-nested symmetric distribution.
Lp-nested symmetric distributions naturally provide the class of probability distributions corresponding to mixed-norm priors, allowing full Bayesian inference in the corresponding probabilistic models.
We showed that a robustness result for Bayesian inference of the location parameter, known for Lp-spherically symmetric distributions, carries over to the Lp-nested symmetric class.
Finally, we showed how Lp-nested symmetric distributions can be used to construct a non-linear ICA algorithm called nested radial factorization (NRF).
Now we can use Laplace's expansion of the determinant with respect to the last column.
This yields

|det J| = r^{n−1} ∑_{k=1}^{n} (−1)^{n+k} u_k det J_k
        = r^{n−1} ( ∑_{k=1}^{n−1} (−1)^{n+k} u_k det J_k + (−1)^{2n} ∂x_n/∂r )
        = r^{n−1} ( ∑_{k=1}^{n−1} (−1)^{n+k} u_k (−1)^{n−1−k} ∂u_n/∂u_k + u_n )
        = r^{n−1} ( u_n − ∑_{k=1}^{n−1} u_k ∂u_n/∂u_k ).
As in the example, the chain rule starts at the leaf u_n and ascends the tree until it reaches the lowest node whose subtree contains both u_n and u_q.
Proof: Both of the first equations are obvious, since only those nodes have a non-zero derivative for which the subtree actually depends on u_q.
Here, Δ_q = sgn u_q and |u_q|^p = (Δ_q u_q)^p.
As shown in Lemma 2, r^{n−1} |det J| has the form r^{n−1} ( u_n − ∑_k u_k ∂u_n/∂u_k ).
Remember that, by Lemma 13, the terms u_q ∂u_n/∂u_q for m ≤ q < n have the form u_q ∂u_n/∂u_q = −G_{L,ℓ_d}(u_{L,ℓ}) · F_{L,i_1}(u_{L,i_1}) · ...
Note that all terms G_{L,ℓ_d}(u_{L,ℓ})^{−1} u_k ∂u_n/∂u_k for m ≤ k < n now have the form G_{L,ℓ_d}(u_{L,ℓ})^{−1} u_k ∂u_n/∂u_k = −F_{L,i_1}(u_{L,i_1}) · ..., as in the derivatives G_{L,ℓ_d}(u_{L,ℓ})^{−1} u_q ∂u_n/∂u_q.
For the general induction step, suppose I is an inner node whose children are leaves or contracted leaves.
After integrating out ũ, assuming that the x_i are statistically independent, we obtain the density of v_i, which is equal to (16) if and only if p_i = p_0.
However, if p_0 and p_i are equal, the hierarchy of the Lp-nested function shrinks by one layer, since p_i and p_0 cancel each other.
The only factorial Lp-spherically symmetric distribution is the p-generalized Normal (Sinz et al., 2009a).
Therefore, the determinant of the Jacobian for each single step is the determinant of the radial transformation on the respective dimensions.
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 69 jmlr-2010-69 Lp-Nested Symmetric Distributions
2 0.10872168 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm
Author: Yael Ben-Haim, Elad Tom-Tov
Abstract: We propose a new algorithm for building decision tree classifiers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm’s accuracy. The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points to terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the overall tree accuracy. Keywords: decision tree classifiers, distributed computing, streaming data, scalability
3 0.10787717 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
Author: Vladimir Koltchinskii
Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient
4 0.073323734 36 jmlr-2010-Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity
Author: Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, Patrik O. Hoyer
Abstract: Analysis of causal effects between continuous-valued variables typically uses either autoregressive models or structural equation models with instantaneous effects. Estimation of Gaussian, linear structural equation models poses serious identifiability problems, which is why it was recently proposed to use non-Gaussian models. Here, we show how to combine the non-Gaussian instantaneous model with autoregressive models. This is effectively what is called a structural vector autoregression (SVAR) model, and thus our work contributes to the long-standing problem of how to estimate SVAR’s. We show that such a non-Gaussian model is identifiable without prior knowledge of network structure. We propose computationally efficient methods for estimating the model, as well as methods to assess the significance of the causal influences. The model is successfully applied on financial and brain imaging data. Keywords: structural vector autoregression, structural equation models, independent component analysis, non-Gaussianity, causality
5 0.064598463 15 jmlr-2010-Approximate Tree Kernels
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
6 0.058798324 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
7 0.056774892 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
8 0.056160741 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
9 0.055771548 22 jmlr-2010-Classification Using Geometric Level Sets
10 0.054892443 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
11 0.050270882 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
12 0.047298837 82 jmlr-2010-On Learning with Integral Operators
13 0.047053713 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
14 0.045920983 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
15 0.044435773 44 jmlr-2010-Graph Kernels
16 0.042860217 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
17 0.041796647 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
18 0.041309882 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
19 0.040868759 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
20 0.040580966 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models