nips nips2011 nips2011-152 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from a serious computational drawback that limits its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. [sent-13, score-0.106]
2 Kernel methods are broadly established as an easy way to construct nonlinear algorithms from linear ones, by embedding data points into reproducing kernel Hilbert spaces (RKHSs) [1, 14, 15]. [sent-17, score-0.406]
3 Over the last few years, generalization of these techniques to Banach spaces has gained interest. [sent-18, score-0.106]
4 This is because any two Hilbert spaces over a common scalar field with the same dimension are isometrically isomorphic while Banach spaces provide more variety in geometric structures and norms that are potentially useful for learning and approximation. [sent-19, score-0.212]
5 To sample the literature, classification in Banach spaces, and more generally in metric spaces, was studied in [3, 22, 11, 5]. [sent-20, score-0.128]
6 Minimizing a loss function subject to a regularization condition on a norm in a Banach space was studied by [3, 13, 24, 21] and online learning in Banach spaces was considered in [17]. [sent-21, score-0.148]
7 While all these works have focused on theoretical generalizations of Hilbert space methods to Banach spaces, the practical viability and inherent computational issues associated with the Banach space methods have so far not been highlighted. [sent-22, score-0.084]
8 The goal of this paper is to study the advantages/disadvantages of learning in Banach spaces in comparison to Hilbert space methods, in particular, from the point of view of embedding probability measures into these spaces. [sent-23, score-0.297]
9 The concept of embedding probability measures into RKHS [4, 6, 9, 16] provides a powerful and straightforward method to deal with high-order statistics of random variables. [sent-24, score-0.149]
10 An immediate application of this notion is to problems of comparing distributions based on finite samples: examples include tests of homogeneity [9], independence [10], and conditional independence [7]. [sent-25, score-0.072]
11 Formally, suppose we are given the set P(X) of all Borel probability measures defined on the topological space X, and the RKHS (H, k) of functions on X with k as its reproducing kernel (r.k.); P ∈ P(X) is embedded into H as P ↦ ∫_X k(·, x) dP(x). (1) [sent-26, score-0.326]
12 Given the embedding in (1), the RKHS distance between the embeddings of P and Q defines a pseudo-metric between P and Q as γk(P, Q) := ‖∫_X k(·, x) dP(x) − ∫_X k(·, x) dQ(x)‖_H. (2) [sent-30, score-0.236]
13 It is clear that when the embedding in (1) is injective, then P and Q can be distinguished based on their embeddings ∫_X k(·, x) dP(x) and ∫_X k(·, x) dQ(x). [sent-31, score-0.253]
14 [18] related RKHS embeddings to the problem of binary classification by showing that γk (P, Q) is the negative of the optimal risk associated with the Parzen window classifier in H. [sent-32, score-0.193]
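As a rough illustration of the Parzen window classifier referred to above, the following sketch classifies a point by comparing the empirical mean embeddings of the two samples evaluated at that point. It is not from the paper: the Gaussian kernel, the equal-prior decision rule, and all function names are assumptions made here for concreteness.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)): a bounded, pd reproducing kernel
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * sigma ** 2))

def parzen_window_classify(x, X, Y, kernel=gaussian_kernel):
    """Assign x to P (label +1) or Q (label -1) by comparing the empirical
    mean embeddings (1/m) sum_j k(x, X_j) and (1/n) sum_l k(x, Y_l)."""
    score_p = np.mean([kernel(x, xj) for xj in X])
    score_q = np.mean([kernel(x, yl) for yl in Y])
    return 1 if score_p >= score_q else -1
```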
15 Extending this classifier to Banach spaces and studying the benefits and issues associated with this generalization sheds light on those associated with more complex Banach space learning algorithms. [sent-33, score-0.084]
16 With this motivation, in this paper, we consider the generalization of the notion of RKHS embedding of probability measures to Banach spaces—in particular reproducing kernel Banach spaces (RKBSs) [24]—and then compare the properties of the RKBS embedding to its RKHS counterpart. [sent-34, score-0.584]
17 To derive RKHS based learning algorithms, it is essential to appeal to the Riesz representation theorem (as an RKHS is defined by the continuity of evaluation functionals), which establishes the existence of a reproducing kernel. [sent-35, score-0.153]
18 This theorem hinges on the fact that a notion of inner product can be defined on Hilbert spaces. [sent-36, score-0.107]
19 In this paper, as in [24], we deal with RKBSs that are uniformly Fréchet differentiable and uniformly convex (called s.i.p. RKBSs), [sent-37, score-0.156]
20 as many Hilbert space arguments—most importantly the Riesz representation theorem—can be carried over to such spaces through the notion of semi-inner-product (s.i.p.). [sent-40, score-0.177]
21 This builds on [24], who recently developed RKBS counterparts of RKHS-based algorithms like regularization networks, support vector machines, kernel principal component analysis, etc. [sent-45, score-0.078]
22 In Section 4, first, we derive an RKBS embedding of P into B′ as P ↦ ∫_X K(·, x) dP(x), (3) where B is an s.i.p. RKBS. [sent-51, score-0.114]
23 Based on (3), we define γK(P, Q) := ‖∫_X K(·, x) dP(x) − ∫_X K(·, x) dQ(x)‖_B′, a pseudo-metric on P(X), which we show to be the negative of the optimal risk associated with the Parzen window classifier in B′. [sent-58, score-0.071]
24 Second, we characterize the injectivity of (3) in Section 4.1, [sent-59, score-0.108]
25 wherein we show that the characterizations obtained for the injectivity of (3) are similar to those obtained for (1) and coincide with the latter when B is an RKHS. [sent-60, score-0.185]
26 We show that the consistency and the rate of convergence of the estimator depend on the Rademacher type of B′ . [sent-67, score-0.079]
27 On the other hand, one disadvantage of the RKBS framework is that γK(P, Q) cannot be computed in a closed form unlike γk (see Section 4.3). [sent-71, score-0.085]
28 Though this could seriously limit the practical impact of the RKBS embeddings, in Section 5, we show that closed form expressions for γK and its empirical estimator can be obtained for some non-trivial Banach spaces (see Examples 1–3). [sent-73, score-0.195]
29 Given the advantages of learning in Banach space over Hilbert space, this work therefore demonstrates the need for the development of efficient algorithms in Banach spaces in order to make the problem of learning in Banach spaces worthwhile compared to its Hilbert space counterpart. [sent-75, score-0.313]
30 For a locally compact Hausdorff space X, f ∈ C(X) is said to vanish at infinity if for every ε > 0 the set {x : |f(x)| ≥ ε} is compact. [sent-81, score-0.063]
31 For a Borel measure µ on X, Lp(X, µ) denotes the Banach space of p-power (p ≥ 1) µ-integrable functions. [sent-83, score-0.059]
32 An RKBS B on X is a reflexive Banach space of functions on X such that its topological dual B′ is isometric to a Banach space of functions on X and the point evaluations are continuous linear functionals on both B and B′ . [sent-91, score-0.168]
33 Theorem 2 in [24] shows that if B is an RKBS on X, then there exists a unique function K : X × X → C called the reproducing kernel (r.k.) of B. [sent-94, score-0.186]
34 Note that K satisfies K(x, y) = (K(x, ·), K(·, y))B and therefore K(·, x) and K(x, ·) are reproducing kernels for B and B′ respectively. [sent-97, score-0.125]
35 In order to have a substitute for inner products in the Banach space setting, [24] considered RKBSs B that are uniformly Fréchet differentiable and uniformly convex (referred to as s.i.p. RKBSs), [sent-110, score-0.21]
36 as this allows Hilbert space arguments to be carried over to B—most importantly, an analogue to the Riesz representation theorem holds (see Theorem 3)—through the notion of semi-inner-product (s.i.p.). [sent-113, score-0.133]
37 A Banach space B is said to be uniformly Fréchet differentiable if for all f, g ∈ B, lim_{t∈R, t→0} (‖f + tg‖_B − ‖f‖_B)/t exists and the limit is approached uniformly for f, g in the unit sphere of B. [sent-128, score-0.219]
38 B is called an s.i.p. space if it is both uniformly Fréchet differentiable and uniformly convex. [sent-133, score-0.198]
39 Note that uniform Fréchet differentiability and uniform convexity are properties of the norm associated with B. [sent-134, score-0.075]
40 [8, Theorem 6] provided the following analogue in B to the Riesz representation theorem of Hilbert spaces. [sent-155, score-0.062]
41 Let (X , A , µ) be a measure space and B := Lp (X , µ) for some p ∈ (1, +∞). [sent-179, score-0.059]
42 Then B is an s.i.p. space with dual B′ := Lq(X, µ), where q = p/(p−1). [sent-183, score-0.066]
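As a small numerical illustration of the s.i.p. structure on Lp, the sketch below checks the defining property [g, g] = ‖g‖²_p on a discrete measure. It assumes the semi-inner-product in question is the standard Giles one, [f, g] = ‖g‖_p^{2−p} ∫ f |g|^{p−1} sgn(g) dµ, which is not spelled out in this extraction; the weights, sample sizes, and names are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3.0
w = rng.uniform(0.1, 1.0, size=50)          # weights of a discrete measure mu
f, g = rng.normal(size=50), rng.normal(size=50)

def lp_norm(h):
    # ||h||_p with respect to the discrete measure mu given by the weights w
    return np.sum(w * np.abs(h) ** p) ** (1.0 / p)

def sip(f, g):
    # Assumed Giles semi-inner-product on L^p(mu):
    # [f, g] = ||g||_p^(2-p) * sum_x w(x) f(x) |g(x)|^(p-1) sgn(g(x))
    return lp_norm(g) ** (2.0 - p) * np.sum(w * f * np.abs(g) ** (p - 1) * np.sign(g))

print(sip(g, g), lp_norm(g) ** 2)                      # should coincide
print(abs(sip(f, g)) <= lp_norm(f) * lp_norm(g) + 1e-9)  # |[f, g]| <= ||f|| ||g|| holds
```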
43 For such spaces (see (a7)), Theorem 9 in [24] shows that if B is an s.i.p. RKBS, then there exists a unique s.i.p. [sent-196, score-0.106]
44 kernel G : X × X → C such that: (a9) G(x, ·) ∈ B for all x ∈ X, and K(·, x) = (G(x, ·))∗, x ∈ X; (a10) f(x) = [f, G(x, ·)]_B and f∗(x) = [K(x, ·), f]_B for all f ∈ B, x ∈ X. [sent-204, score-0.078]
45 Since s.i.p.s in general do not satisfy conjugate symmetry, G need not be Hermitian nor pd [24, Section 4]. [sent-209, score-0.113]
46 The r.k. K and the s.i.p. kernel G coincide when span{G(x, ·) : x ∈ X} is dense in B, which is the case when B is an RKHS [24, Theorems 2, 10 and 11]. [sent-216, score-0.12]
47 This means that when B is an RKHS, the conditions (a9) and (a10) reduce to the well-known reproducing properties of an RKHS, with the s.i.p. reducing to the usual inner product. [sent-217, score-0.108]
48 4 RKBS Embedding of Probability Measures In this section, we present our main contributions of deriving and analyzing the RKBS embedding of probability measures, which generalize the theory of RKHS embeddings. [sent-221, score-0.114]
49 First, we would like to remind the reader that the RKHS embedding in (1) can be derived by choosing F = {f : ‖f‖_H ≤ 1} in γF(P, Q) = sup_{f∈F} |∫_X f dP − ∫_X f dQ|. [sent-222, score-0.114]
50 Similar to the RKHS case, in Theorem 4, we show that the RKBS embeddings can be obtained by choosing F = {f : ‖f‖_B ≤ 1} in γF(P, Q). [sent-224, score-0.122]
51 Interestingly, though B does not have an inner product, it can be seen that the structure of the semi-inner-product is sufficient to generate an embedding similar to (1). [sent-225, score-0.147]
52 Let B be an s.i.p. RKBS with G as its s.i.p. kernel and K as the reproducing kernel, with both G and K being measurable. [sent-233, score-0.264]
53 Based on Theorem 4, it is clear that P can be seen as being embedded into B′ as P ↦ ∫_X K(·, x) dP(x), and γK(P, Q) is the distance between the embeddings of P and Q. [sent-236, score-0.122]
54 Therefore, we arrive at an embedding which looks similar to (1) and coincides with (1) when B is an RKHS. [sent-237, score-0.143]
55 Given these embeddings, two questions that need to be answered for these embeddings to be practically useful are: (⋆) When is the embedding injective? (⋆⋆) Can γK(P, Q) be estimated consistently from finite samples drawn from P and Q? [sent-238, score-0.271]
56 The significance of (⋆) is that if (3) is injective, then such an embedding can be used to differentiate between different P and Q, which can then be used in applications like two-sample tests to differentiate between P and Q based on samples drawn i.i.d. from them. [sent-243, score-0.131]
57 Following [18], it can be shown that γK is the negative of the optimal risk associated with a Parzen window classifier in B′ that separates the class-conditional distributions P and Q (refer to the supplementary material for details). [sent-249, score-0.115]
58 Therefore, the injectivity of (3) is of prime importance in applications. [sent-253, score-0.108]
59 In addition, the question in (⋆⋆) is critical as well, as it relates to the consistency of the Parzen window classifier. [sent-254, score-0.084]
60 The following result provides various characterizations for the injectivity of (3), which are similar (but more general) to those obtained for the injectivity of (1) and coincide with the latter when B is an RKHS. [sent-257, score-0.26]
61 Let B be an s.i.p. RKBS defined on a topological space X with K and G as its r.k. and s.i.p. kernel, respectively. [sent-262, score-0.084]
62 Then (3) is injective if B is dense in Lp (X , µ) for any Borel probability measure µ on X and some p ∈ [1, ∞). [sent-272, score-0.132]
63 Since it is not easy to check for the denseness of B in C0 (X ) or Lp (X , µ), in Theorem 6, we present an easily checkable characterization for the injectivity of (3) when K is bounded continuous and translation invariant on Rd . [sent-273, score-0.154]
64 Note that Theorem 6 generalizes the characterization (see [19, 20]) for the injectivity of RKHS embedding (in (1)). [sent-274, score-0.222]
65 If ψ in Theorem 6 is a real-valued pd function, then by Bochner’s theorem, Λ has to be real, nonnegative and symmetric. [sent-281, score-0.119]
66 Since ψ need not be a pd function for K to be a real, symmetric r.k. [sent-284, score-0.14]
67 An example of a translation invariant, real and symmetric (but not pd) r.k. is given in Example 3. [sent-290, score-0.075]
68 Theorem 9 generalizes the consistency result in [9] by showing that γK(Pm, Qn) is a consistent estimator of γK(P, Q) and the rate of convergence is O(m^{(1−t)/t} + n^{(1−t)/t}) if B′ is of type t, 1 < t ≤ 2. [sent-302, score-0.079]
69 A Banach space B is said to be of Rademacher type t (or, more shortly, of type t) if there exists a constant C∗ such that for any N ≥ 1 and any {fj}_{j=1}^{N} ⊂ B: (E‖∑_{j=1}^{N} ϱj fj‖_B^t)^{1/t} ≤ C∗ (∑_{j=1}^{N} ‖fj‖_B^t)^{1/t}, where {ϱj}_{j=1}^{N} are i.i.d. Rademacher random variables. [sent-307, score-0.124]
70 Since having type t′ for t′ > t implies having type t, let us define t∗ (B) := sup{t : B has type t}. [sent-312, score-0.069]
71 This means γK(P, Q) is not representable in terms of the kernel function K(x, y), unlike in the case of B being an RKHS, in which case the s.i.p. reduces to an inner product and γk has a closed form in terms of k. [sent-341, score-0.1]
72 The s.i.p. structure does not in general allow closed form representations in terms of the kernel K (also see [24], where regularization algorithms derived in an RKBS are not solvable in closed form, unlike in an RKHS), and this could therefore limit its practical viability. [sent-349, score-0.18]
73 However, we show in Section 5 that there are non-trivial s.i.p. RKBSs for which γK(P, Q) and γK(Pm, Qn) can be obtained in closed forms. [sent-353, score-0.063]
74 5 Concrete Examples of RKBS Embeddings In this section, we present examples of RKBSs and then derive the corresponding γK (P, Q) and γK (Pm , Qn ) in closed forms. [sent-354, score-0.089]
75 Example 2 presents an RKBS whose r.k. is not symmetric and therefore not pd, and Example 3 an RKBS whose r.k. is symmetric but not pd. [sent-361, score-0.157]
76 These examples show that the Banach space embeddings result in richer metrics on P(X ) than those obtained through RKHS embeddings. [sent-364, score-0.223]
77 Then for any 1 < p < ∞ with q = p/(p−1), Bpd(Rd) := {fu(x) = ∫_{Rd} u(t) e^{i⟨x,t⟩} dµ(t) : u ∈ Lp(Rd, µ), x ∈ Rd} (7) is an RKBS with K(x, y) = G(x, y) = (µ(Rd))^{(p−2)/p} ∫_{Rd} e^{−i⟨x−y,t⟩} dµ(t) as the r.k., and γK(P, Q) = ‖∫_{Rd} e^{i⟨x,·⟩} d(P − Q)(x)‖_{Lq(Rd,µ)}. (8) [sent-367, score-0.067]
78 First note that K is a translation invariant pd kernel on Rd as it is the Fourier transform of a nonnegative finite Borel measure µ, which follows from Bochner’s theorem. [sent-370, score-0.243]
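To make Example 1 concrete, here is a small numerical check that, for one particular spectral measure, the kernel K(x, y) = ∫ e^{−i⟨x−y,t⟩} dµ(t) is the familiar Gaussian kernel, as Bochner’s theorem suggests. This is not from the paper: the choice µ = N(0, σ²I_d), the sample sizes, and the variable names are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_spectral = 3, 1.0, 200_000

x, y = rng.normal(size=d), rng.normal(size=d)

# Spectral measure mu = N(0, sigma^2 I_d); since mu(R^d) = 1, the prefactor
# (mu(R^d))^((p-2)/p) from Example 1 equals 1.
t = rng.normal(scale=sigma, size=(n_spectral, d))

# Monte Carlo estimate of K(x, y) = E_{t ~ mu}[exp(-i <x - y, t>)]
K_mc = np.mean(np.exp(-1j * (t @ (x - y)))).real

# Closed form: the characteristic function of N(0, sigma^2 I_d) at (x - y)
K_exact = np.exp(-sigma**2 * np.sum((x - y) ** 2) / 2.0)

print(K_mc, K_exact)  # the two values should agree up to Monte Carlo error
```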
79 Though the r.k. of an RKBS need not be symmetric, the space in (7) is an interesting example of an RKBS that is induced by a pd kernel. [sent-376, score-0.133]
80 Refer to the supplementary material for an interpretation of Bpd (Rd ) as a generalization of Sobolev space [23, Chapter 10]. [sent-381, score-0.086]
81 Note that K is not symmetric (for q ≠ 2) and therefore is not pd. [sent-393, score-0.066]
82 When p = q = 2, K(x, y) = Mµ (x + y) is pd and Bns (Rd ) is an RKHS. [sent-394, score-0.091]
83 Clearly, ψ and therefore K are not pd (though symmetric on R), as the Fourier transform of ψ, which is proportional to a Gaussian factor times (x^6 − 39x^4 + 216x^2 − 324), is not nonnegative at every x ∈ R. [sent-400, score-0.185]
84 In these examples, we derived the Banach space embeddings in closed form and studied the conditions under which they are injective. [sent-408, score-0.227]
85 These examples also show that the RKBS embeddings result in richer distance measures on probabilities compared to those obtained by the RKHS embeddings—an advantage gained by moving from Hilbert to Banach spaces. [sent-409, score-0.216]
86 Now, we consider the problem of computing γK (Pm , Qn ) in closed form and its consistency. [sent-410, score-0.063]
87 In Section 4.3, we showed that γK(Pm, Qn) does not have a nice closed form expression, unlike in the case of B being an RKHS. [sent-412, score-0.085]
88 However, in the following, we show that for K in Examples 1–3, γK (Pm , Qn ) has a closed form expression for certain choices of q. [sent-413, score-0.063]
89 Note that (9) cannot be computed in a closed form for all q—see the discussion in the supplementary material about approximating γK (Pm , Qn ). [sent-419, score-0.107]
90 However, when q = 2, (9) can be computed very efficiently in closed form (in terms of K) as a V-statistic [9], given by γK²(Pm, Qn) = (1/m²) ∑_{j,l=1}^{m} K(Xj, Xl) + (1/n²) ∑_{j,l=1}^{n} K(Yj, Yl) − (2/(mn)) ∑_{j=1}^{m} ∑_{l=1}^{n} K(Xj, Yl). [sent-420, score-0.063]
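The displayed V-statistic translates directly into code; the sketch below is a minimal illustration (the Gaussian kernel choice and all names are assumptions made here, not from the paper), with the O(m² + n² + mn) cost implied by the double sums. In a two-sample test, this statistic would be compared against its distribution under random relabelings of the pooled sample.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # K[j, l] = exp(-||A_j - B_l||^2 / (2 * sigma^2)) for rows of A (m, d) and B (n, d)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def gamma_K_vstat(X, Y, sigma=1.0):
    """V-statistic estimate of gamma_K(Pm, Qn) for q = 2:
    gamma^2 = (1/m^2) sum K(Xj, Xl) + (1/n^2) sum K(Yj, Yl) - (2/(mn)) sum K(Xj, Yl)."""
    m, n = len(X), len(Y)
    g2 = (gaussian_gram(X, X, sigma).sum() / m**2
          + gaussian_gram(Y, Y, sigma).sum() / n**2
          - 2.0 * gaussian_gram(X, Y, sigma).sum() / (m * n))
    return float(np.sqrt(max(g2, 0.0)))
```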
91 Specifically, γK^q(Pm, Qn) = ∫_X ⋯ ∫_X A(x1, . . ., xq) ∏_{j=1}^{q} d(Pm − Qn)(xj), where A(x1, . . ., xq) := ∫_X ∏_{j=1}^{q/2} b(x_{2j−1}, t) b(x_{2j}, t) dµ(t), (11) for which closed form computation is possible for appropriate choices of b and µ. [sent-424, score-0.063]
92 By appropriately choosing θ and µ in Example 3, we can obtain a closed form expression for A(x1, . . ., xq). [sent-433, score-0.063]
93 (11) shows that γK(Pm, Qn) can be computed in a closed form in terms of A at a complexity of O(m^q), assuming m = n, which means the least complexity is obtained for q = 2. [sent-438, score-0.063]
94 , q ∈ (2, ∞), the RKBS embeddings in Examples 1–3 are useful in practice as γK (Pm , Qn ) is consistent and has a closed form expression. [sent-441, score-0.185]
95 While we showed that most of the results known for RKHSs, like injectivity of the embedding, consistency of the Parzen window classifier, etc., [sent-444, score-0.192]
96 nicely generalize to RKBSs, yielding richer distance measures on probabilities, the generalized notion is less attractive in practice compared to its RKHS counterpart because of the computational disadvantage associated with it. [sent-445, score-0.097]
97 This therefore raises the important open problem of developing computationally efficient Banach space based learning algorithms. [sent-447, score-0.059]
98 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-501, score-0.186]
99 Kernel choice and classifiability for RKHS embeddings of probability distributions. [sent-612, score-0.122]
100 Hilbert space embeddings and metrics on probability measures. [sent-645, score-0.164]
wordName wordTfidf (topN-words)
[('rkbs', 0.596), ('banach', 0.382), ('rkhs', 0.281), ('qn', 0.253), ('pm', 0.212), ('rkbss', 0.156), ('hilbert', 0.151), ('rd', 0.143), ('embeddings', 0.122), ('parzen', 0.115), ('embedding', 0.114), ('reproducing', 0.108), ('injectivity', 0.108), ('spaces', 0.106), ('injective', 0.097), ('pd', 0.091), ('riesz', 0.085), ('lq', 0.079), ('fukumizu', 0.078), ('kernel', 0.078), ('dp', 0.077), ('chet', 0.075), ('borel', 0.068), ('dq', 0.068), ('closed', 0.063), ('fr', 0.061), ('gretton', 0.058), ('bpd', 0.057), ('sch', 0.055), ('window', 0.054), ('symmetric', 0.049), ('lp', 0.048), ('theorem', 0.045), ('sriperumbudur', 0.043), ('topological', 0.042), ('space', 0.042), ('yj', 0.04), ('supp', 0.036), ('classi', 0.036), ('ei', 0.036), ('measures', 0.035), ('fourier', 0.034), ('richer', 0.033), ('wherein', 0.033), ('xj', 0.033), ('inner', 0.033), ('mq', 0.032), ('lkopf', 0.031), ('fu', 0.031), ('er', 0.03), ('consistency', 0.03), ('uniformly', 0.03), ('notion', 0.029), ('mp', 0.029), ('rkhss', 0.029), ('coincides', 0.029), ('bharath', 0.028), ('bns', 0.028), ('bochner', 0.028), ('nonnegative', 0.028), ('supplementary', 0.027), ('xq', 0.026), ('estimator', 0.026), ('translation', 0.026), ('examples', 0.026), ('lanckriet', 0.025), ('tg', 0.025), ('dual', 0.024), ('coincide', 0.024), ('generalizing', 0.024), ('drawback', 0.024), ('limt', 0.023), ('type', 0.023), ('unlike', 0.022), ('platt', 0.022), ('metric', 0.022), ('conjugate', 0.022), ('gert', 0.022), ('rademacher', 0.022), ('differentiable', 0.021), ('said', 0.021), ('suppose', 0.021), ('yl', 0.02), ('invariant', 0.02), ('characterizations', 0.02), ('fj', 0.019), ('answered', 0.018), ('functionals', 0.018), ('attainable', 0.018), ('editors', 0.018), ('dense', 0.018), ('uk', 0.018), ('distinguished', 0.017), ('analogue', 0.017), ('risk', 0.017), ('tests', 0.017), ('measure', 0.017), ('material', 0.017), ('questions', 0.017), ('therefore', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
2 0.13351054 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions.
3 0.12420009 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16].
4 0.10308708 202 nips-2011-On the Universality of Online Mirror Descent
Author: Nati Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee.
5 0.08869049 288 nips-2011-Thinning Measurement Models and Questionnaire Design
Author: Ricardo Silva
Abstract: Inferring key unobservable features of individuals is an important task in the applied sciences. In particular, an important source of data in fields such as marketing, social sciences and medicine is questionnaires: answers in such questionnaires are noisy measures of target unobserved features. While comprehensive surveys help to better estimate the latent variables of interest, aiming at a high number of questions comes at a price: refusal to participate in surveys can go up, as well as the rate of missing data; quality of answers can decline; costs associated with applying such questionnaires can also increase. In this paper, we cast the problem of refining existing models for questionnaire data as follows: solve a constrained optimization problem of preserving the maximum amount of information found in a latent variable model using only a subset of existing questions. The goal is to find an optimal subset of a given size. For that, we first define an information theoretical measure for quantifying the quality of a reduced questionnaire. Three different approximate inference methods are introduced to solve this problem. Comparisons against a simple but powerful heuristic are presented.
6 0.078633457 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
7 0.078475088 139 nips-2011-Kernel Bayes' Rule
8 0.076738179 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
9 0.066102378 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
10 0.065213069 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
11 0.062042728 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
12 0.050716281 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
13 0.050587524 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
14 0.048383564 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements
15 0.047778159 22 nips-2011-Active Ranking using Pairwise Comparisons
16 0.046971813 263 nips-2011-Sparse Manifold Clustering and Embedding
17 0.045449164 198 nips-2011-On U-processes and clustering performance
18 0.044932939 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
19 0.044302151 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
20 0.042527575 171 nips-2011-Metric Learning with Multiple Kernels
topicId topicWeight
[(0, 0.112), (1, -0.009), (2, -0.039), (3, -0.077), (4, -0.044), (5, 0.019), (6, 0.034), (7, -0.066), (8, 0.02), (9, 0.004), (10, -0.124), (11, 0.123), (12, 0.062), (13, 0.064), (14, -0.01), (15, 0.009), (16, 0.001), (17, 0.061), (18, -0.072), (19, 0.144), (20, -0.012), (21, -0.128), (22, -0.047), (23, 0.047), (24, -0.021), (25, 0.073), (26, 0.026), (27, -0.052), (28, -0.086), (29, 0.029), (30, 0.054), (31, -0.016), (32, -0.042), (33, -0.078), (34, 0.006), (35, 0.026), (36, 0.059), (37, -0.046), (38, -0.043), (39, -0.03), (40, 0.038), (41, -0.049), (42, 0.072), (43, 0.024), (44, -0.121), (45, 0.086), (46, 0.044), (47, 0.088), (48, -0.087), (49, -0.079)]
simIndex simValue paperId paperTitle
same-paper 1 0.94226718 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
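As a point of reference for the Hilbert-space side of this comparison, here is a minimal sketch of the empirical RKHS distance between the mean embeddings of two samples (the standard biased MMD estimate) with a Gaussian kernel. The kernel choice, the bandwidth, and the synthetic samples below are illustrative assumptions rather than anything specified in the paper.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix with entries k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def empirical_gamma_k(X, Y, sigma=1.0):
    """Biased empirical estimate of the RKHS distance between the mean
    embeddings of the samples X ~ P and Y ~ Q."""
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    return np.sqrt(Kxx.mean() + Kyy.mean() - 2 * Kxy.mean())

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 2))   # sample from a shifted Q
print(empirical_gamma_k(X, Y, sigma=1.0))
```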
2 0.66361707 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
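The regularized least squares fit studied in this abstract admits a short kernel-ridge-style implementation. The sketch below solves (K + λnI)α = y with a Gaussian RBF Gram matrix; the regularization scaling, the bandwidth, and the toy data are assumptions made for illustration, and the data-dependent parameter selection discussed above is omitted.

```python
import numpy as np

def gaussian_gram(A, B, sigma=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def fit_ls_svm(X, y, lam=1e-2, sigma=0.5):
    """Solve (K + lam * n * I) alpha = y for the regularized least squares fit."""
    n = len(X)
    K = gaussian_gram(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(X_train, alpha, X_test, sigma=0.5):
    return gaussian_gram(X_test, X_train, sigma) @ alpha

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(150, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=150)
alpha = fit_ls_svm(X, y)
X_test = np.linspace(-1, 1, 5)[:, None]
print(predict(X, alpha, X_test))
```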
3 0.64448971 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
4 0.52275836 139 nips-2011-Kernel Bayes' Rule
Author: Kenji Fukumizu, Le Song, Arthur Gretton
Abstract: A nonparametric kernel-based method for realizing Bayes' rule is proposed, based on kernel representations of probabilities in reproducing kernel Hilbert spaces. The prior and conditional probabilities are expressed as empirical kernel mean and covariance operators, respectively, and the kernel mean of the posterior distribution is computed in the form of a weighted sample. The kernel Bayes' rule can be applied to a wide variety of Bayesian inference problems: we demonstrate Bayesian computation without likelihood, and filtering with a nonparametric state-space model. A consistency rate for the posterior estimate is established.
5 0.4869642 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
Author: Taiji Suzuki
Abstract: In this paper, we give a new generalization error bound of Multiple Kernel Learning (MKL) for a general class of regularizations. Our main target in this paper is dense-type regularizations, including ℓp-MKL, which imposes ℓp-mixed-norm regularization instead of ℓ1-mixed-norm regularization. According to recent numerical experiments, sparse regularization does not necessarily show good performance compared with dense-type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. As a by-product of our general result, we show a fast learning rate of ℓp-MKL that is the tightest among existing bounds. We also show that our general learning rate achieves the minimax lower bound. Finally, we show that, when the complexities of candidate reproducing kernel Hilbert spaces are inhomogeneous, dense-type regularization shows a better learning rate compared with sparse ℓ1 regularization.
6 0.4775756 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
7 0.46279114 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
8 0.42515129 59 nips-2011-Composite Multiclass Losses
9 0.42186955 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
10 0.41417563 202 nips-2011-On the Universality of Online Mirror Descent
11 0.40513071 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
12 0.40488344 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
13 0.33896521 288 nips-2011-Thinning Measurement Models and Questionnaire Design
14 0.32753974 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
15 0.32538837 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems
16 0.32192406 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
17 0.32154167 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements
18 0.31912541 171 nips-2011-Metric Learning with Multiple Kernels
19 0.30669862 256 nips-2011-Solving Decision Problems with Limited Information
20 0.30484906 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
topicId topicWeight
[(0, 0.098), (4, 0.075), (7, 0.277), (20, 0.04), (26, 0.028), (31, 0.076), (33, 0.011), (43, 0.099), (45, 0.07), (48, 0.014), (57, 0.018), (65, 0.012), (74, 0.029), (83, 0.024), (99, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.78082687 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
2 0.6056965 86 nips-2011-Empirical models of spiking in neural populations
Author: Jakob H. Macke, Lars Buesing, John P. Cunningham, Byron M. Yu, Krishna V. Shenoy, Maneesh Sahani
Abstract: Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multi-electrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-of-fit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observation models are derived from either Gaussian or point-process assumptions, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts.
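The latent-dynamics view described in this abstract can be illustrated with a small generative simulation: a smooth low-dimensional process drives the log-rates of all neurons, producing correlated Poisson spike counts without any direct coupling terms. All sizes, dynamics, and parameters below are hypothetical choices, not those fit in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sizes: a small population observed over many time bins.
n_neurons, n_bins, latent_dim = 30, 500, 2

# Smooth low-dimensional latent trajectory (a linear-Gaussian AR(1) process).
A = 0.95 * np.eye(latent_dim)                 # slow dynamics
x = np.zeros((n_bins, latent_dim))
for t in range(1, n_bins):
    x[t] = A @ x[t - 1] + 0.3 * rng.normal(size=latent_dim)

# Loading matrix and baseline log-rates map the latent state to each neuron.
C = rng.normal(scale=0.5, size=(n_neurons, latent_dim))
d = np.log(0.05) * np.ones(n_neurons)         # baseline of roughly 0.05 spikes/bin

rates = np.exp(x @ C.T + d)                   # shared variability enters every cell
spikes = rng.poisson(rates)                   # point-process (Poisson) observations

# The shared latent input induces correlations between neurons even though
# there is no direct coupling term between them.
corr = np.corrcoef(spikes.T)
print("mean pairwise correlation:", corr[np.triu_indices(n_neurons, k=1)].mean())
```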
3 0.55557668 32 nips-2011-An Empirical Evaluation of Thompson Sampling
Author: Olivier Chapelle, Lihong Li
Abstract: Thompson sampling is one of the oldest heuristics for addressing the exploration/exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. Since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against.
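Since the abstract stresses how easy Thompson sampling is to implement, here is a minimal Beta-Bernoulli sketch for a simulated multi-armed bandit; the arm payoff probabilities, the horizon, and the uniform priors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

true_ctr = np.array([0.04, 0.05, 0.07])   # hypothetical arm payoff probabilities
n_arms = len(true_ctr)
alpha = np.ones(n_arms)                    # Beta(1, 1) prior on each arm
beta = np.ones(n_arms)

total_reward = 0
for t in range(10_000):
    # Sample a plausible payoff probability for each arm from its posterior,
    # then play the arm whose sample is largest.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))
    reward = rng.random() < true_ctr[arm]
    # Conjugate Beta-Bernoulli update of the chosen arm's posterior.
    alpha[arm] += reward
    beta[arm] += 1 - reward
    total_reward += reward

print("pulls per arm:", (alpha + beta - 2).astype(int))
print("total reward :", int(total_reward))
```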
4 0.53110766 250 nips-2011-Shallow vs. Deep Sum-Product Networks
Author: Olivier Delalleau, Yoshua Bengio
Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by the no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate) for a large class of learning tasks and data distributions, such as those related to Artificial Intelligence (AI) tasks [4]. Different families of functions have different characteristics that can be appropriate or not depending on the learning task of interest. One of the characteristics that has spurred much interest and research in recent years is depth of the architecture. In the case of a multi-layer neural network, depth corresponds to the number of (hidden and output) layers. A fixedkernel Support Vector Machine is considered to have depth 2 [4] and boosted decision trees to have depth 3 [7]. Here we use the word circuit or network to talk about a directed acyclic graph, where each node is associated with some output value which can be computed based on the values associated with its predecessor nodes. The arguments of the learned function are set at the input nodes of the circuit (which have no predecessor) and the outputs of the function are read off the output nodes of the circuit. Different families of functions correspond to different circuits and allowed choices of computations in each node. Learning can be performed by changing the computation associated with a node, or rewiring the circuit (possibly changing the number of nodes). The depth of the circuit is the length of the longest path in the graph from an input node to an output node. Deep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater than depth 2. They are based on the idea of multiple levels of representation, with the intuition that the raw input can be represented at different levels of abstraction, with more abstract features of the input or more abstract explanatory factors represented by deeper circuits. These algorithms are often based on unsupervised learning, opening the door to semi-supervised learning and efficient 1 use of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in particular the visual cortex) [31] and similarities between features learned with some Deep Learning algorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep architectures. 
It has been suggested that deep architectures are more powerful in the sense of being able to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure “efficiency” in terms of the number of computational units in the network. An efficient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a fixed amount of training samples and computational power, better generalization is expected. The first successful algorithms for training deep architectures appeared in 2006, with efficient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classification [19, 10], natural language processing [9, 36, 21, 32], collaborative filtering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artificial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we define “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we define the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to five. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efficiently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efficiently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depthk circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). 
In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difficulty efficiently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms. For instance, H˚ stad and Goldmann [12] extended some results to the case of networks of linear threshold units a with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuit in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efficiently by deep sumproduct networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efficient than a shallow one in the context of real-valued polynomials. Note that we do not address in this paper the problem of learning these parameters: even if an efficient deep representation exists for the function we seek to approximate, in general there is no 2 guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We first formally define sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efficient than the deep representation. 2 Sum-product networks Definition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive). Here, we restrict our definition of the generic term “sum-product network” to networks whose summation units have positive incoming weights1 , while others are called “negative-weight” networks. Definition 2. A “negative-weight“ sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero). Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper. Definition 3. A “shallow“ sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two). Definition 4. A “deep“ sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three). 
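To make the alternating sum/product construction concrete, the sketch below evaluates a network of binary units that alternates product and sum layers, with all summation weights fixed to 1 and adjacent units paired at every layer; these are illustrative simplifications, not the general parameterization defined in the text. With four inputs this reduces to the network of Figure 1, computing x1*x2 + x3*x4.

```python
import numpy as np

def deep_alternating_network(x):
    """Evaluate an alternating product/sum network on n = 4**i inputs,
    pairing adjacent units and using unit summation weights throughout."""
    layer = list(x)
    k = 0
    while len(layer) > 1:
        if k % 2 == 0:   # product layer: multiply adjacent pairs
            layer = [layer[2 * j] * layer[2 * j + 1] for j in range(len(layer) // 2)]
        else:            # sum layer with (positive) unit weights
            layer = [layer[2 * j] + layer[2 * j + 1] for j in range(len(layer) // 2)]
        k += 1
    return layer[0]

x = np.array([1.0, 2.0, 3.0, 4.0])
assert deep_alternating_network(x) == x[0] * x[1] + x[2] * x[3]   # 2 + 12 = 14
print(deep_alternating_network(x))
```

Note that this deep evaluation uses n - 1 binary units, whereas expanding the same polynomial as a flat sum of products grows much faster with n, which is the contrast the following sections quantify.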
The family F 3 3.1 Definition The first family of functions we study, denoted by F, is made of functions built from deep sumproduct networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows. + ℓ2 = λ11ℓ1 + µ11ℓ1 = x1x2 + x3x4 = f (x1, x2, x3, x4) 2 1 1 λ11 = 1 µ11 = 1 × ℓ1 = x1x2 1 x1 x2 × ℓ1 = x3x4 2 x3 x4 Figure 1: Sum-product network computing the function f ∈ F such that i = λ11 = µ11 = 1. Let n = 4i , with i a positive integer value. Denote by ℓ0 the input layer containing scalar variables {x1 , . . . , xn }, such that ℓ0 = xj for 1 ≤ j ≤ n. Now define f ∈ F as any function computed by a j sum-product network (deep for i ≥ 2) composed of alternating product and sum layers: • ℓ2k+1 = ℓ2k · ℓ2k for 0 ≤ k ≤ i − 1 and 1 ≤ j ≤ 22(i−k)−1 2j−1 2j j • ℓ2k = λjk ℓ2k−1 + µjk ℓ2k−1 for 1 ≤ k ≤ i and 1 ≤ j ≤ 22(i−k) j 2j 2j−1 where the weights λjk and µjk of the summation units are strictly positive. The output of the network is given by f (x1 , . . . , xn ) = ℓ2i ∈ R, the unique unit in the last layer. 1 The corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1 1 This condition is required by some of the proofs presented here. 3 (this architecture is also the basic building block of bigger networks for i > 1). Note that both the input size n = 4i and the network’s depth 2i increase with parameter i. 3.2 Theoretical results The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function f ∈ F. The high-level proof sketch consists in the following steps: (1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and Proposition 1). (2) Show that the only possible architecture for a shallow sum-product network to compute f is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5). (3) Conclude that the number of hidden units must be at least the number of unique products computed in step 3.2 (Lemma 6 and Corollary 1). Lemma 1. Any element ℓk can be written as a (positively) weighted sum of products of input varij ables, such that each input variable xt is used in exactly one unit of ℓk . Moreover, the number mk of products found in the sum computed by ℓk does not depend on j and obeys the following recurrence j rule for k ≥ 0: if k + 1 is odd, then mk+1 = m2 , otherwise mk+1 = 2mk . k Proof. We prove the lemma by induction on k. It is obviously true for k = 0 since ℓ0 = xj . j Assuming this is true for some k ≥ 0, we consider two cases: k+1 k • If k + 1 is odd, then ℓj = ℓk 2j−1 · ℓ2j . By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can k appear in both ℓk 2j−1 and ℓ2j , so the result is also a (positively) weighted sum of products k of input variables. 
Additionally, if the number of products in ℓk 2j−1 and ℓ2j is mk , then 2 mk+1 = mk , since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of ℓk . This unit is an input to a single unit of ℓk+1 , that will thus be the only unit of ℓk+1 where this input variable appears. k • If k + 1 is even, then ℓk+1 = λjk ℓk 2j−1 + µjk ℓ2j . Again, from the induction assumption, it j must be a (positively) weighted sum of products of input variables, but with mk+1 = 2mk such products. As in the previous case, an input variable will appear in the single unit of ℓk+1 that has as input the single unit of ℓk in which this variable must appear. 2i Proposition 1. The number of products in the sum computed in the output unit l1 of a network √ n−1 . computing a function in F is m2i = 2 Proof. We first prove by induction on k ≥ 1 that for odd k, mk = 22 k 22 1+1 2 2 k+1 2 −2 , and for even k, . This is obviously true for k = 1 since 2 = 2 = 1, and all units in ℓ1 are mk = 2 single products of the form xr xs . Assuming this is true for some k ≥ 1, then: −1 0 −2 • if k + 1 is odd, then from Lemma 1 and the induction assumption, we have: mk+1 = m2 = k 2 k 22 2 −1 k +1 = 22 2 • if k + 1 is even, then instead we have: mk+1 = 2mk = 2 · 22 k+1 2 −2 −2 = 22 = 22 (k+1)+1 2 (k+1) 2 −2 −1 which shows the desired result for k + 1, and thus concludes the induction proof. Applying this result with k = 2i (which is even) yields 2i m2i = 22 2 −1 √ =2 4 22i −1 √ =2 n−1 . 2i Lemma 2. The products computed in the output unit l1 can be split in two groups, one with products containing only variables x1 , . . . , x n and one containing only variables x n +1 , . . . , xn . 2 2 Proof. This is obvious since the last unit is a “sum“ unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1). 2i Lemma 3. The products computed in the output unit l1 involve more than one input variable. k Proof. It is straightforward to show by induction on k ≥ 1 that the products computed by lj all involve more than one input variable, thus it is true in particular for the output layer (k = 2i). Lemma 4. Any shallow sum-product network computing f ∈ F must have a “sum” unit as output. Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables xt (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both x1 and xn must be present in the final output, and thus they must appear in at least one of these two factors. 
Without loss of generality, assume x1 appears in the first factor. Variables x n +1 , . . . , xn then cannot be present in the second factor, since otherwise one product in the output 2 would contain both x1 and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result xn must appear in the first factor, variables x1 , . . . , x n cannot be present in the second factor either. Consequently, no 2 input variable can be present in the second factor, leading to the desired contradiction. Lemma 5. Any shallow sum-product network computing f ∈ F must have only multiplicative units in its hidden layer. Proof. By contradiction, suppose there exists a “sum“ unit in the hidden layer, written s = t∈S αt xt with S the set of input indices appearing in this sum, and αt > 0 for all t ∈ S. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form βt xt for t ∈ S, with βt > 0. This violates Lemma 3, establishing the contradiction. Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f ∈ F √ must have at least 2 n−1 hidden units, if its output unit is a sum and its hidden units are products. Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a γ product of input variables, i.e. its output can be written as Σj wj Πt xt jt with wj ∈ R and γjt ∈ {0, 1}. In order to compute a function in F, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this √ number is equal to 2 n−1 . √ Corollary 1. Any shallow sum-product network computing f ∈ F must have at least 2 units. n−1 hidden Proof. This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure – regardless of the sign of weights). 5 3.3 Discussion Corollary 1 above shows that in order to compute some function in F with n inputs, the number of √ √ units in a shallow network has to be at least 2 n−1 , (i.e. grows exponentially in n). On another hand, the total number of units in the deep (for i > 1) network computing the same function, as described in Section 3.1, is equal to 1 + 2 + 4 + 8 + . . . + 22i−1 (since all units are binary), which is √ also equal to 22i − 1 = n − 1 (i.e. grows only quadratically in n). It shows that some deep sumproduct network with n inputs and depth O(log n) can represent with O(n) units what would √ require O(2 n ) units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output). 4 The family G In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset. 
4.1 Definition Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family G = ∪n≥2,i≥0 Gin of functions represented by sum-product networks, where the sub-family Gin is made of all sum-product networks with n input variables and 2i + 2 layers (including the input layer ℓ0 ), such that: 1. ℓ1 contains summation units; further layers alternate multiplicative and summation units. 2. Summation units have positive weights. 3. All layers are of size n, except the last layer ℓ2i+1 that contains a single sum unit that sums all units in the previous layer ℓ2i . k−1 4. In each layer ℓk for 1 ≤ k ≤ 2i, each unit ℓk takes as inputs {ℓm |m = j}. j An example of a network belonging to G1,3 (i.e. with three layers and three input variables) is shown in Figure 2. ℓ3 = x2 + x2 + x2 + 3(x1x2 + x1x3 + x2x3) = g(x1, x2, x3) 3 2 1 1 + ℓ2 = x2 + x1x2 × 1 1 +x1x3 + x2x3 ℓ1 = x2 + x3 1 × ℓ2 = . . . 2 × ℓ2 = x2 + x1x2 3 3 +x1x3 + x2x3 + + ℓ1 = x1 + x3 2 + ℓ1 = x1 + x2 3 x1 x2 x3 Figure 2: Sum-product network computing a function of G1,3 (summation units’ weights are all 1’s). 4.2 Theoretical results The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing g ∈ G. The proof sketch is as follows: 1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2 and Corollary 2). 2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3). 6 We will also need the following lemma, which states that when n − 1 items each belong to n − 1 sets among a total of n sets, then we can associate to each item one of the sets it belongs to without using the same set for different items. Lemma 7. Let S1 , . . . , Sn be n sets (n ≥ 2) containing elements of {P1 , . . . , Pn−1 }, such that for any q, r, |{r|Pq ∈ Sr }| ≥ n − 1 (i.e. each element Pq belongs to at least n − 1 sets). Then there exist r1 , . . . , rn−1 different indices such that Pq ∈ Srq for 1 ≤ q ≤ n − 1. Proof. Omitted due to lack of space (very easy to prove by construction). Proposition 2. For any 0 ≤ j ≤ i, and any product of variables P = Πn xαt such that αt ∈ N and t=1 t j 2j whose computed value, when expanded as a weighted t αt = (n − 1) , there exists a unit in ℓ sum of products, contains P among these products. Proof. We prove this proposition by induction on j. First, for j = 0, this is obvious since any P of this form must be made of a single input variable xt , that appears in ℓ0 = xt . t Suppose now the proposition is true for some j < i. Consider a product P = Πn xαt such that t=1 t αt ∈ N and t αt = (n − 1)j+1 . P can be factored in n − 1 sub-products of degree (n − 1)j , β i.e. written P = P1 . . . Pn−1 with Pq = Πn xt qt , βqt ∈ N and t βqt = (n − 1)j for all q. By t=1 the induction hypothesis, each Pq can be found in at least one unit ℓ2j . As a result, by property 4 kq (in the definition of family G), each Pq will also appear in the additive layer ℓ2j+1 , in at least n − 1 different units (the only sum unit that may not contain Pq is the one that does not have ℓ2j as input). kq By Lemma 7, we can thus find a set of units ℓ2j+1 such that for any 1 ≤ q ≤ n − 1, the product rq Pq appears in ℓ2j+1 , with indices rq being different from each other. Let 1 ≤ s ≤ n be such that rq 2(j+1) s = rq for all q. 
Then, from property 4 of family G, the multiplicative unit ℓs computes the n−1 2j+1 product Πq=1 ℓrq , and as a result, when expanded as a sum of products, it contains in particular P1 . . . Pn−1 = P . The proposition is thus true for j + 1, and by induction, is true for all j ≤ i. Corollary 2. The output gin of a sum-product network in Gin , when expanded as a sum of products, contains all products of variables of the form Πn xαt such that αt ∈ N and t αt = (n − 1)i . t=1 t Proof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in the multiplicative units of ℓ2i . Since the output unit ℓ2i+1 computes a sum of these multiplicative 1 units (weighted with positive weights), those products are also present in the output. Proposition 3. A shallow negative-weight sum-product network computing gin ∈ Gin must have at least (n − 1)i hidden units. Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute gin , assuming we allow multiplicative units in the hidden layer in the hidden layer to use powers of their inputs in the product they compute (which we allow here for the proof to be more generic). However, it will require at least as many of these units as the number of unique products that can be found in the expansion of gin . In particular, from Corollary 2, it will require at least the number n of unique tuples of the form (α1 , . . . , αn ) such that αt ∈ N and t=1 αt = (n − 1)i . Denoting ni dni = (n − 1)i , this number is known to be equal to n+dni −1 , and it is easy to verify it is higher d than (or equal to) dni for any n ≥ 2 and i ≥ 0. Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, otherwise it would mean one could factor some input variable xt in the computed function output: this is not possible since by Corollary 2, for any variable xt there exist products in the output function that do not involve xt . So all hidden units must be additive, and since the computed function contains products of degree dni , there must be at least dni such hidden units. 7 4.3 Discussion Proposition 3 shows that in order to compute the same function as gin ∈ Gin , the number of units in the shallow network has to grow exponentially in i, i.e. in the network’s depth (while the deep network’s size grows linearly in i). The shallow network also needs to grow polynomially in the number of input variables n (with a degree equal to i), while the deep network grows only linearly in n. It means that some deep sum-product network with n inputs and depth O(i) can represent with O(ni) units what would require O((n − 1)i ) units for a depth-2 network. Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained to either have a specific combination of sum and hidden units (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result presented here for family G holds without requiring any of these assumptions. 5 Conclusion We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions. 
This work focuses on two specific families of functions: finding more general parameterization of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound ǫ) with a much smaller shallow network. Results by Braverman [8] on boolean circuits suggest that similar results as those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is also to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we cannot assume anymore that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite). Acknowledgments The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs. References [1] Ajtai, M. (1983). P1 1 -formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1–48. [2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag. [3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book. Now Publishers, 2009. [4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press. [5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS’05, pages 107–114. MIT Press, Cambridge, MA. [6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press. [7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467. [8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115. [9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167. [10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the meancovariance restricted boltzmann machine. In Advances in Neural Information Processing Systems (NIPS). 8 [11] H˚ stad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th a annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press. [12] H˚ stad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational a Complexity, 1, 113–129. [13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. [14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. 
[15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’10. [16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML’07, pages 473–480. ACM. [17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07, pages 873–880. MIT Press, Cambridge, MA. [18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009. Montreal (Qc), Canada. [19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS’09, pages 1096–1104. [20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta. [21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS’08, pages 1081–1088. [22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML’2009, pages 737–744. [23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110. [24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random field. In NIPS’07, pages 1121–1128, Cambridge, MA. MIT Press. [25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI’2011, Barcelona, Spain. [26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML. [27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS’06, pages 1137–1144. MIT Press. [28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS’07, pages 1185–1192, Cambridge, MA. MIT Press. [29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier. [30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA. [31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56. [32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML’2011. [33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032. [34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS’06, pages 1345–1352. MIT Press, Cambridge, MA. [35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539. [36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA. 
[37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390.
[38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10.
5 0.52923113 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
6 0.52439564 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
7 0.52328217 139 nips-2011-Kernel Bayes' Rule
8 0.52106082 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
9 0.51738602 213 nips-2011-Phase transition in the family of p-resistances
10 0.51317626 29 nips-2011-Algorithms and hardness results for parallel large margin learning
11 0.51163983 199 nips-2011-On fast approximate submodular minimization
12 0.51075584 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
13 0.50880134 22 nips-2011-Active Ranking using Pairwise Comparisons
14 0.50870359 198 nips-2011-On U-processes and clustering performance
15 0.50860298 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
16 0.50786614 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
17 0.50597686 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
18 0.5048365 186 nips-2011-Noise Thresholds for Spectral Clustering
19 0.50157809 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
20 0.50026071 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators