nips nips2011 nips2011-295 knowledge-graph by maker-knowledge-mining

295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Source: pdf

Author: Siwei Lyu

Abstract: When used to learn high dimensional parametric probabilistic models, the classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active developments of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and there lacks a uniﬁed framework to understand them. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed as minimum KL contraction, where we seek optimal parameters that minimizes the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be uniﬁed under the minimum KL contraction framework with different choices of the KL contraction operators. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract When used to learn high dimensional parametric probabilistic models, the classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active developments of non-ML learning methods. [sent-3, score-0.149]

2 Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and there lacks a uniﬁed framework to understand them. [sent-4, score-0.08]

3 Given a set of training data {x(1) , · · · , x(n) }, parameter learning aims to ﬁnd a member in a parametric distribution family, qθ , to best represent the training data. [sent-8, score-0.079]

4 In practice, many useful high dimensional parametric probabilistic models, such as Markov random ﬁelds [18] or products of experts [12], are deﬁned as qθ (x) = qθ (x)/Z(θ), where qθ is the unnormalized model and ˜ ˜ Z(θ) = Rd qθ (x)dx is the partition function. [sent-9, score-0.118]

5 The maximum (log) likelihood (ML) estimation is ˜ the most commonly used method for parameter learning, where the optimal parameter is obtained n 1 by solving argmaxθ n k=1 log qθ (x(k) ). [sent-10, score-0.15]

6 The KL contraction operator is a mapping between 1 probability distributions under which the KL divergence of two distributions tend to reduce unless they are equal. [sent-16, score-0.867]

7 This connection is further generalized in [1], which shows that parameter update in another variant of contrastive divergence is equivalent to a stochastic parameter update in conditional composite likelihood [24]. [sent-20, score-0.432]

8 However, such similarities in numerical implementations are only tangential to the more fundamental relationship among the objective functions of different non-ML learning methods. [sent-21, score-0.08]

9 For instance, it is known that pseudo-likelihood is a special case of conditional composite likelihood [30]. [sent-24, score-0.171]

10 For two different probability distributions p, q ∈ Ωd , their KulbackLeibler (KL) divergence (also known as relative entropy or I-divergence) [6] is deﬁned as KL(p q) = p(x) log p(x) dx. [sent-29, score-0.158]

11 KL divergence is non-negative and equals to zero if and only if p = q almost q(x) Rd everywhere (a. [sent-30, score-0.093]

12 A ˜ ˜ distribution p is a ﬁx point of a distribution operator Φ if p = Φ{p}. [sent-34, score-0.186]

13 A KL contraction operator is a distribution operator, Φ : Ωd → Ωd , such that for any p, q ∈ Ωd , there exist a constant β ≥ 1 for the following condition to hold: KL(p q) − βKL(Φ{p} Φ{q}) ≥ 0. [sent-35, score-0.73]

14 (1) p DKL (p q ) q p = Φ{p} ˜ q = Φ{q} ˜ Subsequently, β is known as the contraction factor, and p ˜ D LHS of Eq. [sent-36, score-0.566]

15 In addition, a KL contraction operator is strict if the equality in Eq. [sent-41, score-0.786]

16 Intuitively, if the KL divergence Figure 1: Illustration of a KL contraction operator on two density functions p and q. [sent-45, score-0.829]

17 is regarded as a “distance” metric of probability distri1 butions , then it is never increased after both distributions are transformed with a KL contraction operator, a graphical illustration of which is shown in Fig. [sent-46, score-0.623]

18 Furthermore, under a strict KL contraction operator, the KL divergence is always reduced unless the two distributions are equal (a. [sent-48, score-0.752]

19 The KL contraction operators are analogous to the contraction operators in ordinary metric spaces, with β having a similar role as the Lipschitz constant [19]. [sent-51, score-1.306]

20 1 Indeed, it is known that the KL divergence behaves like the squared Euclidean distance [6]. [sent-52, score-0.093]

21 (1) gives the general and abstract deﬁnition of KL contraction operators. [sent-54, score-0.566]

22 In the following, we give several examples of KL contraction operators that are constructed from common operations of probability distributions. [sent-55, score-0.679]

23 1 Conditional Distribution We can form a family of KL contraction operators using conditional distributions. [sent-57, score-0.697]

24 Consider x ∈ Rd with distribution p(x) ∈ Ωd and y ∈ Rd , from a conditional distribution, T (y|x), we can deﬁne a distribution operator, as Φc {p}(y) = T T (y|x)p(x)dx = p(y). [sent-58, score-0.088]

25 ˜ (2) Rd The following result shows that Φc is a strict KL contraction operator with β = 1. [sent-59, score-0.768]

26 T Lemma 1 (Cover & Thomas [6]2 ) For two distributions p, q ∈ Ωd , with the distribution operator deﬁned in Eq. [sent-60, score-0.197]

27 (2), we have KL(p q) − KL(Φc {p} Φc {q}) = T T p(y)KL(Tp (x|y) Tq (x|y)) dy ≥ 0, ˜ Rd T (y|x)q(x) q (y) ˜ where Tp (x|y) = T (y|x)p(x) and Tq (x|y) = are the induced conditional distributions p(y) ˜ from p and q with T . [sent-61, score-0.096]

28 2 Marginalization and Marginal Grafting Two related types of KL contraction operators can be constructed based on marginal distributions. [sent-66, score-0.732]

29 This marginalization operation thus deﬁnes a distribution operator between p ∈ Ωd and pA ∈ Ω|A| , as: Φm {p}(xA ) = A p(x)dx\A = pA (xA ) (3) Rd−|A| Another KL contraction operator termed as marginal grafting can also be deﬁned based on pA . [sent-69, score-1.044]

30 For a distribution q(x) ∈ Ωd , the marginal grafting operator is deﬁned as: q(x)pA (xA ) = q\A|A (x\A |xA )pA (xA ), (4) Φg {q}(x) = p,A qA (xA ) Φg {q} can be understood as replacing qA in q(x) with pA . [sent-70, score-0.279]

31 Furthermore, p is the ﬁxed point of operator Φg , as Φg {p} = p. [sent-73, score-0.142]

32 p,A p,A The following result shows that both Φm and Φmg are KL contraction operators, and that they are A p,A in a sense complementary to each other. [sent-74, score-0.566]

33 Lemma 2 (Huber [13]) For two distributions p, q ∈ Ωd , with the distribution operators deﬁned in Eq. [sent-75, score-0.142]

34 A A p,A p,A Furthermore, KL Φg {p} Φg {q} = p,A p,A pA (xA )KL p\A|A (·|xA ) q\A|A (·|xA ) dxA , Rd where p\A|A (·|xA ) and q\A|A (·|xA ) are the conditional distributions induced from p(x) and q(x), and KL(Φm {p} Φm {q}) = KL(pA (xA ) qA (xA )) . [sent-78, score-0.096]

35 A A Lemma 2 also indicates that neither Φm nor Φmg is strict - the KL contraction of p(x) and q(x) for A p,A the former is zero if p\A|A (x\A |xA ) = q\A|A (x\A |xA ) (a. [sent-79, score-0.626]

36 ), even though they may differ on the marginal distribution over xA . [sent-81, score-0.075]

37 And for the latter, having pA (xA ) = qA (xA ) is sufﬁcient to make their KL contraction zero. [sent-82, score-0.566]

38 ˜ g The following result shows that Φb g (5) is a strict KL contraction operator with β = 1/π1 . [sent-89, score-0.768]

39 Lemma 3 For two distributions p, q ∈ Ωd , with the distribution operator deﬁned in Eq. [sent-90, score-0.197]

40 (5), we have 1 1 KL(p q) − KL Φb {p} Φb {q} = p(x) KL pc|x (c|x) qc|x (c|x) dx ≥ 0, ˜ g g π1 π1 Rd where p(c|x) and q(c|x) are the induced posterior conditional distributions from p(x, c) and q (x, c), ˆ ˆ respectively. [sent-91, score-0.178]

41 (6) The following result shows that Φl is a KL contraction operator with β = 1. [sent-97, score-0.708]

42 S Lemma 4 (Csisz` r & Shields [8]) For two distributions p, q ∈ Ωd , with the distribution operator a deﬁned in Eq. [sent-98, score-0.197]

43 (6), we have m KL(p q) − KL Φl {p} Φl {p} = S S where pi (x) = ˜ p(x)×1[x∈S ] i p(x )dx x ∈S and qi (x) = ˜ i PiS KL(˜i qi) ≥ 0, p ˜ i=1 q(x)×1[x∈S ] i are q(x )dx x ∈S the distributions induced from p and i q by restricting to Si , respectively, with 1[·] being the indicator function. [sent-99, score-0.122]

44 ˜ ˜ Note that Φl is in general not strict, as two distributions agree over all pi and qi will have a zero KL S contraction. [sent-100, score-0.115]

45 In this context, the maximum (log) likelihood learning is equivalent to ﬁnding parameter θ that minimizes the KL divergence of p and qθ [8], as argminθ KL(p qθ) = argmaxθ Rd p(x) log qθ (x)dx. [sent-102, score-0.231]

46 The data based ML objective is obtained when we approximate the expectation with sample average as n 1 p(x) log qθ (x)dx ≈ n k=1 log qθ (x(k) ). [sent-103, score-0.133]

47 Rd The KL contraction operators suggest an alternative approach for parametric learning. [sent-104, score-0.698]

48 In particular, the KL contraction of p and qθ under a KL contraction operator is always nonnegative and reaches zero when p and qθ are equal almost everywhere. [sent-105, score-1.288]

49 Therefore, we can minimize their KL contraction under a KL contraction operator to encourage the matching of qθ to p. [sent-106, score-1.297]

50 We term this general approach of parameter learning as minimum KL contraction (MKC). [sent-107, score-0.603]

51 Mathematically, minimum KL contraction may be realized with three different but related types of objective functions. [sent-108, score-0.66]

52 - Type I: With a KL contraction operator Φ, we can ﬁnd optimal θ that directly minimizes the KL contraction between p and qθ , as: argmin KL(p qθ) − βKL(Φ{p} Φ{qθ }) . [sent-109, score-1.372]

53 (7) θ In practice, it may be desirable to use Φ with β = 1 that is also a linear operator for L2 bounded ˜ functions over Rd [19]. [sent-110, score-0.153]

54 (7) can be approximated as n n 1 1 (k) argmin KL(p qθ)−KL(Φ{p} Φ{qθ }) ≈ argmax log qθ (x )− ˜ log Φ{˜θ }(y(k) ), q n n θ θ k=1 k=1 where due to the linearity of Φ, the two terms of Z(θ) in qθ and L{qθ } cancel out each other. [sent-113, score-0.214]

55 Type I MKC objective functions with KL contraction operators induced from conditional distribution, marginalization, marginal grafting, linear transform, and lumping all fall into this category. [sent-115, score-0.88]

56 However, using nonlinear KL contraction operators, such as the one induced from binary mixtures, may also be able to avoid computing the partition function (e. [sent-116, score-0.61]

57 Last, note that when using Type I MKC objective functions with a non-strict KL contraction operator, we cannot guarantee p = qθ even if their corresponding KL contraction is zero. [sent-126, score-1.212]

58 - Type II: Consider a strict KL contraction operator with β = 1, denoted as Φt , is parameterized by an auxiliary parameter t that is different from θ, and for any distribution p ∈ Ωd , we have Φ0 {p} = p and Φt {p} is continuously differentiable with regards to t. [sent-127, score-0.875]

59 Then, the KL divergence Φt {p} and Φt {qθ } can be regarded as a function of t and θ, as: L(t, θ) = KL(Φt {p} Φt {qθ). [sent-128, score-0.104]

60 If the derivative of KL contrac∂t t=0 tion with regards to t is easier to work with than the KL contraction itself (e. [sent-131, score-0.639]

61 5), we can ﬁx δt and equivalently maximizing the derivative, which is the Type II MKC objective function, as ∂ argmax KL(Φt {p} Φt {qθ }) . [sent-134, score-0.136]

62 (8) ∂t θ t=0 - Type III: In the case where we have access to a set of different KL contraction operators, {Φ1 , · · · , Φm }, we can implement the minimum KL contraction principle by ﬁnding optimal θ that minimizes their average KL contraction, as m 1 (KL(p qθ) − βi KL(Φi {p} Φi {qθ })) . [sent-135, score-1.191]

63 (9) argmin m i=1 θ As each KL contraction in the sum is nonnegative, Eq. [sent-136, score-0.649]

64 (9) is zero if and only if each KL contraction is zero. [sent-137, score-0.566]

65 If the consistency of p and qθ with regards to Φi corresponds to certain constraints on qθ , the objective function, Eq. [sent-138, score-0.141]

66 (9) to zero over a sufﬁcient number of certain types of KL contraction operators may indeed ensure equality of p and qθ (e. [sent-141, score-0.671]

67 Using the strict KL contraction operator Φc constructed with a Gaussian conditional distribution T (y − x)2 1 exp − , T (y|x) = 2 2 2σ1 2πσ1 2 with known variance σ1 , we form the Type I MKC objective function. [sent-147, score-0.929]

68 (7) is reduced to a closed form objective function, as: 2 2 2 σ0 σ0 + σ1 1 σ2 σ 2 (µ − µ0 )2 argmin − + log 2 + 12 2 , 2 2 2 2σ 2 2(σ 2 + σ1 ) 2 σ + σ1 2σ (σ + σ1 ) µ,σ 2 2 whose optimal solution, µ = µ0 and σ 2 = σ0 , is obtained by direct differentiation. [sent-149, score-0.184]

69 2 Relation with Contrastive Divergence [12] Next, we consider the general strict KL contraction operator Φc θ constructed from a conditional T distribution, Tθ (y|x), for x, y ∈ Rd , of which the parametric model qθ is a ﬁxed point, as qθ (y) = T (y|x)qθ (x)dx = Φc θ {qθ }(y). [sent-153, score-0.883]

70 The Type I objective function of minimum KL contraction, Eq. [sent-155, score-0.094]

71 (7), for p, qθ ∈ Ωd under Φc θ is T argmin KL(p qθ) − KL Φc θ {p} Φc θ {qθ } = argmin KL(p qθ) − KL(pθ qθ) , T T θ θ Φc θ {p}. [sent-156, score-0.166]

72 T where pθ is the shorthand notation for Note that this is the objective function of the contrastive divergence learning [12]. [sent-157, score-0.294]

73 By ignoring this dependency, the practical parameter update in contrastive divergence only approximately follows the gradient of this objective function [5]. [sent-159, score-0.306]

74 (7), combined with the KL contraction operator constructed from lumping. [sent-162, score-0.734]

75 Using Lemma 4, we have m argmin KL(p qθ) − KL Φl {p} Φl {qθ } S S θ θ m = PiS argmax θ pi (x) log qi (x)dx ≈ argmax ˜ ˜θ θ x∈Si i=1 PiS KL pi qi ˜ ˜θ = argmin 1 n i=1 m n PiS log qi (x(k) ), ˜θ 1[x(k) ∈Si ] i=1 k=1 where {x(1) , · · · , x(n) } are samples from p(x). [sent-163, score-0.557]

76 Minimizing KL contraction in this case is equivalent to maximizing the weighted sum of log likelihood of the probability distributions formed by restricting the overall model to subsets of state space. [sent-164, score-0.706]

77 The last step resembles the partial likelihood objective function [7], which is recently rediscovered in the context of discriminative learning as non-local contrastive objectives [31]. [sent-165, score-0.361]

78 (7), combined with the strict KL contraction operator constructed from the binary mixture operation (Lemma 3). [sent-169, score-0.794]

79 (7) using the deﬁnition of Φb , as: g argmin KL(p qθ) − θ = argmin θ = 1 π1 argmax θ Rd 1 KL Φb {p} Φb {qθ } g g π1 (π0 g(x) + π1 p(x)) log (π0 g(x) + π1 qθ (x)) dx − p(x) log Rd π1 qθ (x) π0 dx + π0 g(x) + π1 qθ (x) π1 g(x) log Rd p(x) log qθ (x)dx Rd π0 g(x) dx. [sent-171, score-0.525]

80 5 Relation with Score Matching [14] Next, we consider the strict KL contraction operator, Φc t , constructed from an isotropic Gaussian T conditional distribution with a time decaying variance (i. [sent-175, score-0.718]

81 T If both p(x) and qθ (x) are functions differentiable with regards to x, it is know that the temporal derivative of the KL contraction of p and qθ under Φc t is in closed form, which is formally stated in T the following result. [sent-179, score-0.663]

82 Lemma 5 (Lyu [25]) For any two distributions p, q ∈ Ωd differentiable with regards to x, we have d 1 KL Φc t {p} Φc t {qθ } = − T T dt 2 where x Rd c x ΦTt p(x) Φc t p(x) T Φc t {p}(x) T − 2 c x ΦTt qθ (x) Φc t qθ (x) T dx, (10) is the gradient operator with regards to x. [sent-180, score-0.308]

83 The last step is the objective function of score matching learning [14]. [sent-184, score-0.13]

84 (7), combined with the KL contraction operator, Φm , constructed from marginalization. [sent-187, score-0.592]

85 According to Lemma 2, we A have argminθ KL(p q) − KL(Φm {p} Φm {q}) = argmaxθ Rd p(x) log q\A|A (x\A |xA )dx ≈ A A argmaxθ 1 n n k=1 (k) (k) log q\A|A (x\A |xA ), where in the last step, expectation over p(x) is replaced with averages over samples from p(x), {x(1) , · · · , x(n) }. [sent-188, score-0.104]

86 This corresponds to the objective function in maximum conditional likelihood [17] or maximum mutual information [2], which are non-ML learning objectives for discriminative learning of high dimensional probabilistic data models. [sent-189, score-0.306]

87 (9), to combine KL contraction operators formed from marginalizations over m different index subsets A1 , · · · , Am : argmin KL(p q) − θ 1 m m i=1 KL Φmi {p} Φmi {q} ≈ argmax A A θ 1 m m i=1 1 n n (k) k=1 (k) log qAi |\Ai (xAi |x\Ai ). [sent-192, score-0.849]

88 This is the objective function in conditional composite likelihood [24, 30, 23, 1] (also rediscovered under the name piecewise learning in [26]). [sent-193, score-0.259]

89 A special case of conditional composite likelihood is when Ai = \{i}, the resulting marginalization operator, Φm , is known as the ith singleton marginalization operator. [sent-194, score-0.27]

90 With the d dif\{i} ferent singleton marginalization operators, we can rewrite the objective function as KL(p q) − 1 d d i=1 KL Φm p Φm q \i \i = 1 d d i=1 R pi (xi )KL pi|\i (xi |x\i ) qi|\i (xi |x\i ) dxi . [sent-195, score-0.16]

91 Note that in this case, the average KL contraction is zero if and only if p(x) and qθ (x) agree on every singleton conditional distribution, i. [sent-196, score-0.645]

92 (9), with the KL contraction operator constructed from the marginal grafting operation. [sent-205, score-0.849]

93 With m = 1, the resulting objective is used in the maximum marginal likelihood or Type-II likelihood learning [9]. [sent-207, score-0.262]

94 First, the proposed minimum KL contraction framework may be further generalized using the more general f -divergence [8], of which the KL divergence is a special case. [sent-210, score-0.684]

95 While many non-ML learning methods covered in this work have been shown to be consistent individually, the uniﬁcation based on the minimum KL contraction may provide a general condition for such asymptotic properties. [sent-215, score-0.606]

96 Last, understanding different existing non-ML learning objectives through minimizing KL contraction also provides a principled approach to devise new non-ML learning methods, by seeking new KL contraction operators, or new combinations of existing KL contraction operators. [sent-216, score-1.77]

97 Bregman divergence as general framework to estimate unnormalized statistical models. [sent-272, score-0.128]

98 Connections between score matching, contrastive divergence, and pseudolikea lihood for continuous-valued variables. [sent-295, score-0.146]

99 Maximum conditional likelihood via bound maximization and the CEM algorithm. [sent-304, score-0.105]

100 A note on composite likelihood inference and model selection. [sent-370, score-0.127]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kl', 0.673), ('contraction', 0.566), ('mkc', 0.189), ('xa', 0.151), ('operator', 0.142), ('contrastive', 0.12), ('divergence', 0.093), ('operators', 0.087), ('rd', 0.084), ('argmin', 0.083), ('dx', 0.082), ('xai', 0.071), ('objective', 0.069), ('argmax', 0.067), ('composite', 0.066), ('grafting', 0.062), ('likelihood', 0.061), ('strict', 0.06), ('regards', 0.06), ('objectives', 0.054), ('marginal', 0.053), ('pis', 0.052), ('pa', 0.051), ('qai', 0.047), ('parametric', 0.045), ('conditional', 0.044), ('qi', 0.04), ('hyv', 0.039), ('marginalization', 0.038), ('gutmann', 0.038), ('unnormalized', 0.035), ('distributions', 0.033), ('qa', 0.033), ('log', 0.032), ('lumping', 0.031), ('type', 0.03), ('pi', 0.03), ('lemma', 0.03), ('constructed', 0.026), ('score', 0.026), ('partition', 0.025), ('minimum', 0.025), ('ml', 0.024), ('lyu', 0.024), ('pseudolikelihood', 0.024), ('matching', 0.023), ('singleton', 0.023), ('distribution', 0.022), ('pai', 0.021), ('uni', 0.021), ('si', 0.021), ('tt', 0.02), ('principle', 0.019), ('termed', 0.019), ('mg', 0.019), ('rediscovered', 0.019), ('relation', 0.019), ('induced', 0.019), ('minimizing', 0.018), ('equality', 0.018), ('maximum', 0.018), ('mutual', 0.018), ('tq', 0.017), ('csisz', 0.017), ('biometrika', 0.017), ('density', 0.017), ('fitting', 0.016), ('partial', 0.015), ('estimation', 0.015), ('tp', 0.015), ('averages', 0.015), ('asymptotic', 0.015), ('minimizes', 0.015), ('iii', 0.014), ('nonnegative', 0.014), ('formed', 0.014), ('differentiable', 0.013), ('transformed', 0.013), ('geometric', 0.013), ('derivative', 0.013), ('dimensional', 0.013), ('uai', 0.013), ('samples', 0.013), ('shorthand', 0.012), ('developments', 0.012), ('consistency', 0.012), ('aistats', 0.012), ('parameter', 0.012), ('last', 0.012), ('seek', 0.012), ('update', 0.012), ('unifying', 0.012), ('agree', 0.012), ('regarded', 0.011), ('discussions', 0.011), ('discriminative', 0.011), ('furthermore', 0.011), ('bregman', 0.011), ('functions', 0.011), ('ai', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Author: Siwei Lyu

2 0.09939272 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression

Author: David A. Knowles, Tom Minka

Abstract: Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability. 1

3 0.080953926 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama

Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high ﬂuctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overﬁts even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach. 1

4 0.067189284 306 nips-2011-t-divergence Based Approximate Inference

Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan

Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by deﬁning a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1

5 0.066522434 302 nips-2011-Variational Learning for Recurrent Spiking Networks

Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner

Abstract: We derive a plausible learning rule for feedforward, feedback and lateral connections in a recurrent network of spiking neurons. Operating in the context of a generative model for distributions of spike sequences, the learning mechanism is derived from variational inference principles. The synaptic plasticity rules found are interesting in that they are strongly reminiscent of experimental Spike Time Dependent Plasticity, and in that they differ for excitatory and inhibitory neurons. A simulation conﬁrms the method’s applicability to learning both stationary and temporal spike patterns. 1

6 0.066354081 162 nips-2011-Lower Bounds for Passive and Active Learning

7 0.063954614 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels

8 0.060839556 171 nips-2011-Metric Learning with Multiple Kernels

9 0.055319566 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

10 0.053581376 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

11 0.053457893 102 nips-2011-Generalised Coupled Tensor Factorisation

12 0.050781298 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

13 0.050186496 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

14 0.047128361 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

15 0.041726347 229 nips-2011-Query-Aware MCMC

16 0.041361935 288 nips-2011-Thinning Measurement Models and Questionnaire Design

17 0.039444212 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

18 0.038522616 283 nips-2011-The Fixed Points of Off-Policy TD

19 0.037927527 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

20 0.034813929 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.1), (1, -0.013), (2, 0.011), (3, -0.048), (4, -0.036), (5, -0.026), (6, -0.01), (7, -0.05), (8, 0.016), (9, -0.008), (10, -0.045), (11, -0.026), (12, 0.033), (13, -0.007), (14, -0.01), (15, -0.042), (16, 0.0), (17, 0.018), (18, -0.019), (19, 0.099), (20, -0.052), (21, -0.016), (22, 0.085), (23, 0.046), (24, 0.044), (25, 0.002), (26, 0.033), (27, 0.004), (28, -0.016), (29, 0.097), (30, 0.01), (31, 0.033), (32, -0.065), (33, 0.007), (34, -0.017), (35, -0.018), (36, 0.013), (37, -0.024), (38, -0.015), (39, 0.032), (40, -0.167), (41, 0.019), (42, 0.044), (43, -0.027), (44, 0.031), (45, 0.094), (46, 0.036), (47, -0.052), (48, 0.008), (49, 0.14)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96459025 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Author: Siwei Lyu

2 0.70114708 306 nips-2011-t-divergence Based Approximate Inference

Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan

3 0.68809438 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama

4 0.56217551 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression

Author: David A. Knowles, Tom Minka

5 0.49982375 288 nips-2011-Thinning Measurement Models and Questionnaire Design

Author: Ricardo Silva

Abstract: Inferring key unobservable features of individuals is an important task in the applied sciences. In particular, an important source of data in ﬁelds such as marketing, social sciences and medicine is questionnaires: answers in such questionnaires are noisy measures of target unobserved features. While comprehensive surveys help to better estimate the latent variables of interest, aiming at a high number of questions comes at a price: refusal to participate in surveys can go up, as well as the rate of missing data; quality of answers can decline; costs associated with applying such questionnaires can also increase. In this paper, we cast the problem of reﬁning existing models for questionnaire data as follows: solve a constrained optimization problem of preserving the maximum amount of information found in a latent variable model using only a subset of existing questions. The goal is to ﬁnd an optimal subset of a given size. For that, we ﬁrst deﬁne an information theoretical measure for quantifying the quality of a reduced questionnaire. Three different approximate inference methods are introduced to solve this problem. Comparisons against a simple but powerful heuristic are presented. 1 Contribution A common goal in the applied sciences is to measure concepts of interest that are not directly observable (Bartholomew et al., 2008). Such is the case in the social sciences, medicine, economics and other ﬁelds, where quantifying key attributes such as “consumer satisfaction,” “anxiety” and “recession” requires the development of indicators: observable variables that are postulated to measure the target latent variables up to some measurement error (Bollen, 1989; Carroll et al., 1995). In a probabilistic framework, this often boils down to a latent variable model (Bishop, 1998). One common setup is to assume each observed indicator Yi as being generated independently given the set of latent variables X. Conditioning on any given observed data point Y gives information about the distribution of the latent vector X, which can then be used for ranking, clustering, visualization or smoothing, among other tasks. Figure 1 provides an illustration. Questionnaires from large surveys are sometimes used to provide such indicators, each Yi recording an answer that typically corresponds to a Bernoulli or ordinal variable. For instance, experts can be given questions concerning whether there is freedom of press in a particular nation, as a way of measuring its democratization level (Bollen, 1989; Palomo et al., 2007). Nations can then be clustering or ranked within an interpretable latent space. Long questionnaires have nevertheless drawbacks, as summarized by Stanton et al. (2002) in the context of psychometric studies: Longer surveys take more time to complete, tend to have more missing data, and have higher refusal rates than short surveys. Arguably, then, techniques to reducing the length of scales while maintaining psychometric quality are wortwhile. 1 Factor scores: countries in the latent space Y2 Y3 Y4 Y5 0 Y1 5 X2 (Democratization) X1 (Industrialization) Democratization 10 Dem1960 Dem1965 1 5 9 13 18 23 28 33 38 43 48 53 58 63 68 73 Country (ordered by industrialization factor) (a) (b) Figure 1: (a) A graphical representation of a latent variable model. Notice that in general latent variables will be dependent. Here, the question is how to quantify democratization and industrialization levels of nations given observed indicators Y such as freedom of press and gross national product, among others (Bollen, 1989; Palomo et al., 2007). (b) An example of a result implied by the model (adapted from Palomo et al. (2007)): barplots of the conditional distribution of democratization levels given the observed indicators at two time points, ordered by the posterior mean industrialization level. The distribution of the latent variables given the observations is the basis of the analysis. Our contribution is a methodology for choosing which indicators to preserve (e.g., which items to keep in a questionnaire) given: i.) a latent variable model speciﬁcation of the domain of interest; ii.) a target number of indicators that should be preserved. To accomplish this, we provide: i.) a target objective function that quantiﬁes the amount of information preserved by a choice of a subset of indicators, with respect to the full set; ii.) algorithms for optimizing this choice of subset with respect to the objective function. The general idea is to start with a target posterior distribution of latent variables, deﬁned by some latent variable measurement model M (i.e., PM (X | Y)). We want to choose a subset Yz ⊂ Y so that the resulting conditional distribution PM (X | Yz ) is as close as possible to the original one according to some metric. Model M is provided either by expertise or by numerous standard approaches that can be applied to learn it from data (e.g., methods in Bishop, 2009). We call this task measurement model thinning. Notice that the size of Yz is a domain-dependent choice. Assuming M is a good model for the data, choosing a subset of indicators will incur some information loss. It is up to the analyst to choose a trade-off between loss of information and the design of simpler, cheaper ways of measuring latent variables. Even if a shorter questionnaire is not to be deployed, the outcome of measurement model thinning provides a formal sensitivity analysis of the target latent distribution with respect to the available indicators. The result is useful to generate different insights into the domain. This paper is organized as follows: Section 2 deﬁnes a formal criterion to quantify how appropriate a subset Yz is. Section 3 describes different approaches in which this criterion can be optimized. Related work is brieﬂy discussed in Section 4. Experiments with synthetic and real data are discussed in Section 5, followed by the conclusion. 2 An Information-Theoretical Criterion Our focus is on domains where latent variables are not a by-product of a dimensionality reduction technique, but the target of the analysis as in the example of Figure 1. That is, measurement error problems where the variables to be recorded are designed speciﬁcally to obtain information concerning such unknowns (Carroll et al., 1995; Bartholomew et al., 2008). As such, we postulate that the outcome of any analysis should be a functional of PM (X | Y), the conditional distribution of unobservables X given observables Y within a model M. It is assumed that M speciﬁes the joint PM (X, Y). We further assume that observed variables are conditionally independent given X, i.e. p PM (X, Y) = PM (X) i=1 PM (Yi | X), with p being the number of observed indicators. 2 If z ≡ (z1 , z2 , . . . , zp ) is a binary vector of the same dimensionality as Y, and Yz is the subset of Y corresponding the non-zero entries of z, we can assess z by the KL divergence PM (X | Y) dX KL(PM (X | Y) || PM (X | Yz )) ≡ PM (X | Y) log PM (X | Yz ) This is well-deﬁned, since both distributions lie in the same sample space despite the difference of dimensionality between Y and Yz . Moreover, since Y is itself a random vector, our criterion becomes the expected KL divergence KL(PM (X | Y) || PM (X | Yz )) PM (Y) where · denotes expectation. Our goal is to minimize this function with respect to z. Rearranging this expression to drop all constants that do not depend on z, and multiplying it by −1 to get a maximization problem, we obtain the problem of ﬁnding z⋆ such that z⋆ = argmaxz log(PM (Yz | X)) PM (X,Yz ) − log(PM (Yz )) PM (Yz ) p zi log(PM (Yi | X)) = argmaxz ≡ + HM (Yz ) i=1 argmaxz FM (z) PM (X,Yi ) p i=1 zi = K for a choice of K, and zi ∈ {0, 1}. HM (·) denotes here the entropy of subject to a distribution parameterized by M. Notice we used the assumption that indicators are mutually independent given X. There is an intuitive appeal of having a joint entropy term to reward not only marginal relationships between indicators and latent variables, but also selections that are jointly diverse. Notice that optimizing this objective function turns out to be equivalent to minimizing the conditional entropy of latent variables given Yz . Motivating conditional entropy from a more fundamental principle illustrates that other functions can be obtained by changing the divergence. 3 Approaches for Approximate Optimization The problem of optimizing FM (z) subject to the constraints p zi = K, zi ∈ {0, 1}, is hard i=1 not only for its combinatorial nature, but due to the entropy term. This needs to be approximated, and the nature of the approximation should depend on the form taken by M. We will assume that it is possible to efﬁciently compute any marginals of PM (Y) of modest dimensionality (say, 10 dimensions). This is the case, for instance, in the probit model for binary data: X ∼ N (0, Σ), Yi⋆ ∼ N (ΛT X + λi;0 , 1), i Yi = 1, if Yi⋆ > 0, and 0 otherwise where N (m, S) is the multivariate Gaussian distribution with mean m and covariance matrix S. The probit model is one of the most common latent variable models for questionnaire data (Bartholomew et al., 2008), with a straigthforward extension to ordinal data. In this model, marginals for a few dozen variables can be obtained efﬁciently since this corresponds to calculating multivariate Gaussian probabilities (Genz, 1992). Parameters can be ﬁt by a variety of methods (Hahn et al., 2010). We also assume that M allows for the computation of log(PM (Yi | X)) PM (X,Yi ) at little cost. Again, in the binary probit model this is simple, since this requires integrating away a single binary variable Yi and a univariate Gaussian ΛT X. i 3.1 Gaussian Entropy One approximation to FM (z) is to replace its entropy term by the corresponding entropy from some Gaussian distribution PN (Yz ). The entropy of a Gaussian distribution is proportional to the logarithm of the determinant of its covariance matrix, and hence can be computed in O(p3 ) steps. This Gaussian can be chosen as the one closest to PM (Yz ) in a KL(PM || PN ) sense: that is, the one with the same ﬁrst and second moments as PM (Yz ). In our case, computing these moments can be done deterministically (up to numerical error) using standard bivariate quadrature methods. No expectation-propagation (Minka, 2001) is necessary. The corresponding objective function is p zi log(PM (Yi | X)) FM;N (z) ≡ i=1 3 PM (X,Yi ) + 0.5 log |Σz | where Σz is the covariance matrix of Yz – which for binary and ordinal data has a sensible interpretation. This function is also an upper bound on the exact function, FM (z), since the Gaussian is the distribution with the largest entropy for a given mean vector and covariance matrix. The resulting function is non-linear in z. In our experiments, we optimize for z using a greedy scheme: for all possible pairs (i, j) such that zi = 1 and zj = 0, we swap its values (so that i zi is always K). We choose the pair with the highest increase in FM;N (z) and repeat the process until convergence. 3.2 Entropy with Bounded Neighborhoods An alternative bound can be derived from a standard fact in information theory: H(Y | S) ≤ H(Y | S ′ ) for S ′ ⊆ S, where H(· | ·) denotes conditional entropy. This was exploited by Globerson and Jaakkola (2007) to deﬁne an upper bound in the entropy of a distribution as follows: consider a permutation e of the set {1, 2, . . . , p}, with e(i) being the i-th element of e. Denote by e(1 : i) the ﬁrst i elements of this permutation (an empty set if i < 1). Moreover, let N (e, i) be a subset of e(1 : i − 1). For a given set variables Y = {Y1 , Y2 , . . . , Yp } the following bound holds: p n H(Ye(i) | YN (e,i) ) H(Ye(i) | Ye(1:i−1) ) ≤ H(Y1 , Y2 , . . . Yp ) = (1) i=1 i=1 If each set N (e, i) is no larger than some constant D, then this bound can be computed in O(p · 2D ) steps for binary probit models. The bound holds for any choice of e, but we want it to be as tight as possible so that it gets weighted in a reasonable way against the other terms in FM (·). Since the entropy function is decomposable as a sum of functions that depend on i and N (e, i) only, one can minimize this bound with respect to e by using permutation optimization methods such as (Jaakkola et al., 2010). In our implementation, we use a method similar to Teyssier and Koller (2005) that shufﬂes neighboring entries of e to generate candidates, chooses the optimal N (e, i) for each i given the candidate e, and picks as the next permutation the candidate e with the greatest decrease in the bound. Notice that a permutation choice e and neighborhood choices N (e, i) deﬁne a Bayesian network where N (e, i) are the parents of Ye(i) . Therefore, if this Bayesian network model provides a good approximation to PM (Y), the bound will be reasonably tight. Given e, we will further relax this bound with the goal of obtaining an integer programming formulation for the problem of optimizing an upper bound to FM (z). For any given z, we deﬁne the local term HL (z, i) as    HL (z, i) ≡ HM (Ye(i) | Yz ∩N (e, i)) = S∈P (N (e,i))  j∈S zj   k∈N (e,i)\S (1 − zk ) HM (Ye(i) | S) (2) where P (·) denotes the power set of a set. The new approximate objective function becomes p p zi log(PM (Yi | X)) FM;D (z) ≡ PM (X,Yi ) ze(i) HL (z, i) + (3) i=1 i=1 Notice that HL (z, i) is still an upper bound on HM (Ye(i) | Ye(1:i−1) ). The intuition is that we are bounding HM (Yz ) by the entropy of a Bayesian network where a vertex Ye(i) is included if ze(i) = 1, with corresponding parents given by Yz ∩ N (e, i). This is a well-deﬁned Bayesian network for any choice of z. The shortcoming is that ideally we would like this Bayesian network to be the actual marginal of the model given by e and N (e, i). It is not: if the network implied by e and N (e, i) was, for instance, Y1 → Y2 → Y3 , the choice of z = (1, 0, 1) would result on the entropy of the disconnected graph {Y1 , Y3 }, while the true marginal would correspond instead to the graph Y1 → Y3 . However, our simpliﬁed marginalization has the advantage of avoiding an intractable problem. Moreover, it allows us to redeﬁne the problem as an integer linear program (ILP). Each product ze(i) j zj k (1−zk ) appearing in (3) results in a sum of O(2D ) terms, each of which has (up to a sign) the form qM ≡ m∈M zm for some set M . It is still the case that qM ∈ {0, 1}. Therefore, objective function (3) can be interpreted as being linear on a set of binary variables {{z}, {q}}. We need further to enforce the constraints coming from qM = 1 ⇒ {∀m ∈ M, zm = 1}; qM = 0 ⇒ {∃m ∈ M s.t. zm = 0} 4 It is well-known (Glover and Woolsey, 1974) that this corresponds to the linear constraints qM = 1 ⇒ {∀m ∈ M, zm = 1} ⇔ ∀m ∈ M, qM − zm ≤ 0 qM = 0 ⇒ {∃m ∈ M s.t. zm = 0} ⇔ m∈M zm − qM ≤ |M | − 1 p which combined with the linear constraint i=1 zi = K implies that optimizing FM;D (z) is an ILP with O(p · 2D ) variables and O(p2 · 2D ) constraints. In our experiments in Section 5, we were able to solve essentially all of such ILPs exactly using linear programming relaxations with branch-and-bound. 3.3 Entropy with Tree-Structured Bounds The previous bound simpliﬁes marginalization, which might badly overestimate entropies where the corresponding Yz are uniformly spread out in permutation e. We now propose a different type of bound which treats different marginalizations on an equal footing. It comes from the following observation: since H(Ye(i) | Ye(1:i−1) ) is less than or equal to any conditional entropy H(Ye(i) | Yj ) for j ∈ e(1 : i − 1), we have that the tighest bound given by singleton conditioning sets is H(Ye(i) | Ye(1:i−1) ) ≤ min j∈e(1:i−1) HM (Ye(i) | Yj ), resulting in the objective function p p zi log(PM (Yi | X)) FM;tree (z) ≡ ze(i) · PM (X,Yi ) + i=1 i=1 min {Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) (4) where min{Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) ≡ H(Ye(i) ) if Ye(1:i−1) ∩ Yz = ∅. The intuition is that we are bounding the exact entropy using the entropy of a directed tree rooted at Yez (1) , the ﬁrst element of Yz according to e. That is, all variables are marginally dependent in the approximation regardless of what z is, and for a ﬁxed z the tree is, by construction, the one obtained by the usual greedy algorithm of adding edges corresponding to the next legal pair of vertices with maximum mutual information (following an ordering, in this case). It turns out we can also write (4) as a linear objective function of a polynomial number of 0\1 (i−1) (2) (1) be the values of set variables and constraints. Let zi ≡ 1 − zi . Let Hi , Hi , . . . , Hi ¯ (i−1) (1) be{HM (Ye(i) | Ye(1) ), . . . , HM (Ye(i) | Ye(i−1) )} sorted in ascending order, with zi , . . . , zi ing the corresponding permutation of {ze(1) , . . . , ze(i−1) }. We have min{Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) (j) (j) j−1 (1) (1) (1) (2) (2) (1) (2) (3) (3) ¯ ¯ ¯ = z i Hi + z i z i H i + z i z i z i H i + . . . (i−2) (i−1) (i−1) (1) i−1 (j) + j=1 zi HM (Ye(i) ) ¯ Hi zi ¯ zi . . . zi ¯ (i) i−1 (j) (j) + qi HM (Ye(i) ) ≡ j=1 qi Hi (k) ¯ where qi ≡ zi k=1 zi , and also a binary 0\1 variable. Plugging this expression into (4) gives a linear objective function in this extended variable space. The corresponding constraints are (j−1) (j) (1) (j) , zi }, zm = 1} ¯ z qi = 1 ⇒ {∀zm ∈ {¯i , . . . , zi (j−1) (j) (1) (j) , zi } s.t. zm = 0} ¯ z qi = 0 ⇒ {∃zm ∈ {¯i , . . . , zi which, as shown in the previous section, can be written as linear constraints (substituting each zi ¯ by 1 − zi ). The total number of constraints is however O(p3 ), which can be expensive, and often a linear relaxation procedure with branch-and-bound fails to provide guarantees of optimality. 3.4 The Reliability Score Finally, it is important to design cheap, effective criteria whose maxima correlate with the maxima of FM (·). Empirically, we have found high quality selections in binary probit models using the solution to the problem p p wi zi , subject to zi ∈ {0, 1}, maximize FM;R (z) = zi = K i=1 i=1 5 where wi = ΛT ΣΛi . This can be solved by picking the corresponding indicators with the highest i K weights wi . Assuming a probit model where the measurement error for each Yi⋆ has the same variance of 1, this score is related to the “reliability” of an indicator. Simply put, the reliability Ri of an indicator is the proportion of its variance that is due to the latent variables (Bollen, 1989, Chapter 6): Ri = wi /(wi + 1) for each Yi⋆ . There is no current theory linking this solution to the problem of maximizing FM (·): since there is no entropy term, we can set an adversarial problem to easily defeat this method. For instance, this happens in a model where the K indicators of highest reliability all measure the same latent variable Xi and nothing else – much information about Xi would be preserved, but little about other variables. In any case, we found this criterion to be fairly competitive even if at times it produces extreme failures. An honest account of more sophisticated selection mechanisms cannot be performed without including it, as we do in Section 5. 4 Related Work The literature on survey analysis, in the context of latent variable models, contains several examples of guidelines on how to simplify questionnaires (sometimes described as providing “shortened versions” of scales). Much of the literature, however, consists of describing general guidelines and rules-of-thumb to accomplish this task (e.g, Richins, 2004; Stanton et al., 2002). One possible exception is Leite et al. (2008), which uses different model ﬁtness criteria with respect to a given dataset to score candidate solutions, along with an expensive combinatorial optimization method. This conﬂates model selection and questionnaire thinning, and there is no theory linking the score functions to the amount of information preserved. In the machine learning and statistics literature, there is a large body of research in active learning, which is related to our task. One of the closest approaches is the one by Liang et al. (2009), which casts the classical problem of measurement selection within a Bayesian graphical model perspective. In that work, one has to choose which measurements to add. This is done sequentially, partially motivated by problems where collecting new measurements can be done relatively fast and cheap (say, by paying graduate students to annotate text data), and so the choice of next measurement can make use of fresh data. In our case, it not might be realistic to expect we can perform a large number of iterations of data collection – and as such the task of reducing the number of measurements from a large initial collection might be more relevant in practice. Liang et al. also focus on (multivariate) supervised learning instead of purely unsupervised learning. In statistics there is also a considerable body of literature on sufﬁcient dimension reduction and its sparse variants (e.g., Chen et al., 2010). Such techniques create a bottleneck between two sets of variables in a regression problem (say, the mapping from Y to X) while eliminating some of the input variables. In principle one might want to adapt such models to take a latent variable model M as the target mapping. Besides some loss of interpretability, the computational implications might be problematic, though. Moreover, this framework has another free parameter corresponding to the dimensionality of the bottleneck that has to be set. It is not clear how this parameter, along with a choice of sparsity level, would interact with a ﬁxed choice K of indicators to be kept. 5 Experiments In this section, we ﬁrst describe some synthetic experiments to provide insights about the different methods, followed by one brief description of a case study. In all of the experiments, the target models M are binary probit. We set the neighborhood parameter for FM;N (·) to 9. The ordering e for the tree-structured method is obtained by the same greedy search of Section 3.2, where now the score is the average of all H(Yi | Yj ) for all Yj preceding Yi . Finally, all ordering optimization methods were initialized by sorting indicators in a descending order according to their reliability scores, and the initial solution for all entropy-based optimization methods was given by the reliability score solution of Section 3.4. The integer program solver G UROBI 4.02 was used in all experiments. 5.1 Synthetic studies We start with a batch of synthetic experiments. We generated 80 models with 40 indicators and 10 latent variables1 . We further preprocess such models into two groups: in 40 of them, we select a 1 Details on the model generation: we generate 40 models by sampling the latent covariance matrix from an inverse Wishart distribution with 10 degrees of freedom and scale matrix 10I, I being the identity matrix. 6 Improvement ratio: high signal Mean error: high signal Improvement ratio: low signal Mean error: low signal 0.6 0.5 0.5 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.25 Reliability score Reliability score 0.4 0.2 0.15 0.25 0.2 0.15 −0.1 −0.1 0.1 N/R T/R G/R N/S T/S G/S N/R T/R G/R N/S T/S 0.1 G/S 0.1 0.15 0.2 0.25 0.3 0.1 0.15 0.2 Tree bound (a) (c) (b) 0.25 0.3 Tree bound (d) Figure 2: (a) A comparison of the bounded neighborhood (N ), tree-based (T ) and Gaussian (G) methods with respect to a random solution (R) and the reliability score (S). (b) A similar comparison for models where indicators are more weakly correlated to the latent variables than in (a). (c) and (d) Scatterplots of the average absolute deviance for the tree-based method (horizontal axis) against the reliability method (vertical axis). The bottom-left clouds correspond to the K = 32 trials. target reliability ri for each indicator Yi , uniformly at random from the interval [0.4 0.7]. We then rescale coefﬁcients Λi such that the reliability (deﬁned in Section 3.4) of the respective Yi⋆ becomes ri . For the remaining 40 models, we sample ri uniformly at random from the interval [0.2 0.4]. We perform two choices of subsets: sets Yz of size 20 and 32 (50% and 80% of the total number of indicators). Our evaluation is as follows: since the expected value is perhaps the most common functional of the posterior distribution PM (X | Y), we calculate the expected value of the latent variables for a sample {y(1) , y(2) , . . . , y(1000) } of size 1000 taken from the respective synthetic models. This is done for the full set of 40 indicators, and for each set chosen by our four criteria: for each data point i and each objective function F , we evaluate the average distance (i) (i) (i) (i) ˆ ˆ dF ≡ 10 |ˆj − xj;F |/10. In this case, xj is the expected value of Xj obtained by conditioning j=1 x (i) on all indicators, while xj;F is the one obtained with the subset selected by optimizing F . We denote ˆ (1) (2) (1000) by mF the average of {dF , dF , . . . , dF }. Finally, we compare the three main methods with respect to the reliability score method using the improvement ratio statistic sF = 1 − mF /mFM;R , the proportion of average error decrease with respect to the reliability score. In order to provide a sense of scale on the difﬁculty of each problem, we compute the same ratios with respect to a random selection, obtained by choosing K = 20 and K = 32 indicators uniformly at random. Figure 2 provides a summary of the results. In Figure 2(a), each boxplot shows the distribution over the 40 probit models where reliabilities were sampled between [0.4 0.7] (the “high signal” models). The ﬁrst three boxplots show the scores sF of the bounded neighborhood, tree-structured and Gaussian methods, respectively, compared against random selections. The last three boxplots are comparisons against the reliability heuristic. The tree-based method easily beats the Gaussian method, with about 75% of its outcomes being better than the median Gaussian outcome. The Gaussian approach is also less reliable, with results showing a long lower tail. Although the reliability score is on average a good approach, in only a handful of cases it was better than the tree-based method, and by considerably smaller magnitudes compared to the upper tails in the tree-based outcome distribution. A separate panel (Figure 2(b)) is shown for the 40 models with lower reliabilities. In this case, all methods show stronger improvements over the reliability score, although now there is a less clear difference between the tree method and the Gaussian one. Finally, in panels (c) and (d) we present scatterplots for the average deviances mF of the tree-based method against the reliability score. The two clouds correspond to the solutions with 20 and 32 indicators. Notice that in the vast majority of the cases the tree-based method does better. We then rescale the matrix to make all variances equal to 1. We also generate 40 models using as the inverse Wishart scale matrix the correlation matrix will all off-diagonal entries set to 0.5. Coefﬁcients linking indicators to latent variables were set to zero with probability 0.8, and sampled from a standard Gaussian otherwise. If some latent variable ends up with no child, or an indicator ends up with no parent, we uniformly choose one child/parent to be linked to it. Code to fully replicate the synthetic experiments is available at HTTP :// WWW. HOMEPAGES . UCL . AC . UK /∼ UCGTRBD /. 7 5.2 Case study The National Health Service (NHS) is the public health system of the United Kingdom. In 2009, a major survey called the National Health Service National Staff Survey was deployed with the goal of “collect(ing) staff views about working in their local NHS trust” (Care Quality Comission and Aston University, 2010). A questionnaire of 206 items was ﬁlled by 156, 951 respondents. We designed a measurement model based on the structure of the questionnaire and ﬁt it by the posterior expected value estimator. Gaussian and inverse Wishart priors were used, along with Gibbs sampling and a random subset of 50, 000 respondents. See the Supplementary Material for more details. Several items in this survey asked for the NHS staff member to provide degrees of agreement in a Likert scale (Bartholomew et al., 2008) to questions such as • • • • . . . have you ever come to work despite not feeling well enough to perform . . . ? Have you felt pressure from your manager to come to work? Have you felt pressure from colleagues to come to work? Have you put yourself under pressure to come to work? as different probes into an unobservable self-assessed level of work pressure. We preprocessed and binarized the data to ﬁrst narrow it down to 63 questions. We compare selections of 32 (50%) and 50 (80%) items using the same statistics of the previous section. sF ;D sF ;tree sF ;N sF ;random mF ;tree mF ;R K = 32 K = 50 7.8% 10.5% 6.3% 11.9% 0.01% 7.6% −16.0% −0.05% 0.238 0.123 0.255 0.140 Although gains were relatively small (as measured by the difference between reconstruction errors mF ;tree − mF ;R and the good performance of a random selection), we showed that: i.) we do improve results over a popular measure of indicator quality; ii.) we do provide some guarantees about the diversity of the selected items via a information-theoretical measure with an entropy term, with theoretically sound approximations to such a measure. For more details on the preprocessing, and more insights into the different selections, please refer to the Supplementary Material. 6 Conclusion There are problems where one posits that the relevant information is encoded in the posterior distribution of a set of latent variables. Questionnaires (and other instruments) can be used as evidence to generate this posterior, but there is a cost associated with complex questionnaires. One problem is how to simplify such instruments of measurement. To the best of our knowledge, we provide the ﬁrst formal account on how to solve it. Nevertheless, we would like to stress there is no substitute for common sense. While the tools we provide here can be used for a variety of analyses – from deploying simpler questionnaires to sensitivity analysis – the value and cost of keeping particular indicators can go much beyond the information contained in the latent posterior distribution. How to combine this criterion with other domain-dependent criteria is a matter of future research. Another problem of importance is how to deal with model speciﬁcation and transportability across studies. A measurement model built for a very speciﬁc population of respondents might transfer poorly to another group, and therefore taking into account model uncertainty will be important. The Bayesian setup discussed by Liang et al. (2009) might provide some directions on this issue. Also, there is further structure in real-world questionnaires we are not exploiting in the current work. Namely, it is not uncommon to have questionnaires with branching questions and other dynamic behaviour more commonly associated with Web based surveys and/or longitudinal studies. Finally, hybrid approaches combining the bounded neighborhood and the tree-structured methods, along with more sophisticated ordering optimization procedures and the use of other divergence measures and determinant-based criteria (e.g. Kulesza and Taskar, 2011), will also be studied in the future. Acknowledgments The author would like to thank James Cussens and Simon Lacoste-Julien for helpful discussions, as well as the anonymous reviewers for further comments. 8 References D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. Analysis of Multivariate Social Science Data, 2nd edition. Chapman & Hall, 2008. C. Bishop. Latent variable models. In M. Jordan (editor), Learning in Graphical Models, pages 371–403, 1998. C. Bishop. Pattern Recognition and Machine Learning. Springer, 2009. K. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989. R. Carroll, D. Ruppert, and L. Stefanski. Measurement Error in Nonlinear Models. Chapman & Hall, 1995. X. Chen, C. Zou, and R. Cook. Coordinate-independent sparse sufﬁcient dimension reduction and variable selection. Annals of Statistics, 38:3696–3723, 2010. Care Quality Commission and Aston University. Aston Business School, National Health Service National Staff Survey, 2009 [computer ﬁle]. Colchester, Essex: UK Data Archive [distributor], October 2010. Available at HTTPS :// WWW. ESDS . AC . UK, SN: 6570, 2010. A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992. A. Globerson and T. Jaakkola. Approximate inference using conditional entropy decompositions. Proceedings of the 11th International Conference on Artiﬁcial Intelligence and Statistics (AISTATS 2007), pages 141–149, 2007. F. Glover and E. Woolsey. Converting the 0-1 polynomial programming problem to a 0-1 linear program. Operations Research, 22:180–182, 1974. P. Hahn, J. Scott, and C. Carvalho. A sparse factor-analytic probit model for congressional voting patterns. Duke University Department of Statistical Science, Discussion Paper 2009-22, 2010. T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian network structure using LP relaxations. Proceedings of the 13th International Conference on Artiﬁcial Intelligence and Statistics (AISTATS 2010), pages 366–373, 2010. A. Kulesza and B. Taskar. k-DPPs: ﬁxed-size determinantal point processes. Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1193–1200, 2011. W. Leite, I-C. Huang, and G. Marcoulides. Item selection for the development of short forms of scales using an ant colony optimization algorithm. Multivariate Behavioral Research, 43:411– 431, 2008. P. Liang, M. Jordan, and D. Klein. Learning from measurements in exponential families. Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09), 2009. T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, Massachussets Institute of Technology, 2001. J. Palomo, D. Dunson, and K. Bollen. Bayesian structural equation modeling. In Sik-Yum Lee (ed.), Handbook of Latent Variable and Related Models, pages 163–188, 2007. M. Richins. The material values scale: Measurement properties and development of a short form. The Journal of Consumer Research, 31:209–219, 2004. J. Stanton, E. Sinar, W. Balzer, and P. Smith. Issues and strategies for reducing the length of selfreported scales. Personnel Psychology, 55:167–194, 2002. M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Proceedings of the Twenty-ﬁrst Conference on Uncertainty in AI (UAI ’05), pages 584–590, 2005. 9

6 0.49824274 69 nips-2011-Differentially Private M-Estimators

7 0.48529217 123 nips-2011-How biased are maximum entropy models?

8 0.45255274 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

9 0.40800244 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

10 0.39886573 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation

11 0.39311212 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

12 0.3900525 240 nips-2011-Robust Multi-Class Gaussian Process Classification

13 0.38013285 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

14 0.37911737 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo

15 0.37604249 14 nips-2011-A concave regularization technique for sparse mixture models

16 0.36798432 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

17 0.36619619 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

18 0.36333075 33 nips-2011-An Exact Algorithm for F-Measure Maximization

19 0.35990819 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

20 0.3586725 60 nips-2011-Confidence Sets for Network Structure

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.04), (4, 0.029), (10, 0.312), (20, 0.015), (26, 0.029), (31, 0.062), (33, 0.031), (43, 0.052), (45, 0.097), (57, 0.033), (65, 0.016), (74, 0.064), (83, 0.051), (99, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78203911 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Author: Siwei Lyu

2 0.7472229 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts

Author: Matthias Hein, Simon Setzer

Abstract: Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efﬁcient ﬁrst-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately. 1

3 0.70330197 76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

Author: Philipp Krähenbühl, Vladlen Koltun

Abstract: Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random ﬁelds deﬁned over pixels or image regions. While regionlevel models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. In this paper, we consider fully connected CRF models deﬁned on the complete set of pixels in an image. The resulting graphs have billions of edges, making traditional inference algorithms impractical. Our main contribution is a highly efﬁcient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are deﬁned by a linear combination of Gaussian kernels. Our experiments demonstrate that dense connectivity at the pixel level substantially improves segmentation and labeling accuracy. 1

4 0.69391006 123 nips-2011-How biased are maximum entropy models?

Author: Jakob H. Macke, Iain Murray, Peter E. Latham

Abstract: Maximum entropy models have become popular statistical models in neuroscience and other areas in biology, and can be useful tools for obtaining estimates of mutual information in biological systems. However, maximum entropy models ﬁt to small data sets can be subject to sampling bias; i.e. the true entropy of the data can be severely underestimated. Here we study the sampling properties of estimates of the entropy obtained from maximum entropy models. We show that if the data is generated by a distribution that lies in the model class, the bias is equal to the number of parameters divided by twice the number of observations. However, in practice, the true distribution is usually outside the model class, and we show here that this misspeciﬁcation can lead to much larger bias. We provide a perturbative approximation of the maximally expected bias when the true model is out of model class, and we illustrate our results using numerical simulations of an Ising model; i.e. the second-order maximum entropy distribution on binary data. 1

5 0.64913434 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they ﬁnd the complete information they need. We are speciﬁcally interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a conﬁdent diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) deﬁned over a ground set V is called submodular if it satisﬁes the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, inﬂuence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is deﬁned to be the set containing the ﬁrst i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order speciﬁed by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it ﬁrst in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have deﬁned it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can deﬁne F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a ﬁxed set of candidate diseases, then this function is submodular. Speciﬁcally, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The ofﬂine approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this ofﬂine algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single ﬁxed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem ﬁts in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard ofﬂine problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efﬁcient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best ﬁxed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an ofﬂine approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an ofﬂine algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this ofﬂine algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known ﬁnite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Ofﬂine Analysis In this work we present an algorithm for online submodular set cover which extends the ofﬂine algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this ofﬂine algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Ofﬂine Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in ofﬂine analysis the ofﬂine algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no beneﬁt but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is deﬁned such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it ﬁnds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Speciﬁcally, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), deﬁne a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) deﬁne a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next deﬁne truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one ﬁnal lemma before proving the main result of this section. For a sequence S deﬁne Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Deﬁne Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it sufﬁces to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it sufﬁces to show that after selecting the ﬁrst k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Deﬁne a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ deﬁnition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to ﬁx. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modiﬁcation does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the ﬁrst k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the ofﬂine setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Deﬁne F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to ﬁrst select the two broad actions covering all common ads then select the narrow actions in any order. However, the ofﬂine cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for ofﬂine solutions as baselines. As seen in the ﬁgure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the ofﬂine cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We deﬁne an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Deﬁne the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Deﬁne F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Speciﬁcally, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Deﬁne F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netﬂix’s Watch Instantly service and use 803 questions based on those we used for an ofﬂine problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modiﬁcation of Normal-Hedge which ﬁxes ct = 1 for a ﬁxed, more aggressive learning rate than theoretically justiﬁed. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed signiﬁcantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modiﬁed Normal-Hedge has no theoretical justiﬁcation, so it may not generalize to other problems. For the modiﬁed Normal-Hedge the ﬁnal average cover times are 7.72 adaptive and 8.22 cumulative. The ofﬂine values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a ﬁxed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of inﬂuence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efﬁcient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9

6 0.52289957 227 nips-2011-Pylon Model for Semantic Segmentation

7 0.49890488 199 nips-2011-On fast approximate submodular minimization

8 0.48899949 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

9 0.4786748 32 nips-2011-An Empirical Evaluation of Thompson Sampling

10 0.47420847 277 nips-2011-Submodular Multi-Label Learning

11 0.47229591 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

12 0.46839708 186 nips-2011-Noise Thresholds for Spectral Clustering

13 0.46150798 276 nips-2011-Structured sparse coding via lateral inhibition

14 0.4593578 258 nips-2011-Sparse Bayesian Multi-Task Learning

15 0.45708743 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

16 0.45690563 265 nips-2011-Sparse recovery by thresholded non-negative least squares

17 0.4568634 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

18 0.45680028 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

19 0.45660719 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

20 0.45615768 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs