nips nips2012 nips2012-359 knowledge-graph by maker-knowledge-mining

359 nips-2012-Variational Inference for Crowdsourcing


Source: pdf

Author: Qiang Liu, Jian Peng, Alex Ihler

Abstract: Crowdsourcing has become a popular paradigm for labeling large datasets. However, it has given rise to the computational task of aggregating the crowdsourced labels provided by a collection of unreliable annotators. We approach this problem by transforming it into a standard inference problem in graphical models, and applying approximate variational methods, including belief propagation (BP) and mean field (MF). We show that our BP algorithm generalizes both majority voting and a recent algorithm by Karger et al. [1], while our MF method is closely related to a commonly used EM algorithm. In both cases, we find that the performance of the algorithms critically depends on the choice of a prior distribution on the workers’ reliability; by choosing the prior properly, both BP and MF (and EM) perform surprisingly well on both simulated and real-world datasets, competitive with state-of-the-art algorithms based on more complicated modeling assumptions. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 However, it has given rise to the computational task of aggregating the crowdsourced labels provided by a collection of unreliable annotators. [sent-8, score-0.094]

2 We approach this problem by transforming it into a standard inference problem in graphical models, and applying approximate variational methods, including belief propagation (BP) and mean field (MF). [sent-9, score-0.118]

3 We show that our BP algorithm generalizes both majority voting and a recent algorithm by Karger et al. [sent-10, score-0.174]

4 Resources such as Amazon Mechanical Turk provide markets where the requestors can post tasks known as HITs (Human Intelligence Tasks) and collect large numbers of labels from hundreds of online workers (or annotators) in a short time and with relatively low cost. [sent-14, score-0.374]

5 The baseline majority voting heuristic, which simply assigns the label returned by the majority of the workers, is known to be error-prone, because it counts all the annotators equally. [sent-18, score-0.311]

6 A line of early work builds simple models characterizing the annotators using confusion matrices, and infers the labels using the EM algorithm [e. [sent-21, score-0.091]

7 On the other hand, variational approaches, including the popular belief propagation (BP) and mean field (MF) methods, provide powerful inference tools for probabilistic graphical models [10, 11]. [sent-30, score-0.118]

8 To our knowledge, no previous attempts have taken advantage of variational tools for the crowdsourcing problem. [sent-34, score-0.103]

9 In this work, we approach the crowdsourcing problems using tools and concepts from variational inference methods for graphical models. [sent-40, score-0.136]

10 First, we present a belief-propagation-based method, which we show includes both KOS and majority voting as special cases, in which particular prior distributions are assumed on the workers’ abilities. [sent-41, score-0.249]

11 2 Background Assume there are M workers and N tasks with binary labels {±1}. [sent-46, score-0.374]

12 Denote by zi ∈ {±1}, i ∈ [N ] the true label of task i, where [N ] represents the set of first N integers; Nj is the set of tasks labeled by worker j, and Mi the workers labeling task i. [sent-47, score-0.611]

13 The task assignment scheme can be represented by a bipartite graph where an edge (i, j) denotes that the task i is labeled by the worker j. [sent-48, score-0.18]

14 The labeling results form a matrix L ∈ {0, ±1}^{N×M}, where Lij ∈ {±1} denotes the answer if worker j labels task i, and Lij = 0 otherwise. [sent-49, score-0.14]

15 The goal is to find an optimal estimator ẑ of the true labels z given the observation L, minimizing the average bit-wise error rate (1/N) Σ_{i∈[N]} prob[ẑi ≠ zi]. [sent-50, score-0.201]

16 We assume that all the tasks have the same level of difficulty, but that workers may have different predictive abilities. [sent-51, score-0.333]

17 Following Karger et al. [1], we initially assume that the ability of worker j is measured by a single parameter qj, which corresponds to their probability of correctness: qj = prob[Lij = zi]. [sent-53, score-1.364]

18 The values of qj reflect the abilities of the workers: qj ≈ 1 correspond to experts that provide reliable answers; qj ≈ 1/2 denote spammers that give random labels independent of the questions; and qj < 1/2 denote adversaries that tend to provide opposite answers. [sent-57, score-2.578]

19 We assume the qj of all workers are drawn independently from a common prior p(qj |θ), where θ are the hyper-parameters. [sent-59, score-0.952]

20 To avoid the cases when adversaries and/or spammers overwhelm the system, it is reasonable to require that E[qj|θ] > 1/2. [sent-60, score-0.152]

21 Typical priors include the Beta prior p(qj|θ) ∝ qj^{α−1}(1 − qj)^{β−1} and discrete priors, e.g. [sent-61, score-1.33]

22 The majority voting (MV) method aggregates the workers’ labels by ẑi^{majority} = sign[Σ_{j∈Mi} Lij]. [sent-66, score-0.357]

23 The limitation of MV is that it weights all the workers equally, and performs poorly when the qualities of the workers are diverse, especially when adversarial workers exist. [sent-67, score-0.311]
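For concreteness, a minimal majority-voting sketch in Python/NumPy (our own illustration, not code from the paper; it assumes the label matrix L is stored as an (N, M) array with entries in {-1, 0, +1}, with 0 marking missing labels):

import numpy as np

def majority_vote(L):
    # Each task's estimate is the sign of the sum of the labels it received;
    # entries equal to 0 (task not labeled by that worker) contribute nothing.
    s = L.sum(axis=1)
    return np.where(s >= 0, 1, -1)   # ties broken arbitrarily towards +1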

24 Weighting the workers properly requires estimating their abilities qj, usually via a maximum a posteriori estimator, q̂ = arg max_q log p(q|L, θ) = arg max_q log Σ_z p(q, z|L, θ). [sent-69, score-0.962]

25 Assuming a Beta(α, β) prior on qj, EM is formulated as E-step: µi(zi) ∝ ∏_{j∈Mi} q̂j^{δij} (1 − q̂j)^{1−δij}, M-step: q̂j = (Σ_{i∈Nj} µi(Lij) + α − 1) / (|Nj| + α + β − 2), (1) where δij = I[Lij = zi]; zi is then estimated via ẑi = arg max_{zi} µi(zi). [sent-73, score-2.877]
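A runnable sketch of update (1) (our own illustration; variable names, the initialization, and the small clipping constant are ours — the paper only notes that a small constant is added to keep q̂j away from the degenerate values 0 and 1):

import numpy as np

def em_crowdsourcing(L, alpha=2.0, beta=1.0, iters=50, eps=1e-6):
    # L: (N, M) array in {-1, 0, +1}; 0 marks a missing label.
    N, M = L.shape
    observed = (L != 0)
    n_j = observed.sum(axis=0)              # |N_j|: number of tasks labeled by worker j
    q = np.full(M, 0.8)                     # initial worker reliabilities (arbitrary)
    mu = np.full(N, 0.5)                    # mu_i(z_i = +1)
    for _ in range(iters):
        # E-step: mu_i(z_i) proportional to prod_{j in M_i} q_j^{delta_ij} (1 - q_j)^{1 - delta_ij}
        log_pos = np.zeros(N)
        log_neg = np.zeros(N)
        for j in range(M):
            idx = observed[:, j]
            agree_pos = (L[idx, j] == 1)    # delta_ij assuming z_i = +1
            log_pos[idx] += np.where(agree_pos, np.log(q[j]), np.log(1 - q[j]))
            log_neg[idx] += np.where(agree_pos, np.log(1 - q[j]), np.log(q[j]))
        mu = 1.0 / (1.0 + np.exp(log_neg - log_pos))
        # M-step: q_j = (sum_{i in N_j} mu_i(L_ij) + alpha - 1) / (|N_j| + alpha + beta - 2)
        for j in range(M):
            idx = observed[:, j]
            mu_at_label = np.where(L[idx, j] == 1, mu[idx], 1 - mu[idx])
            q[j] = (mu_at_label.sum() + alpha - 1) / (n_j[j] + alpha + beta - 2)
        q = np.clip(q, eps, 1 - eps)        # keep q_j off 0 and 1 (degenerate case discussed later)
    return np.where(mu >= 0.5, 1, -1), q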

26 Let xi→j and yj→i be real-valued messages from tasks to workers and from workers to tasks, respectively. [sent-77, score-0.654]

27 Initializing y^0_{j→i} randomly from Normal(1, 1) or deterministically by y^0_{j→i} = 1, KOS updates the messages at the t-th iteration via x^{t+1}_{i→j} = Σ_{j'∈Mi\j} Lij' y^t_{j'→i}, y^{t+1}_{j→i} = Σ_{i'∈Nj\i} Li'j x^{t+1}_{i'→j}, (2) and the labels are estimated via ŝ^t_i = sign[x̂^t_i], where x̂^t_i = Σ_{j∈Mi} Lij y^t_{j→i}. [sent-78, score-0.14]

28 Note that the 0-th iteration of KOS reduces to majority voting when initialized with y^0_{j→i} = 1. [sent-79, score-0.174]
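A sketch of iteration (2) with the edge messages stored as dense (N, M) arrays (our own illustration; the "\ j" and "\ i" exclusions are implemented by subtracting each edge's own contribution from the full sum):

import numpy as np

def kos(L, iters=20, seed=0):
    # L: (N, M) array in {-1, 0, +1}; messages live only on the edges (nonzero entries).
    rng = np.random.default_rng(seed)
    edges = (L != 0)
    y = rng.normal(1.0, 1.0, size=L.shape) * edges        # y_{j->i}^0 ~ Normal(1, 1)
    for _ in range(iters):
        # x_{i->j}^{t+1} = sum_{j' in M_i \ j} L_{ij'} y_{j'->i}^t
        x = ((L * y).sum(axis=1, keepdims=True) - L * y) * edges
        # y_{j->i}^{t+1} = sum_{i' in N_j \ i} L_{i'j} x_{i'->j}^{t+1}
        y = ((L * x).sum(axis=0, keepdims=True) - L * x) * edges
    x_hat = (L * y).sum(axis=1)                           # x_hat_i = sum_{j in M_i} L_{ij} y_{j->i}
    return np.where(x_hat >= 0, 1, -1)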

29 KOS has surprisingly nice theoretical properties on locally tree-like assignment graphs: its error rate is shown to scale in the same manner as an oracle lower bound that assumes the true qj are known. [sent-80, score-0.696]

30 [1] – the performance of KOS was shown to degenerate, sometimes performing even worse than majority voting when the degrees of the assignment graph (corresponding to the number of annotators per task) are small. [sent-87, score-0.287]

31 By standard Bayesian arguments, one can show that the optimal estimator of z to minimize the bit-wise error rate is given by ẑi = arg max_{zi} p(zi|L, θ), where p(zi|L, θ) = ∫_q Σ_{z_{[N]\i}} p(z, q|L, θ) dq. (3) [sent-90, score-0.142]

32 Note that the EM algorithm (1), which maximizes rather than marginalizes qj, is not equivalent to the Bayesian estimator (3), and hence is expected to be suboptimal in terms of error rate. [sent-91, score-0.725]

33 However, calculating the marginal p(zi |L, θ) in (3) requires integrating all q and summing over all the other zi , a challenging computational task. [sent-92, score-0.142]

34 In this work we use belief propagation and mean field to address this problem, and highlight their connections to KOS, majority voting and EM. [sent-93, score-0.236]

35 A key perspective is that we can treat p(z|L, θ) as a discrete Markov random field, and re-interpret the bipartite assignment graph as a factor graph [13], with the tasks mapping to variable nodes and workers to factor nodes. [sent-97, score-0.436]

36 Calculating the beliefs: b^{t+1}_i(zi) ∝ ∏_{j∈Mi} m^{t+1}_{j→i}(zi). (7) At the end of T iterations, the labels are estimated via ẑ^t_i = arg max_{zi} b^t_i(zi). [sent-99, score-0.206]

37 One immediate difference between BP (5)-(7) and the KOS message passing (2) is that the messages and beliefs in (5)-(7) are probability tables on zi, i.e. [sent-100, score-0.212]

38 However, note that ψ(zNj) in (4) depends on zNj only through cj, so that (with a slight abuse of notation) it can be rewritten as ψ(cj, γj). [sent-105, score-0.107]

39 Then, sum-product BP (5)-(7) can be expressed as x^{t+1}_{i→j} = Σ_{j'∈Mi\j} Lij' y^t_{j'→i}, y^{t+1}_{j→i} = log[ Σ_{k=0}^{γj−1} ψ(k+1, γj) e^{t+1}_k / Σ_{k=0}^{γj−1} ψ(k, γj) e^{t+1}_k ], (8) [sent-109, score-0.138]

40 and x̂^{t+1}_i = Σ_{j∈Mi} Lij y^{t+1}_{j→i}, where the terms e^{t+1}_k, for k = 0, ..., γj − 1, are the elementary symmetric polynomials in the variables {exp(Li'j x^{t+1}_{i'→j})}_{i'∈Nj\i}, that is, e^{t+1}_k = Σ_{s : |s|=k} ∏_{i'∈s} exp(Li'j x^{t+1}_{i'→j}). [sent-112, score-0.112]

41 In the end, the true labels are decoded as ẑi = sign[x̂i]. [sent-113, score-0.201]
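A sketch of one worker-to-task message update from (8), given a precomputed ψ table (our own illustration; the table shape, the helper names, and the absence of overflow safeguards are our assumptions):

import numpy as np

def elementary_symmetric(vals):
    # Returns e_0, ..., e_len(vals) of the multiset `vals`, with e_0 = 1.
    e = np.zeros(len(vals) + 1)
    e[0] = 1.0
    for v in vals:
        e[1:] = e[1:] + v * e[:-1]          # e_k <- e_k + v * e_{k-1}
    return e

def y_update(x_in, labels, psi, gamma_j):
    # x_in: messages x_{i'->j} from the gamma_j - 1 other tasks labeled by worker j;
    # labels: the corresponding labels L_{i'j}; psi: array indexed as psi[c, gamma].
    e = elementary_symmetric(np.exp(labels * x_in))       # e_0, ..., e_{gamma_j - 1}
    k = np.arange(gamma_j)
    num = psi[k + 1, gamma_j] @ e
    den = psi[k, gamma_j] @ e
    return np.log(num / den)                              # y_{j->i}^{t+1}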

42 Similar to sum-product, one can also derive a max-product BP to find the joint maximum a posteriori configuration, ẑ = arg max_z p(z|L, θ), which minimizes the block-wise error rate prob[∃i : ẑi ≠ zi] instead of the bit-wise error rate. [sent-115, score-0.284]

43 In this section, we discuss the form of ψ(cj , γj ) for different choices of algorithmic priors, and in particular show that KOS and majority voting can be treated as special cases of our belief propagation (8) with the most “uninformative” and most “informative” algorithmic priors, respectively. [sent-122, score-0.353]

44 For more general priors that may not yield a closed form for ψ(cj , γj ), one can calculate ψ(cj , γj ) by numerical integration and store them in a (γ + 1) × γ table for later use, where γ = maxj∈[M ] γj . [sent-123, score-0.089]

45 If p(qj|θ) ∝ qj^{α−1}(1 − qj)^{β−1}, we have ψ(cj, γj) ∝ B(α + cj, β + γj − cj), where B(·, ·) is the Beta function. [sent-125, score-1.38]

46 If p(qj|θ) has non-zero probability mass on only finite points, that is, prob(qj = q̃k) = pk, k ∈ [K], where 0 ≤ q̃k ≤ 1, 0 ≤ pk ≤ 1 and Σ_k pk = 1, then we have ψ(cj, γj) = Σ_k pk q̃k^{cj}(1 − q̃k)^{γj−cj}. [sent-128, score-0.145]
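A sketch of how these two ψ tables could be tabulated (our own helpers and table convention: a (γ+1)×(γ+1) array indexed as psi[c, g]; the Beta case is left unnormalized, which is harmless because (8) only uses ratios of ψ values sharing the same γj):

import numpy as np
from scipy.special import betaln

def psi_table_beta(gamma_max, alpha, beta):
    # psi[c, g] proportional to B(alpha + c, beta + g - c), for 0 <= c <= g.
    psi = np.zeros((gamma_max + 1, gamma_max + 1))
    for g in range(1, gamma_max + 1):
        c = np.arange(g + 1)
        psi[c, g] = np.exp(betaln(alpha + c, beta + g - c))
    return psi

def psi_table_discrete(gamma_max, q_vals, p_vals):
    # psi[c, g] = sum_k p_k * q_k^c * (1 - q_k)^(g - c) for a finite prior on q_j.
    psi = np.zeros((gamma_max + 1, gamma_max + 1))
    for g in range(1, gamma_max + 1):
        for c in range(g + 1):
            psi[c, g] = sum(p * q**c * (1 - q)**(g - c)
                            for p, q in zip(p_vals, q_vals))
    return psi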

47 The Haldane prior [15] is a special discrete prior that equals either 0 or 1 with equal probability, that is, prob[qj = 0] = prob[qj = 1] = 1/2. [sent-131, score-0.173]

48 The BP update in (8) with the Haldane prior is equivalent to the KOS update in (2). [sent-135, score-0.113]

49 Just substitute the ψ(cj , γj ) of Haldane prior shown above into the BP update (8). [sent-137, score-0.094]

50 The Haldane prior can also be treated as a Beta(ε, ε) prior with ε → 0+, or equivalently an improper prior p(qj) ∝ qj^{−1}(1 − qj)^{−1}, whose normalization constant is infinite. [sent-138, score-1.406]

51 One can show that the Haldane prior is equivalent to putting a flat prior on the log-odds log[qj /(1 − qj )]; also, it has the largest variance (and hence is “most uninformative”) among all the possible distributions of qj . [sent-139, score-1.316]

52 Therefore, although appearing to be extremely dichotomous, it is well known in Bayesian statistics as an uninformative prior of binomial distributions. [sent-140, score-0.095]

53 Other choices of objective priors include the uniform prior Beta(1, 1) and Jeffreys' prior Beta(1/2, 1/2) [16], but these do not yield the same simple linear message passing form as the Haldane prior. [sent-141, score-0.263]

54 Unfortunately, the use of the Haldane prior in our problem suffers from an important symmetry-breaking issue: if the prior is symmetric, i.e. [sent-142, score-0.15]

55 p(qj|θ) = p(1 − qj|θ), then the true marginal posterior distribution of zi is also symmetric, i.e. [sent-144, score-0.618]

56 The mechanism of KOS for breaking the symmetry seems to rely solely on initializing to points that bias towards majority voting, and the hope that the symmetric distribution is an unstable fixed point. [sent-149, score-0.117]

57 In experiments, we find that the use of symmetric priors usually leads to degraded performance when the degree of the assignment graph is low, corresponding to the phase transition phenomenon discussed in Karger et al. [sent-150, score-0.198]

58 This suggests that it is beneficial to use asymmetric priors with E[qj |θ] > 1/2, to incorporate the prior knowledge that the majority of workers are non-adversarial. [sent-152, score-0.568]

59 Interestingly, it turns out that majority voting uses such an asymmetric prior, but unfortunately corresponding to another unrealistic extreme. [sent-153, score-0.197]

60 A deterministic prior is a special discrete distribution that places all of its mass on a single point, i.e. [sent-155, score-0.098]

61 One can show that log ψ in this case is a linear function, that is, log ψ(cj, γj) = cj · logit(q̃) + const. [sent-158, score-0.139]

62 The BP update (8) with deterministic priors satisfying q̃ > 1/2 terminates at the first iteration and finds the same solution as majority voting. [sent-161, score-0.195]

63 Just note that log ψ(cj, γj) = cj · logit(q̃) + const, and logit(q̃) > 0 in this case. [sent-163, score-0.123]
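A quick self-contained numerical check of this linearity claim (our own illustration, with an arbitrary point mass q̃ = 0.8 and γj = 5):

import numpy as np

q_tilde, g = 0.8, 5
c = np.arange(g + 1)
log_psi = c * np.log(q_tilde) + (g - c) * np.log(1 - q_tilde)   # log psi(c, g) for the point mass at q_tilde
slopes = np.diff(log_psi)                                       # constant increments in c
print(np.allclose(slopes, np.log(q_tilde / (1 - q_tilde))))     # True: slope equals logit(q_tilde)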

64 The deterministic priors above have the opposite properties to the Haldane prior: they can also be treated as Beta(α, β) priors, but with α → +∞ and α > β; these priors have the smallest variance (equal to zero) among all the possible qj priors. [sent-164, score-0.776]

65 In this work, we propose to use priors that balance between KOS and majority voting. [sent-165, score-0.176]

66 2 The Two-Coin Models and Further Extensions We previously assumed that workers’ abilities are parametrized by a single parameter qj . [sent-170, score-0.636]

67 Here we consider the more general case, where the ability of worker j is specified by two parameters, the sensitivity sj and specificity tj [2, 4]: sj = prob[Lij = +1|zi = +1], tj = prob[Lij = −1|zi = −1]. [sent-172, score-0.161]

68 A typical prior on sj and tj is a pair of independent Beta distributions. [sent-173, score-0.174]
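A sketch of the per-task likelihood under this two-coin model (our own helper; it would replace the single-parameter qj likelihood in the E-step of the EM sketch above):

import numpy as np

def two_coin_log_likelihoods(labels_i, s, t):
    # labels_i: the labels given to task i by its workers (entries in {-1, +1});
    # s, t: those workers' sensitivities and specificities, in the same order.
    log_pos = np.where(labels_i == 1, np.log(s), np.log(1 - s)).sum()   # log p(labels | z_i = +1)
    log_neg = np.where(labels_i == -1, np.log(t), np.log(1 - t)).sum()  # log p(labels | z_i = -1)
    return log_pos, log_neg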

69 One may also try to derive a two-coin version of KOS, by assigning two independent Haldane priors on sj and tj ; it turns out that the extended version is exactly the same as the standard KOS in (2). [sent-177, score-0.157]

70 Assuming the algorithmic prior of Beta(α, β), one crucial property of the KL objective in this case is that the optimal {νj*(qj)} is guaranteed to be a Beta distribution as well. [sent-189, score-0.126]

71 For example, note that the q̂j in the M-step of EM could be updated to 0 or 1 if α = 1 or β = 1, and once this happens, q̂j is locked at its current value, causing EM to be trapped at a local maximum. [sent-195, score-1.166]

72 Update (11) can prevent q̄j from becoming 0 or 1, avoiding the degenerate case. [sent-196, score-0.583]

73 One can of course interpret (11) as EM with prior parameters α′ = α + 1 and β′ = β + 1; under this interpretation, it is advisable to choose priors α > 1 and β > 1 (corresponding to a less common or intuitive "informative" prior). [sent-197, score-0.164]
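Based on that interpretation (and only on it — the exact form of update (11) is not reproduced here), a hedged sketch of the resulting smoothed worker update, which stays strictly inside (0, 1) whenever α, β > 0:

import numpy as np

def smoothed_worker_update(mu, L, j, alpha=2.0, beta=1.0):
    # EM M-step run with (alpha + 1, beta + 1), i.e. q_j = (sum_i mu_i(L_ij) + alpha) / (|N_j| + alpha + beta).
    idx = L[:, j] != 0
    mu_at_label = np.where(L[idx, j] == 1, mu[idx], 1 - mu[idx])
    return (mu_at_label.sum() + alpha) / (idx.sum() + alpha + beta)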

74 We should point out that it is widely known that EM can be interpreted as a coordinate descent on variational objectives [18, 11]; our derivation differs in that we marginalize, instead of maximize, over qj . [sent-198, score-0.606]

75 On the other hand, discrete priors do not seem to lead to interesting algorithms in this case. [sent-201, score-0.089]

76 We implement majority voting (MV), KOS in (2), BP in (8), EM in (1) and AMF in (11).

77 Figure 1: The performance of the algorithms (log-error) as the degrees of the assignment graph vary; the left degree ℓ denotes the number of workers per task, and the right degree γ denotes the number of tasks per worker. [sent-211, score-0.396]

78 Figure 2 panels: (b) Beta prior (fixed α + β = 1), (c) spammer-hammer prior, (d) adversary-spammer-hammer prior (x-axis: percentage of adversaries).

79 Figure 2: The performance on data generated with different qj priors on (9,9)-regular random graphs. [sent-235, score-0.189]

80 We initialize BP (including KOS) with y^0_{j→i} = 1 and EM with µi(zi) = Σ_{j∈Mi} I[Lij = zi]/|Mi|, both of which reduce to majority voting at the 0-th iteration; for KOS, we also implemented another version that exactly follows the setting of Karger et al. [sent-248, score-0.316]

81 For EM with algorithmic prior Beta(α, β), we add a small constant (0. [sent-250, score-0.126]

82 We generate simulated data by drawing the abilities qj from Beta priors or the adversary-spammer-hammer priors, which equal 0.1, 0.5 and 0.9 with certain probabilities. [sent-256, score-0.773]

83 The assignment graphs are randomly drawn from the set of (ℓ, γ)-regular bipartite graphs with 1000 task nodes using the configuration method [20]. [sent-259, score-0.136]
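A sketch of this setup (our own illustration; the mixture weights are placeholders, and the assignment graph here is a simple random assignment rather than the exact configuration-method construction of [20]):

import numpy as np

def simulate(N=1000, left_deg=9, right_deg=9, probs=(0.1, 0.2, 0.7), seed=0):
    # Adversary-spammer-hammer prior: q_j in {0.1, 0.5, 0.9} with the given probabilities.
    rng = np.random.default_rng(seed)
    M = N * left_deg // right_deg                       # number of workers
    q = rng.choice([0.1, 0.5, 0.9], size=M, p=probs)
    z = rng.choice([-1, 1], size=N)                     # true labels
    L = np.zeros((N, M), dtype=int)
    for i in range(N):
        workers = rng.choice(M, size=left_deg, replace=False)   # left degree = workers per task
        correct = rng.random(left_deg) < q[workers]
        L[i, workers] = np.where(correct, z[i], -z[i])
    return L, z, q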

84 [1] that assumes the true qj are known, as well as a BP equipped with an algorithmic prior equal to the true prior used to generate the data, which sets a tighter (perhaps approximate) “Bayesian oracle” lower bound for all the algorithms that do not know qj . [sent-261, score-1.403]

85 We find that BP and AMF with a typical asymmetric prior Beta(2, 1) perform mostly as well as the “Bayesian oracle” bound, eliminating the necessity to search for more accurate algorithmic priors. [sent-262, score-0.179]

86 Perhaps surprisingly, we find that the BP, EM and AMF with the asymmetric algorithmic prior beta(2, 1) scale similarly to KOS, which has been theoretically shown to be order-optimal compared to the oracle lower bound. [sent-268, score-0.178]

87 On the other hand, BP with symmetric algorithmic priors, such as the Haldane prior Beta(0+, 0+) of KOS and the uniform prior Beta(1, 1), often results in degraded performance when the degrees of the assignment graph are low. [sent-269, score-0.247]

88 Indeed, BP with symmetric algorithmic priors often fails to converge in the low-degree setting. [sent-291, score-0.17]

89 Fig. 2 shows the performance of the algorithms when varying the true priors of the data. [sent-293, score-0.107]

90 Fig. 2(b) and (d) show that the performance of EM with Beta(2, 1) tends to degrade when the fraction of adversaries increases, probably because q̂j is more likely to be incorrectly updated to and stuck on 0 or 1 in these cases; see the discussion in Section 3. [sent-295, score-0.675]

91 Our implementation of EM (on both simulated data and the real data below) seems to perform better than previously reported results, probably due to our careful choice of the prior and initialization. [sent-298, score-0.121]

92 The symmetric assumption of qj = sj = tj is likely to be violated in practice, especially on vision datasets where a human’s perception decides on whether some object is present. [sent-301, score-0.681]

93 Therefore we also implemented the two-coin version of BP and AMF(EM) with the algorithmic priors of sj and tj taken as two independent Beta(2, 1) (referred to as BP-Beta2 (2,1) and similar). [sent-302, score-0.208]

94 [6], including 108 tasks and 39 workers on a fully connected bipartite assignment graph, where the workers are asked whether the presented images contain Indigo Bunting or Blue GrosBeak. [sent-304, score-0.696]

95 This suggests that the symmetric assumption is badly violated on this dataset, probably caused by the non-expert workers with high sensitivities but low specificities, having trouble identifying Indigo Bunting from Blue GrosBeak. [sent-309, score-0.345]

96 We then tested on two natural language processing datasets in [21], the rte dataset with 800 tasks and 164 workers, and the temp dataset with 462 tasks and 76 workers. [sent-310, score-0.121]

97 The KOS algorithm does surprisingly poorly, probably due to the assignment graphs not having locally tree-like structures. [sent-313, score-0.111]

98 5 Conclusion We have presented a spectrum of inference algorithms, in particular a novel and efficient BP algorithm, for crowdsourcing problems and clarified their connections to existing methods. [sent-314, score-0.096]

99 Maximum likelihood estimation of observer error-rates using the em algorithm. [sent-338, score-0.124]

100 Eliminating spammers and ranking annotators for crowdsourced labeling tasks. [sent-376, score-0.174]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qj', 0.583), ('kos', 0.416), ('workers', 0.294), ('bp', 0.242), ('beta', 0.189), ('lij', 0.16), ('amf', 0.15), ('zi', 0.142), ('prob', 0.141), ('haldane', 0.139), ('em', 0.124), ('cj', 0.107), ('znj', 0.102), ('karger', 0.094), ('mf', 0.093), ('priors', 0.089), ('majority', 0.087), ('voting', 0.087), ('spammers', 0.081), ('crowdsourcing', 0.08), ('mi', 0.075), ('prior', 0.075), ('adversaries', 0.071), ('nj', 0.064), ('worker', 0.056), ('yj', 0.056), ('abilities', 0.053), ('mv', 0.052), ('algorithmic', 0.051), ('annotators', 0.05), ('assignment', 0.046), ('welinder', 0.042), ('labels', 0.041), ('tasks', 0.039), ('tj', 0.037), ('turk', 0.034), ('ek', 0.033), ('bj', 0.032), ('propagation', 0.032), ('aj', 0.032), ('sj', 0.031), ('belief', 0.03), ('symmetric', 0.03), ('eliminating', 0.03), ('oracle', 0.029), ('mechanical', 0.028), ('mt', 0.028), ('amazon', 0.027), ('messages', 0.027), ('eld', 0.027), ('simulated', 0.025), ('graphs', 0.024), ('labeling', 0.024), ('message', 0.024), ('equals', 0.023), ('logit', 0.023), ('variational', 0.023), ('asymmetric', 0.023), ('bluebird', 0.023), ('bunting', 0.023), ('indigo', 0.023), ('maxzi', 0.023), ('rte', 0.023), ('bipartite', 0.023), ('complicated', 0.023), ('pk', 0.022), ('ij', 0.021), ('probably', 0.021), ('answers', 0.021), ('temp', 0.02), ('ics', 0.02), ('surprisingly', 0.02), ('uninformative', 0.02), ('task', 0.019), ('appendix', 0.019), ('beliefs', 0.019), ('update', 0.019), ('qualities', 0.019), ('eb', 0.019), ('crowdsourced', 0.019), ('dawid', 0.019), ('raykar', 0.019), ('qk', 0.019), ('true', 0.018), ('graphical', 0.017), ('graph', 0.017), ('generative', 0.017), ('ihler', 0.017), ('digamma', 0.017), ('zj', 0.017), ('ln', 0.016), ('degraded', 0.016), ('inference', 0.016), ('log', 0.016), ('perhaps', 0.016), ('xt', 0.016), ('unreliable', 0.015), ('bayesian', 0.015), ('treated', 0.015), ('optimality', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 359 nips-2012-Variational Inference for Crowdsourcing

Author: Qiang Liu, Jian Peng, Alex Ihler

Abstract: Crowdsourcing has become a popular paradigm for labeling large datasets. However, it has given rise to the computational task of aggregating the crowdsourced labels provided by a collection of unreliable annotators. We approach this problem by transforming it into a standard inference problem in graphical models, and applying approximate variational methods, including belief propagation (BP) and mean field (MF). We show that our BP algorithm generalizes both majority voting and a recent algorithm by Karger et al. [1], while our MF method is closely related to a commonly used EM algorithm. In both cases, we find that the performance of the algorithms critically depends on the choice of a prior distribution on the workers’ reliability; by choosing the prior properly, both BP and MF (and EM) perform surprisingly well on both simulated and real-world datasets, competitive with state-of-the-art algorithms based on more complicated modeling assumptions. 1

2 0.13762456 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

Author: Andrew Delong, Olga Veksler, Anton Osokin, Yuri Boykov

Abstract: Inference in high-order graphical models has become important in recent years. Several approaches are based, for example, on generalized message-passing, or on transformation to a pairwise model with extra ‘auxiliary’ variables. We focus on a special case where a much more efficient transformation is possible. Instead of adding variables, we transform the original problem into a comparatively small instance of submodular vertex-cover. These vertex-cover instances can then be attacked by existing algorithms (e.g. belief propagation, QPBO), where they often run 4–15 times faster and find better solutions than when applied to the original problem. We evaluate our approach on synthetic data, then we show applications within a fast hierarchical clustering and model-fitting framework. 1

3 0.083567746 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain

Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 1

4 0.083288468 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

Author: Chong Wang, David M. Blei

Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1

5 0.072993644 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy

Author: Dengyong Zhou, Sumit Basu, Yi Mao, John C. Platt

Abstract: An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem. 1

6 0.072979711 280 nips-2012-Proper losses for learning from partial labels

7 0.061169811 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

8 0.061059773 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

9 0.05457107 65 nips-2012-Cardinality Restricted Boltzmann Machines

10 0.054350119 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

11 0.053462453 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

12 0.053191062 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

13 0.051864989 286 nips-2012-Random Utility Theory for Social Choice

14 0.047999699 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

15 0.04755187 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

16 0.047060948 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

17 0.046223141 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

18 0.044955026 167 nips-2012-Kernel Hyperalignment

19 0.043569595 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

20 0.043406464 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.121), (1, 0.02), (2, 0.01), (3, -0.006), (4, -0.075), (5, -0.035), (6, 0.006), (7, -0.008), (8, 0.008), (9, -0.008), (10, 0.003), (11, 0.007), (12, 0.012), (13, 0.016), (14, 0.044), (15, 0.024), (16, -0.018), (17, -0.068), (18, -0.028), (19, 0.038), (20, 0.021), (21, -0.032), (22, 0.003), (23, 0.082), (24, 0.065), (25, 0.085), (26, 0.05), (27, -0.032), (28, 0.021), (29, 0.025), (30, -0.037), (31, -0.014), (32, 0.025), (33, -0.023), (34, -0.038), (35, 0.038), (36, -0.015), (37, -0.096), (38, -0.092), (39, 0.062), (40, -0.044), (41, -0.082), (42, -0.103), (43, -0.004), (44, 0.029), (45, -0.037), (46, -0.09), (47, 0.043), (48, 0.038), (49, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90445465 359 nips-2012-Variational Inference for Crowdsourcing

Author: Qiang Liu, Jian Peng, Alex Ihler

Abstract: Crowdsourcing has become a popular paradigm for labeling large datasets. However, it has given rise to the computational task of aggregating the crowdsourced labels provided by a collection of unreliable annotators. We approach this problem by transforming it into a standard inference problem in graphical models, and applying approximate variational methods, including belief propagation (BP) and mean field (MF). We show that our BP algorithm generalizes both majority voting and a recent algorithm by Karger et al. [1], while our MF method is closely related to a commonly used EM algorithm. In both cases, we find that the performance of the algorithms critically depends on the choice of a prior distribution on the workers’ reliability; by choosing the prior properly, both BP and MF (and EM) perform surprisingly well on both simulated and real-world datasets, competitive with state-of-the-art algorithms based on more complicated modeling assumptions. 1

2 0.64810407 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

Author: Andrew Delong, Olga Veksler, Anton Osokin, Yuri Boykov

Abstract: Inference in high-order graphical models has become important in recent years. Several approaches are based, for example, on generalized message-passing, or on transformation to a pairwise model with extra ‘auxiliary’ variables. We focus on a special case where a much more efficient transformation is possible. Instead of adding variables, we transform the original problem into a comparatively small instance of submodular vertex-cover. These vertex-cover instances can then be attacked by existing algorithms (e.g. belief propagation, QPBO), where they often run 4–15 times faster and find better solutions than when applied to the original problem. We evaluate our approach on synthetic data, then we show applications within a fast hierarchical clustering and model-fitting framework. 1

3 0.55436337 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes

Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1

4 0.49391273 206 nips-2012-Majorization for CRFs and Latent Likelihoods

Author: Tony Jebara, Anna Choromanska

Abstract: The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods.

5 0.49025622 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

Author: Chong Wang, David M. Blei

Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1

6 0.48382908 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

7 0.4713577 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

8 0.46510932 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

9 0.46081492 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

10 0.4561649 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

11 0.44290936 37 nips-2012-Affine Independent Variational Inference

12 0.43559942 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

13 0.42579982 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

14 0.42564756 294 nips-2012-Repulsive Mixtures

15 0.41687989 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

16 0.41389525 65 nips-2012-Cardinality Restricted Boltzmann Machines

17 0.41373238 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

18 0.41326925 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

19 0.41193539 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation

20 0.40553543 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.047), (21, 0.023), (38, 0.12), (39, 0.012), (42, 0.019), (53, 0.356), (54, 0.02), (55, 0.01), (74, 0.057), (76, 0.111), (80, 0.105), (92, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82630235 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications

Author: Rishabh Iyer, Jeff A. Bilmes

Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lovász extension of a submodular function, which we call the Lovász-Bregman divergence, is a continuous extension of a submodular Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lovász-Bregman divergence is natural in clustering scenarios where ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient unlike other order based distance measures. 1

same-paper 2 0.80326229 359 nips-2012-Variational Inference for Crowdsourcing

Author: Qiang Liu, Jian Peng, Alex Ihler

Abstract: Crowdsourcing has become a popular paradigm for labeling large datasets. However, it has given rise to the computational task of aggregating the crowdsourced labels provided by a collection of unreliable annotators. We approach this problem by transforming it into a standard inference problem in graphical models, and applying approximate variational methods, including belief propagation (BP) and mean field (MF). We show that our BP algorithm generalizes both majority voting and a recent algorithm by Karger et al. [1], while our MF method is closely related to a commonly used EM algorithm. In both cases, we find that the performance of the algorithms critically depends on the choice of a prior distribution on the workers’ reliability; by choosing the prior properly, both BP and MF (and EM) perform surprisingly well on both simulated and real-world datasets, competitive with state-of-the-art algorithms based on more complicated modeling assumptions. 1

3 0.80007368 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella

Abstract: We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of an initial sign assignment consistent with a two-clustering of the nodes. We provide a theoretical analysis within this model, showing that we can achieve an optimal (to within a constant factor) number of mistakes on any graph G = (V, E) such that |E| = Ω(|V|^{3/2}) by querying O(|V|^{3/2}) edge labels. More generally, we show an algorithm that achieves optimality to within a factor of O(k) by querying at most order of |V| + (|V|/k)^{3/2} edge labels. The running time of this algorithm is at most of order |E| + |V| log |V|. 1

4 0.73346865 313 nips-2012-Sketch-Based Linear Value Function Approximation

Author: Marc Bellemare, Joel Veness, Michael Bowling

Abstract: Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing. 1

5 0.73108929 12 nips-2012-A Neural Autoregressive Topic Model

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

6 0.67221469 30 nips-2012-Accuracy at the Top

7 0.64612812 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

8 0.54180914 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

9 0.54082614 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

10 0.53156084 160 nips-2012-Imitation Learning by Coaching

11 0.53030735 293 nips-2012-Relax and Randomize : From Value to Algorithms

12 0.52351058 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

13 0.52279168 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

14 0.52052897 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

15 0.51977158 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

16 0.51845032 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

17 0.51713824 298 nips-2012-Scalable Inference of Overlapping Communities

18 0.51212674 260 nips-2012-Online Sum-Product Computation Over Trees

19 0.51028103 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

20 0.50632834 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models