nips nips2010 nips2010-173 knowledge-graph by maker-knowledge-mining

173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

Source: pdf

Author: Wei Wang, Zhi-hua Zhou

Abstract: The sample complexity of active learning under the realizability assumption has been well-studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be O(log 1 ), contrasting to single-view setting where the polynomial improveǫ ment is the best possible achievement. We also prove that in general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is O( 1 ), where the order of 1/ǫ is independent of the parameter in Tsybakov noise, ǫ contrasting to previous polynomial bounds where the order of 1/ǫ is related to the parameter in Tsybakov noise. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 cn Abstract The sample complexity of active learning under the realizability assumption has been well-studied. [sent-4, score-0.5]

2 In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under multi-view setting. [sent-6, score-0.444]

3 We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be O(log 1 ), contrasting to single-view setting where the polynomial improveǫ ment is the best possible achievement. [sent-7, score-0.708]

4 1 Introduction In active learning [10, 13, 16], the learner draws unlabeled data from the unknown distribution deﬁned on the learning task and actively queries some labels from an oracle. [sent-9, score-0.49]

5 In this way, the active learner can achieve good performance with much fewer labels than passive learning. [sent-10, score-0.428]

6 The number of these queried labels, which is necessary and sufﬁcient for obtaining a good leaner, is well-known as the sample complexity of active learning. [sent-11, score-0.455]

7 Many theoretical bounds on the sample complexity of active learning have been derived based on the realizability assumption (i. [sent-12, score-0.526]

8 , there exists a hypothesis perfectly separating the data in the hypothesis class) [4, 5, 11, 12, 14, 16]. [sent-14, score-0.196]

9 Recently, the sample complexity of active learning in the non-realizable case (i. [sent-16, score-0.444]

10 , the data cannot be perfectly separated by any hypothesis in the hypothesis class because of the noise) has been studied [2, 13, 17]. [sent-18, score-0.236]

11 This suggests that perhaps active learning in the non-realizable case is not as efﬁcient as that in the realizable case. [sent-20, score-0.354]

12 To improve the sample complexity of active learning in the non-realizable case remarkably, the model of the noise or some assumptions on the hypothesis class and the data distribution must be considered. [sent-21, score-0.677]

13 Tsybakov noise model [21] is more and more popular in theoretical analysis on the sample complexity of active learning. [sent-22, score-0.518]

14 However, existing result [8] shows that obtaining exponential improvement in the sample complexity of active learning with unbounded Tsybakov noise is hard. [sent-23, score-0.702]

15 Inspired by [23] which proved that multi-view setting [6] can help improve the sample complexity of active learning in the realizable case remarkably, we have an insight that multi-view setting will also help active learning in the non-realizable case. [sent-24, score-0.913]

16 In this paper, we present the ﬁrst analysis on the 1 sample complexity of active learning in the non-realizable case under multi-view setting, where the non-realizability is caused by Tsybakov noise. [sent-25, score-0.46]

17 -We prove that the sample complexity of active learning with Tsybakov noise under multi-view setting can be improved to O(log 1 ) when the learner satisﬁes non-degradation condition. [sent-27, score-0.61]

18 1 This ǫ exponential improvement holds no matter whether Tsybakov noise is bounded or not, contrasting to single-view setting where the polynomial improvement is the best possible achievement for active learning with unbounded Tsybakov noise. [sent-28, score-0.782]

19 -We also prove that, when non-degradation condition does not hold, the sample complexity of active learning with unbounded Tsybakov noise under multi-view setting is O( 1 ), where the order of ǫ 1/ǫ is independent of the parameter in Tsybakov noise, i. [sent-29, score-0.761]

20 , the sample complexity is always O( 1 ) ǫ no matter how large the unbounded Tsybakov noise is. [sent-31, score-0.362]

21 While in previous polynomial bounds, the order of 1/ǫ is related to the parameter in Tsybakov noise and is larger than 1 when unbounded Tsybakov noise is larger than some degree (see Section 2). [sent-32, score-0.45]

22 This discloses that, when non-degradation condition does not hold, multi-view setting is still able to lead to a faster convergence rate and our polynomial improvement in the sample complexity is better than previous polynomial bounds when unbounded Tsybakov noise is large. [sent-33, score-0.626]

23 We analyze the sample complexity of active learning with Tsybakov noise under multi-view setting with and without the non-degradation condition in Section 5 and Section 6, respectively. [sent-36, score-0.613]

24 [2] proposed the agnostic active learning algorithm 2 A2 and proved that its sample complexity is O( η2 ). [sent-40, score-0.536]

25 2 Hoping to get tighter bound on the sample ǫ complexity of the algorithm A2 , Hanneke [17] deﬁned the disagreement coefﬁcient θ, which depends on the hypothesis class and the data distribution, and proved that the sample complexity of the 2 algorithm A2 is O(θ2 η2 ). [sent-41, score-0.473]

26 [13] developed a general agnostic active learning ǫ 2 algorithm which extends the scheme in [10] and proved that its sample complexity is O(θ η2 ). [sent-43, score-0.536]

27 ǫ Recently, the popular Tsybakov noise model [21] was considered in theoretical analysis on active learning and there have been some bounds on the sample complexity. [sent-44, score-0.484]

28 For some simple cases, where Tsybakov noise is bounded, it has been proved that the exponential improvement in the sample complexity is possible [4, 7, 18]. [sent-45, score-0.3]

29 As for the situation where Tsybakov noise is unbounded, only polynomial improvement in the sample complexity has been obtained. [sent-46, score-0.342]

30 [4] assumed that the samples are drawn uniformly from the the unit ball in Rd and proved that the sample 2 complexity of active learning with unbounded Tsybakov noise is O ǫ− 1+λ (λ > 0 depends on Tsybakov noise). [sent-48, score-0.71]

31 Castro and Nowak [8] showed that the sample complexity of active learning with unbounded Tsybakov 2µω+d−2ω−1 µω noise is O ǫ− (µ > 1 depends on another form of Tsybakov noise, ω ≥ 1 depends on the H¨ lder smoothness and d is the dimension of the data). [sent-50, score-0.687]

32 [9] assumed that the labels of examples are generated according to a simple linear noise model and indicated that the sample complexity 2(3+λ) of active learning with unbounded Tsybakov noise is O ǫ− (1+λ)(2+λ) . [sent-53, score-0.854]

33 Hanneke [18] proved that the algorithms or variants thereof in [2] and [13] can achieve the polynomial sample complexity 2 O ǫ− 1+λ for active learning with unbounded Tsybakov noise. [sent-54, score-0.659]

34 For active learning with unbounded Tsybakov noise, Castro and Nowak [8] also proved that at least Ω(ǫ−ρ ) labels are requested to learn 1 2 The O notation is used to hide the factor log log( 1 ). [sent-55, score-0.651]

35 This result shows that the polynomial improvement is the best possible achievement for active learning with unbounded Tsybakov noise in single-view setting. [sent-58, score-0.636]

36 Nevertheless, this result is for approximate Tsybakov noise and ǫ the assumption on large smoothness order (or inﬁnite smoothness order) rarely holds for data with high dimension d in practice. [sent-60, score-0.196]

37 Let H1 and H2 be the hypothesis class in each view and suppose that c1 ∈ H1 and c2 ∈ H2 . [sent-65, score-0.179]

38 For any instance x = (x1 , x2 ), the hypothesis hv ∈ Hv (v = 1, 2) makes that hv (xv ) = 1 if xv ∈ Sv and hv (xv ) = 0 otherwise, where Sv is a subset of Xv . [sent-66, score-0.817]

39 In this way, any hypothesis hv ∈ Hv corresponds to a subset Sv of Xv (as for how to combine the hypotheses in the two views, see Section 5). [sent-67, score-0.239]

40 Considering that x1 and x2 denote the same instance x in different views, we overload Sv to denote the instance set {x = (x1 , x2 ) : xv ∈ Sv } ∗ without confusion. [sent-68, score-0.317]

41 Here, we also overload Sv to denote ∗ the instances set {x = (x1 , x2 ) : xv ∈ Sv }. [sent-71, score-0.38]

42 The error rate of a hypothesis Sv under the distribution ∗ D is R(hv ) = R(Sv ) = P r(x1 ,x2 ,y)∈D y = I(xv ∈ Sv ) . [sent-72, score-0.178]

43 ∗ R(Sv ) − R(Sv ) = |2ϕv (xv ) − 1|pxv dxv ∗ Sv ∆Sv ∗ d(Sv , Sv ) (1) Let ηv denote the error rate of the optimal Bayes classiﬁer cv which is also called as the noise rate in the non-realizable case. [sent-74, score-0.238]

44 We will use the following lamma [1] which gives the standard sample complexity for non-realizable learning task. [sent-80, score-0.156]

45 N 1 | I h(xi ) = y i − E(x,y)∈D I h(x) = y | ≤ ǫ i=1 N 4 α-Expansion in the Non-realizable Case Multi-view active learning ﬁrst described in [20] focuses on the contention points (i. [sent-86, score-0.457]

46 , unlabeled instances on which different views predict different labels) and queries some labels of them. [sent-88, score-0.278]

47 It is motivated by that querying the labels of contention points may help at least one of the two views to learn the optimal classiﬁer. [sent-89, score-0.408]

48 end iterate Output: hs and hs − + between S1 and S2 , then P r(S1 ⊕ S2 ) denotes the probability mass on the contentions points. [sent-91, score-0.698]

49 In this paper, we use “∆” when referring the excess error ∗ between Sv and Sv and use “⊕” when referring the difference between the two views S1 and S2 . [sent-93, score-0.182]

50 In order to study multi-view active learning, the properties of contention points should be considered. [sent-94, score-0.436]

51 One basic property is that P r(S1 ⊕ S2 ) should not be too small, otherwise the two views could be exactly the same and two-view setting would degenerate into single-view setting. [sent-95, score-0.11]

52 In multi-view learning, the two views represent the same learning task and generally are consistent with each other, i. [sent-96, score-0.112]

53 , for any instance x = (x1 , x2 ) the labels of x in the two views are the same. [sent-98, score-0.179]

54 The instances agreed by the two views can be denoted as (S1 ∩S2 )∪(S1 ∩S2 ). [sent-102, score-0.192]

55 However, some of these agreed instances may be predicted different label by the optimal classiﬁer S ∗ , i. [sent-103, score-0.117]

56 Intuitively, if the contention points can convey some information about (S1 ∩ S2 − S ∗ ) ∪ (S1 ∩ S2 − S ∗ ), then querying the labels of contention points could help to improve S1 and S2 . [sent-106, score-0.43]

57 P r S1 ⊕ S2 ≥ α P r S1 ∩ S2 − S ∗ + P r S1 ∩ S2 − S ∗ (3) We say that D is α-expanding with respect to hypothesis class H1 × H2 if the above holds for all S1 ∈ H1 ∩ X1 , S2 ∈ H2 ∩ X2 (here we denote by Hv ∩ Xv the set {h ∩ Xv : h ∈ Hv } for v = 1, 2). [sent-109, score-0.196]

58 [3] also gave a deﬁnition of expansion, P r(T1 ⊕ T2 ) ≥ α min P r(T1 ∩ T2 ), P r(T1 ∩ T2 ) , for realizable learning task under the assumptions that the learner in each view is never “conﬁdent but wrong” and the learning algorithm is able to learn from positive data only. [sent-111, score-0.145]

59 In addition, in [3] the instances which are agreed by the two views but are predicted different label by the optimal classiﬁer can be denoted as T1 ∩ T2 . [sent-117, score-0.208]

60 So, it can be found that Deﬁnition 1 and the deﬁnition of expansion in [3] are based on the same intuition that the amount of contention points is no less than a fraction of the amount of instances which are agreed by the two views but are predicted different label by the optimal classiﬁers. [sent-118, score-0.372]

61 5 Multi-view Active Learning with Non-degradation Condition In this section, we ﬁrst consider the multi-view learning in Table 1 and analyze whether multiview setting can help improve the sample complexity of active learning in the non-realizable case remarkably. [sent-119, score-0.523]

62 In this paper, we consider the following two combination schemes, h+ and h− , for binary classiﬁcation: hi (x) = + 1 if hi (x1 ) = hi (x2 ) = 1 1 2 0 otherwise hi (x) = − 4 0 if hi (x1 ) = hi (x2 ) = 0 1 2 1 otherwise (4) 5. [sent-121, score-0.858]

63 1 ∗ ∗ The Situation Where S1 = S2 With (4), the error rate of the combined classiﬁers hi and hi satisfy (5) and (6), respectively. [sent-122, score-0.405]

64 + − i i i i R(hi ) − R(S ∗ ) = R(S1 ∩ S2 ) − R(S ∗ ) ≤ d∆ (S1 ∩ S2 , S ∗ ) + (5) R(hi ) − (6) ∗ − R(S ) = i R(S1 ∪ i S2 ) ∗ − R(S ) ≤ i d∆ (S1 ∪ i S2 , S ∗ ) i Here Sv ⊂ Xv (v = 1, 2) corresponds to the classiﬁer hi ∈ Hv in the i-th round. [sent-123, score-0.143]

65 In each round of v multi-view active learning, labels of some contention points are queried to augment the training data set L and the classiﬁer in each view is then reﬁned. [sent-124, score-0.596]

66 , (7) holds, which implies that the excess error of Sv is no larger than that of Sv in i ⊕ Si . [sent-127, score-0.099]

67 , given an instance xv in πξ , 3−v x3−v will only be in one of these πς ). [sent-138, score-0.296]

68 Suppose the learning algorithm will predict all instances in each cluster with the same label, i. [sent-139, score-0.109]

69 , the hypothesis class Hv consists of the hypotheses which v do not split any cluster. [sent-141, score-0.138]

70 Thus, the cluster πξ can be classiﬁed according to the posterior probability v v P (y = 1|πξ ) and querying the labels of instances in cluster πξ will not inﬂuence the estimation of v the posterior probability for cluster πς (ς = ξ). [sent-142, score-0.285]

71 Here, V = max[V C(H1 ), V C(H2 )] where V C(H) denotes the VC-dimension of the hypothesis −1/λ class H, k = 1+λ , C1 = 2C0 λ(λ + 1)−1−1/λ and C2 = 5α+8 . [sent-147, score-0.138]

72 8 As in each round some contention points are queried and added into the training set, the difference i+1 i+1 i i between the two views is decreasing, i. [sent-152, score-0.289]

73 γi = 5 s From Theorem 1 we know that we only need to request i=0 mi = O(log 1 ) labels to learn hs + ǫ and hs , at least one of which is with error rate no larger than R(S ∗ ) + ǫ with probability at least − 1 − δ. [sent-158, score-1.032]

74 If we choose hs and it happens to satisfy R(hs ) ≤ R(S ∗ ) + ǫ, we can get a classiﬁer whose + + error rate is no larger than R(S ∗ ) + ǫ. [sent-159, score-0.508]

75 To study how to choose between hs and hs , we give − + Deﬁnition 2 at ﬁrst. [sent-161, score-0.658]

76 P r {x : x ∈ S1 ⊕ S2 ∧ y(x) = 0} P r {x : x ∈ S1 ⊕ S2 ∧ y(x) = 1} − P r(S1 ⊕ S2 ) P r(S1 ⊕ S2 ) ≥β (8) (8) implies the difference between the examples belonging to positive class and that belonging to negative class in the contention region of S1 ⊕ S2 . [sent-163, score-0.264]

77 2 log( 4 ) s s Lemma 2 If the multi-view classiﬁers S1 and S2 satisfy β-condition, with the number of β 2 δ s s labels we can decide correctly whether P r {x : x ∈ S1 ⊕ S2 ∧ y(x) = 1} or P r {x : x ∈ s s S1 ⊕ S2 ∧ y(x) = 0} ) is smaller with probability at least 1 − δ. [sent-166, score-0.178]

78 From Theorem 2 we know that we only need to request O(log 1 ) labels to learn a classiﬁer with ǫ error rate no larger than R(S ∗ ) + ǫ with probability at least 1 − δ. [sent-168, score-0.298]

79 Thus, we achieve an exponential improvement in sample complexity of active learning in the non-realizable case under multi-view setting. [sent-169, score-0.475]

80 s s s s P r {x : x ∈ S1 ⊕ S2 ∧ y(x) = 0} P r {x : x ∈ S1 ⊕ S2 ∧ y(x) = 1} − s ⊕ Ss) s ⊕ Ss) P r(S1 P r(S1 2 2 = O(ǫ) (9) If so, we need not to estimate whether R(hs ) or R(hs ) is smaller and Theorem 3 indicates that + − both hs and hs are good approximations of the optimal classiﬁer. [sent-173, score-0.642]

81 2 ∗ ∗ The Situation Where S1 = S2 Although the two views represent the same learning task and generally are consistent with each ∗ ∗ other, sometimes S1 may be not equal to S2 . [sent-178, score-0.112]

82 So, learning a classiﬁer with error rate no larger than R(S1 ∩ S2 ) + ǫ is not ∗ harder than learning a classiﬁer with error rate no larger than R(Sv ) + ǫ. [sent-187, score-0.286]

83 Now we aim at learning ∗ ∗ a classiﬁer with error rate no larger than R(S1 ∩ S2 ) + ǫ. [sent-188, score-0.143]

84 If R(Sv ) ≤ R(S1 ∩ S2 ), we get a classiﬁer with error rate ∗ ∗ no larger than R(S1 ∩ S2 ) + ǫ. [sent-193, score-0.148]

85 Thus, we can neglect the probability mass on the hypothesis whose ∗ ∗ ∗ ∗ ∗ ∗ error rate is less than R(S1 ∩ S2 ) and regard S1 ∩ S2 as the optimal. [sent-194, score-0.21]

86 ∗ ∗ Theorem 4 shows that for the situation where S1 = S2 , by requesting O(log 1 ) labels we can learn ǫ s s ∗ ∗ two classiﬁers h+ and h− , at least one of which is with error rate no larger than R(S1 ∩ S2 ) + ǫ with probability at least 1 − δ. [sent-197, score-0.444]

87 , P r(S1 ⊕S2 ) ≤ ǫ/2, we have Corollary 1 which indicates that the exponential improvement in the sample complexity of active learning with Tsybakov noise is still possible. [sent-203, score-0.57]

88 6 Multi-view Active Learning without Non-degradation Condition Section 5 considers situations when the non-degradation condition holds, there are cases, however, the non-degradation condition (7) does not hold. [sent-206, score-0.11]

89 In this section we focus on the multi-view active learning in Table 2 and give an analysis with the non-degradation condition waived. [sent-207, score-0.38]

90 Firstly, we give ∗ ∗ Theorem 6 for the sample complexity of multi-view active learning in Table 2 when S1 = S2 = S ∗ . [sent-208, score-0.46]

91 In the (i + 1)-th round, we randomly query (2i+1 − 1)mi labels from Qi and add i+1 them into L. [sent-212, score-0.131]

92 s Theorem 6 shows that we can request i=0 2i mi = O( 1 ) labels to learn two classiﬁers hs and hs , + − ǫ at least one of which is with error rate no larger than R(S ∗ ) + ǫ with probability at least 1 − δ. [sent-217, score-1.032]

93 To guarantee the non-degradation condition (7), we only need to query (2i − 1)mi more labels in the i-th round. [sent-218, score-0.186]

94 Theorem 7 shows that, without the non-degradation condition, we need to request O( 1 ) labels to ǫ learn a classiﬁer with error rate no larger than R(S ∗ ) + ǫ with probability at least 1 − δ. [sent-221, score-0.298]

95 Similarly to Theorem 3, we get Theorem 8 which indicates that both hs and hs are good approximations of the optimal classiﬁer. [sent-223, score-0.668]

96 7 Conclusion We present the ﬁrst study on active learning in the non-realizable case under multi-view setting in this paper. [sent-230, score-0.328]

97 We prove that the sample complexity of multi-view active learning with unbounded Tsybakov noise can be improved to O(log 1 ), contrasting to single-view setting where only polynomial ǫ improvement is proved possible with the same noise condition. [sent-231, score-0.968]

98 Linear classiﬁcation and selective sampling under low noise conditions. [sent-297, score-0.111]

99 A bound on the label complexity of agnostic active learning. [sent-349, score-0.438]

100 On multi-view active learning and the combination with semisupervised learning. [sent-379, score-0.309]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sv', 0.6), ('hs', 0.321), ('tsybakov', 0.308), ('xv', 0.296), ('active', 0.288), ('contention', 0.148), ('hi', 0.143), ('hv', 0.141), ('unbounded', 0.132), ('qi', 0.124), ('requesting', 0.111), ('hypothesis', 0.098), ('noise', 0.095), ('views', 0.091), ('labels', 0.088), ('classi', 0.086), ('er', 0.085), ('complexity', 0.081), ('ers', 0.073), ('balcan', 0.065), ('instances', 0.063), ('theorem', 0.059), ('mi', 0.057), ('realizability', 0.056), ('condition', 0.055), ('sample', 0.054), ('agnostic', 0.053), ('contrasting', 0.053), ('castro', 0.045), ('realizable', 0.045), ('rate', 0.045), ('polynomial', 0.044), ('query', 0.043), ('holds', 0.043), ('larger', 0.042), ('class', 0.04), ('satisfy', 0.039), ('proved', 0.039), ('agreed', 0.038), ('hanneke', 0.037), ('situation', 0.037), ('request', 0.037), ('unlabeled', 0.036), ('learner', 0.036), ('least', 0.035), ('error', 0.035), ('corollary', 0.035), ('minh', 0.034), ('queried', 0.032), ('improvement', 0.031), ('generate', 0.03), ('cavallanti', 0.028), ('table', 0.028), ('nition', 0.028), ('querying', 0.027), ('delete', 0.027), ('bounds', 0.026), ('rarely', 0.026), ('get', 0.026), ('bayes', 0.026), ('colt', 0.026), ('cluster', 0.025), ('nanjing', 0.025), ('achievement', 0.025), ('log', 0.025), ('iterate', 0.024), ('hide', 0.023), ('polylog', 0.023), ('dasgupta', 0.023), ('excess', 0.022), ('tv', 0.022), ('view', 0.022), ('overload', 0.021), ('learning', 0.021), ('ing', 0.02), ('multiview', 0.02), ('compose', 0.02), ('lemma', 0.02), ('setting', 0.019), ('ss', 0.019), ('help', 0.019), ('suppose', 0.019), ('cv', 0.018), ('xj', 0.018), ('belonging', 0.018), ('round', 0.018), ('referring', 0.017), ('nowak', 0.017), ('label', 0.016), ('passive', 0.016), ('smoothness', 0.016), ('give', 0.016), ('prove', 0.016), ('caused', 0.016), ('probability', 0.016), ('selective', 0.016), ('blum', 0.016), ('mass', 0.016), ('expansion', 0.016), ('respect', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

Author: Wei Wang, Zhi-hua Zhou

2 0.21764405 27 nips-2010-Agnostic Active Learning Without Constraints

Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu

Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classiﬁcation. 1

3 0.14722428 25 nips-2010-Active Learning by Querying Informative and Representative Examples

Author: Sheng-jun Huang, Rong Jin, Zhi-hua Zhou

Abstract: Most active learning approaches select either informative or representative unlabeled instances to query their labels. Although several active learning algorithms have been proposed to combine the two criteria for query selection, they are usually ad hoc in ﬁnding unlabeled instances that are both informative and representative. We address this challenge by a principled approach, termed Q UIRE, based on the min-max view of active learning. The proposed approach provides a systematic way for measuring and combining the informativeness and representativeness of an instance. Extensive experimental results show that the proposed Q UIRE approach outperforms several state-of -the-art active learning approaches. 1

4 0.12721786 23 nips-2010-Active Instance Sampling via Matrix Partition

Author: Yuhong Guo

Abstract: Recently, batch-mode active learning has attracted a lot of attention. In this paper, we propose a novel batch-mode active learning approach that selects a batch of queries in each iteration by maximizing a natural mutual information criterion between the labeled and unlabeled instances. By employing a Gaussian process framework, this mutual information based instance selection problem can be formulated as a matrix partition problem. Although matrix partition is an NP-hard combinatorial optimization problem, we show that a good local solution can be obtained by exploiting an effective local optimization technique on a relaxed continuous optimization problem. The proposed active learning approach is independent of employed classiﬁcation models. Our empirical studies show this approach can achieve comparable or superior performance to discriminative batch-mode active learning methods. 1

5 0.1099285 22 nips-2010-Active Estimation of F-Measures

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of estimating the Fα -measure of a given model as accurately as possible on a ﬁxed labeling budget. This problem occurs whenever an estimate cannot be obtained from held-out training data; for instance, when data that have been used to train the model are held back for reasons of privacy or do not reﬂect the test distribution. In this case, new test instances have to be drawn and labeled at a cost. An active estimation procedure selects instances according to an instrumental sampling distribution. An analysis of the sources of estimation error leads to an optimal sampling distribution that minimizes estimator variance. We explore conditions under which active estimates of Fα -measures are more accurate than estimates based on instances sampled from the test distribution. 1

6 0.10751467 243 nips-2010-Smoothness, Low Noise and Fast Rates

7 0.10093101 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations

8 0.10066566 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation

9 0.091883987 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning

10 0.091280445 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

11 0.087209448 261 nips-2010-Supervised Clustering

12 0.083148502 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

13 0.070256047 24 nips-2010-Active Learning Applied to Patient-Adaptive Heartbeat Classification

14 0.069734156 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models

15 0.059061669 142 nips-2010-Learning Bounds for Importance Weighting

16 0.056911174 15 nips-2010-A Theory of Multiclass Boosting

17 0.056500915 235 nips-2010-Self-Paced Learning for Latent Variable Models

18 0.056178875 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

19 0.054946136 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

20 0.054137919 175 nips-2010-Multiparty Differential Privacy via Aggregation of Locally Trained Classifiers

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.143), (1, 0.029), (2, 0.117), (3, -0.042), (4, 0.018), (5, 0.114), (6, -0.157), (7, -0.112), (8, 0.021), (9, -0.196), (10, 0.1), (11, -0.013), (12, -0.123), (13, -0.118), (14, -0.053), (15, -0.028), (16, -0.127), (17, 0.037), (18, -0.054), (19, 0.038), (20, -0.017), (21, 0.116), (22, -0.072), (23, 0.027), (24, -0.028), (25, -0.034), (26, -0.036), (27, 0.057), (28, 0.07), (29, -0.032), (30, 0.016), (31, -0.056), (32, -0.061), (33, -0.012), (34, -0.004), (35, -0.062), (36, -0.069), (37, 0.05), (38, 0.002), (39, 0.073), (40, 0.011), (41, -0.05), (42, 0.057), (43, 0.012), (44, -0.065), (45, 0.085), (46, 0.019), (47, 0.001), (48, 0.011), (49, -0.091)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94766903 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

Author: Wei Wang, Zhi-hua Zhou

2 0.80250752 27 nips-2010-Agnostic Active Learning Without Constraints

Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu

3 0.79853803 25 nips-2010-Active Learning by Querying Informative and Representative Examples

Author: Sheng-jun Huang, Rong Jin, Zhi-hua Zhou

4 0.67515606 23 nips-2010-Active Instance Sampling via Matrix Partition

Author: Yuhong Guo

5 0.67363602 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning

Author: Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman

Abstract: We consider the problem of retrieving the database points nearest to a given hyperplane query without exhaustively scanning the database. We propose two hashingbased solutions. Our ﬁrst approach maps the data to two-bit binary keys that are locality-sensitive for the angle between the hyperplane normal and a database point. Our second approach embeds the data into a vector space where the Euclidean norm reﬂects the desired distance between the original points and hyperplane query. Both use hashing to retrieve near points in sub-linear time. Our ﬁrst method’s preprocessing stage is more efﬁcient, while the second has stronger accuracy guarantees. We apply both to pool-based active learning: taking the current hyperplane classiﬁer as a query, our algorithm identiﬁes those points (approximately) satisfying the well-known minimal distance-to-hyperplane selection criterion. We empirically demonstrate our methods’ tradeoffs, and show that they make it practical to perform active selection with millions of unlabeled points. 1

6 0.65042841 22 nips-2010-Active Estimation of F-Measures

7 0.60539168 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation

8 0.56101358 142 nips-2010-Learning Bounds for Importance Weighting

9 0.50131643 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations

10 0.49489033 24 nips-2010-Active Learning Applied to Patient-Adaptive Heartbeat Classification

11 0.4287971 191 nips-2010-On the Theory of Learnining with Privileged Information

12 0.4199819 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

13 0.41265795 62 nips-2010-Discriminative Clustering by Regularized Information Maximization

14 0.40747738 2 nips-2010-A Bayesian Approach to Concept Drift

15 0.39702526 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

16 0.38350895 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

17 0.36280778 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

18 0.35609007 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

19 0.34870327 175 nips-2010-Multiparty Differential Privacy via Aggregation of Locally Trained Classifiers

20 0.3471559 243 nips-2010-Smoothness, Low Noise and Fast Rates

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.036), (27, 0.032), (30, 0.078), (45, 0.174), (50, 0.016), (52, 0.021), (60, 0.049), (77, 0.087), (78, 0.028), (82, 0.016), (90, 0.06), (92, 0.292)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8510989 157 nips-2010-Learning to localise sounds with spiking neural networks

Author: Dan Goodman, Romain Brette

Abstract: To localise the source of a sound, we use location-speciﬁc properties of the signals received at the two ears caused by the asymmetric ﬁltering of the original sound by our head and pinnae, the head-related transfer functions (HRTFs). These HRTFs change throughout an organism’s lifetime, during development for example, and so the required neural circuitry cannot be entirely hardwired. Since HRTFs are not directly accessible from perceptual experience, they can only be inferred from ﬁltered sounds. We present a spiking neural network model of sound localisation based on extracting location-speciﬁc synchrony patterns, and a simple supervised algorithm to learn the mapping between synchrony patterns and locations from a set of example sounds, with no previous knowledge of HRTFs. After learning, our model was able to accurately localise new sounds in both azimuth and elevation, including the difﬁcult task of distinguishing sounds coming from the front and back. Keywords: Auditory Perception & Modeling (Primary); Computational Neural Models, Neuroscience, Supervised Learning (Secondary) 1

same-paper 2 0.77287132 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

Author: Wei Wang, Zhi-hua Zhou

3 0.71624482 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories

Author: George Konidaris, Scott Kuindersma, Roderic Grupen, Andre S. Barreto

Abstract: We introduce CST, an algorithm for constructing skill trees from demonstration trajectories in continuous reinforcement learning domains. CST uses a changepoint detection method to segment each trajectory into a skill chain by detecting a change of appropriate abstraction, or that a segment is too complex to model as a single skill. The skill chains from each trajectory are then merged to form a skill tree. We demonstrate that CST constructs an appropriate skill tree that can be further reﬁned through learning in a challenging continuous domain, and that it can be used to segment demonstration trajectories on a mobile manipulator into chains of skills where each skill is assigned an appropriate abstraction. 1

4 0.71187419 237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling

Author: Bela Frigyik, Maya Gupta, Yihua Chen

Abstract: Although the Dirichlet distribution is widely used, the independence structure of its components limits its accuracy as a model. The proposed shadow Dirichlet distribution manipulates the support in order to model probability mass functions (pmfs) with dependencies or constraints that often arise in real world problems, such as regularized pmfs, monotonic pmfs, and pmfs with bounded variation. We describe some properties of this new class of distributions, provide maximum entropy constructions, give an expectation-maximization method for estimating the mean parameter, and illustrate with real data. 1 Modeling Probabilities for Machine Learning Modeling probability mass functions (pmfs) as random is useful in solving many real-world problems. A common random model for pmfs is the Dirichlet distribution [1]. The Dirichlet is conjugate to the multinomial and hence mathematically convenient for Bayesian inference, and the number of parameters is conveniently linear in the size of the sample space. However, the Dirichlet is a distribution over the entire probability simplex, and for many problems this is simply the wrong domain if there is application-speciﬁc prior knowledge that the pmfs come from a restricted subset of the simplex. For example, in natural language modeling, it is common to regularize a pmf over n-grams by some generic language model distribution q0 , that is, the pmf to be modeled is assumed to have the form θ = λq + (1 − λ)q0 for some q in the simplex, λ ∈ (0, 1) and a ﬁxed generic model q0 [2]. But once q0 and λ are ﬁxed, the pmf θ can only come from a subset of the simplex. Another natural language processing example is modeling the probability of keywords in a dictionary where some words are related, such as espresso and latte, and evidence for the one is to some extent evidence for the other. This relationship can be captured with a bounded variation model that would constrain the modeled probability of espresso to be within some of the modeled probability of latte. We show that such bounds on the variation between pmf components also restrict the domain of the pmf to a subset of the simplex. As a third example of restricting the domain, the similarity discriminant analysis classiﬁer estimates class-conditional pmfs that are constrained to be monotonically increasing over an ordered sample space of discrete similarity values [3]. In this paper we propose a simple variant of the Dirichlet whose support is a subset of the simplex, explore its properties, and show how to learn the model from data. We ﬁrst discuss the alternative solution of renormalizing the Dirichlet over the desired subset of the simplex, and other related work. Then we propose the shadow Dirichlet distribution; explain how to construct a shadow Dirichlet for three types of restricted domains: the regularized pmf case, bounded variation between pmf components, and monotonic pmfs; and discuss the most general case. We show how to use the expectation-maximization (EM) algorithm to estimate the shadow Dirichlet parameter α, and present simulation results for the estimation. 1 Dirichlet Shadow Dirichlet Renormalized Dirichlet Figure 1: Dirichlet, shadow Dirichlet, and renormalized Dirichlet for α = [3.94 2.25 2.81]. 2 Related Work One solution to modeling pmfs on only a subset of the simplex is to simply restrict the support of ˜ ˜ the Dirichlet to the desired support S, and renormalize the Dirichlet over S (see Fig. 1 for an example). This renormalized Dirichlet has the advantage that it is still a conjugate distribution for the multinomial. Nallapati et al.considered the renormalized Dirichlet for language modeling, but found it difﬁcult to use because the density requires numerical integration to compute the normalizer [4] . In addition, there is no closed form solution for the mean, covariance, or peak of the renormalized Dirichlet, making it difﬁcult to work with. Table 1 summarizes these properties. Additionally, generating samples from the renormalized Dirichlet is inefﬁcient: one draws samples from the stan˜ dard Dirichlet, then rejects realizations that are outside S. For high-dimensional sample spaces, this could greatly increase the time to generate samples. Although the Dirichlet is a classic and popular distribution on the simplex, Aitchison warns it “is totally inadequate for the description of the variability of compositional data,” because of its “implied independence structure and so the Dirichlet class is unlikely to be of any great use for describing compositions whose components have even weak forms of dependence” [5]. Aitchison instead championed a logistic normal distribution with more parameters to control covariance between components. A number of variants of the Dirichlet that can capture more dependence have been proposed and analyzed. For example, the scaled Dirichlet enables a more ﬂexible shape for the distribution [5], but does not change the support. The original Dirichlet(α1 , α2 , . . . αd ) can be derived as Yj / j Yj where Yj ∼ Γ(αj , β), whereas the scaled Dirichlet is derived from Yj ∼ Γ(αj , βj ), resulting in α density p(θ) = γ j ( α −1 βj j θ j j βi θi )α1 +···+αd i , where β, α ∈ Rd are parameters, and γ is the normalizer. + Another variant is the generalized Dirichlet [6] which also has parameters β, α ∈ Rd , and allows + greater control of the covariance structure, again without changing the support. As perhaps ﬁrst noted by Karl Pearson [7] and expounded upon by Aitchison [5], correlations of proportional data can be very misleading. Many Dirichlet variants have been generalizations of the Connor-Mossiman variant, Dirichlet process variants, other compound Dirichlet models, and hierarchical Dirichlet models. Ongaro et al. [8] propose the ﬂexible Dirichlet distribution by forming a re-parameterized mixture of Dirichlet distributions. Rayens and Srinivasan [9] considered the dependence structure for the general Dirichlet family called the generalized Liouville distributions. In contrast to prior efforts, the shadow Dirichlet manipulates the support to achieve various kinds of dependence that arise frequently in machine learning problems. 3 Shadow Dirichlet Distribution We introduce a new distribution that we call the shadow Dirichlet distribution. Let S be the prob˜ ability (d − 1)-simplex, and let Θ ∈ S be a random pmf drawn from a Dirichlet distribution with density pD and unnormalized parameter α ∈ Rd . Then we say the random pmf Θ ∈ S is distributed + ˜ according to a shadow Dirichlet distribution if Θ = M Θ for some ﬁxed d × d left-stochastic (that ˜ is, each column of M sums to 1) full-rank (and hence invertible) matrix M , and we call Θ the gen2 erating Dirichlet of Θ, or Θ’s Dirichlet shadow. Because M is a left-stochastic linear map between ﬁnite-dimensional spaces, it is a continuous map from the convex and compact S to a convex and compact subset of S that we denote SM . The shadow Dirichlet has two parameters: the generating Dirichlet’s parameter α ∈ Rd , and the + d × d matrix M . Both α and M can be estimated from data. However, as we show in the following subsections, the matrix M can be proﬁtably used as a design parameter that is chosen based on application-speciﬁc knowledge or side-information to specify the restricted domain SM , and in that way impose dependency between the components of the random pmfs. The shadow Dirichlet density p(θ) is the normalized pushforward of the Dirichlet density, that is, it is the composition of the Dirichlet density and M −1 with the Jacobian: 1 α −1 p(θ) = (M −1 θ)j j , (1) B(α) |det(M )| j Γ(αj ) d j is the standard Dirichlet normalizer, and α0 = j=1 αj is the standard where B(α) Γ(α0 ) Dirichlet precision factor. Table 1 summarizes the basic properties of the shadow Dirichlet. Fig. 1 shows an example shadow Dirichlet distribution. Generating samples from the shadow Dirichlet is trivial: generate samples from its generating Dirichlet (for example, using stick-breaking or urn-drawing) and multiply each sample by M to create the corresponding shadow Dirichlet sample. Table 1: Table compares and summarizes the Dirichlet, renormalized Dirichlet, and shadow Dirichlet distributions. Dirichlet(α) Density p(θ) Mean 1 B(α) d j=1 α −1 θj j Shadow Dirichlet (α, M ) 1 B(α)|det(M )| α α0 Renormalized ˜ Dirichlet (α, S) d −1 αj −1 θ)j j=1 (M 1 ˜ S M d j=1 α α0 αj −1 qj dq d j=1 α −1 θj j ˜ θp(θ)dθ S ¯ ¯ − θ)(θ − θ)T p(θ)dθ Cov(Θ) M Cov(Θ)M T αj −1 α0 −d j M α0 −d max p(θ) stick-breaking, urn-drawing draw from Dirichlet(α), multiply by M draw from Dirichlet(α), ˜ reject if not in S ML Estimate iterative (simple functions) iterative (simple functions) unknown complexity ML Compound Estimate iterative (simple functions) iterative (numerical integration) unknown complexity Covariance Mode (if α > 1) How to Sample 3.1 α −1 ˜ (θ S ˜ θ∈S Example: Regularized Pmfs The shadow Dirichlet can be designed to specify a distribution over a set of regularized pmfs SM = ˜ ˘ ˜ ˘ ˘ {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for speciﬁc values of λ and θ. In general, for a given λ and θ ∈ S, the following d × d matrix M will change the support to the desired subset SM by mapping the extreme points of S to the extreme points of SM : ˘ M = (1 − λ)θ1T + λI, (2) where I is the d × d identity matrix. In Section 4 we show that the M given in (2) is optimal in a maximum entropy sense. 3 3.2 Example: Bounded Variation Pmfs We describe how to use the shadow Dirichlet to model a random pmf that has bounded variation such that |θk − θl | ≤ k,l for any k, ∈ {1, 2, . . . , d} and k,l > 0. To construct speciﬁed bounds on the variation, we ﬁrst analyze the variation for a given M . For any d × d left stochastic matrix T d d ˜ ˜ ˜ M, θ = Mθ = M1j θj . . . Mdj θj , so the difference between any two entries is j=1 j=1 ˜ |Mkj − Mlj | θj . ˜ (Mkj − Mlj )θj ≤ |θk − θl | = (3) j j Thus, to obtain a distribution over pmfs with bounded |θk − θ | ≤ k,l for any k, components, it is sufﬁcient to choose components of the matrix M such that |Mkj − Mlj | ≤ k,l for all j = 1, . . . , d ˜ because θ in (3) sums to 1. One way to create such an M is using the regularization strategy described in Section 3.1. For this ˜ ˜ ˘ case, the jth component of θ is θj = M θ = λθj + (1 − λ)θj , and thus the variation between the j ith and jth component of any pmf in SM is: ˜ ˘ ˜ ˘ ˜ ˜ ˘ ˘ |θi − θj | = λθi + (1 − λ)θi − λθj − (1 − λ)θj ≤ λ θi − θj + (1 − λ) θi − θj ˘ ˘ ≤ λ + (1 − λ) max θi − θj . (4) i,j ˘ Thus by choosing an appropriate λ and regularizing pmf θ, one can impose the bounded variation ˘ to be the uniform pmf, and choose any λ ∈ (0, 1), then the matrix given by (4). For example, set θ M given by (2) will guarantee that the difference between any two entries of any pmf drawn from the shadow Dirichlet (M, α) will be less than or equal to λ. 3.3 Example: Monotonic Pmfs For pmfs over ordered components, it may be desirable to restrict the support of the random pmf distribution to only monotonically increasing pmfs (or to only monotonically decreasing pmfs). A d × d left-stochastic matrix M that will result in a shadow Dirichlet that generates only monotonically increasing d × 1 pmfs has kth column [0 . . . 0 1/(d − k + 1) . . . 1/(d − k + 1)]T , we call this the monotonic M . It is easy to see that with this M only monotonic θ’s can be produced, 1˜ 1˜ 1 ˜ because θ1 = d θ1 which is less than or equal to θ2 = d θ1 + d−1 θ2 and so on. In Section 4 we show that the monotonic M is optimal in a maximum entropy sense. Note that to provide support over both monotonically increasing and decreasing pmfs with one distribution is not achievable with a shadow Dirichlet, but could be achieved by a mixture of two shadow Dirichlets. 3.4 What Restricted Subsets are Possible? Above we have described solutions to construct M for three kinds of dependence that arise in machine learning applications. Here we consider the more general question: What subsets of the simplex can be the support of the shadow Dirichlet, and how to design a shadow Dirichlet for a particular support? For any matrix M , by the Krein-Milman theorem [10], SM = M S is the convex hull of its extreme points. If M is injective, the extreme points of SM are easy to specify, as a d × d matrix M will have d extreme points that occur for the d choices of θ that have only one nonzero component, as the rest of the θ will create a non-trivial convex combination of the columns of M , and therefore cannot result in extreme points of SM by deﬁnition. That is, the extreme points of SM are the d columns of M , and one can design any SM with d extreme points by setting the columns of M to be those extreme pmfs. However, if one wants the new support to be a polytope in the probability (d − 1)-simplex with m > d extreme points, then one must use a fat M with d × m entries. Let S m denote the probability 4 (m − 1)-simplex, then the domain of the shadow Dirichlet will be M S m , which is the convex hull of the m columns of M and forms a convex polytope in S with at most m vertices. In this case M cannot be injective, and hence it is not bijective between S m and M S m . However, a density on M S m can be deﬁned as: p(θ) = 1 B(α) ˜ {θ ˜α −1 ˜ θj j dθ. ˜ M θ=θ} j (5) On the other hand, if one wants the support to be a low-dimensional polytope subset of a higherdimensional probability simplex, then a thin d × m matrix M , where m < d, can be used to implement this. If M is injective, then it has a left inverse M ∗ that is a matrix of dimension m × d, and the normalized pushforward of the original density can be used as a density on the image M S m : p(θ) = 1 α −1 1/2 B(α) |det(M T M )| (M ∗ θ)j j , j If M is not injective then one way to determine a density is to use (5). 4 Information-theoretic Properties In this section we note two information-theoretic properties of the shadow Dirichlet. Let Θ be drawn ˜ from shadow Dirichlet density pM , and let its generating Dirichlet Θ be drawn from pD . Then the differential entropy of the shadow Dirichlet is h(pM ) = log |det(M )| + h(pD ), where h(pD ) is the differential entropy of its generating Dirichlet. In fact, the shadow Dirichlet always has less entropy than its Dirichlet shadow because log |det(M )| ≤ 0, which can be shown as a corollary to the following lemma (proof not included due to lack of space): Lemma 4.1. Let {x1 , . . . , xn } and {y1 , . . . , yn } be column vectors in Rn . If each yj is a convex n n combination of the xi ’s, i.e. yj = i=1 γji xi , i=1 γji = 1, γjk ≥ 0, ∀j, k ∈ {1, . . . , n} then |det[y1 , . . . , yn ]| ≤ |det[x1 , . . . , xn ]|. It follows from Lemma 4.1 that the constructive solutions for M given in (2) and the monotonic M are optimal in the sense of maximizing entropy: Corollary 4.1. Let Mreg be the set of left-stochastic matrices M that parameterize shadow Dirichlet ˜ ˘ ˜ ˘ distributions with support in {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for a speciﬁc choice of λ and θ. Then the M given in (2) results in the shadow Dirichlet with maximum entropy, that is, (2) solves arg maxM ∈Mreg h(pM ). Corollary 4.2. Let Mmono be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions that generate only monotonic pmfs. Then the monotonic M given in Section 3.3 results in the shadow Dirichlet with maximum entropy, that is, the monotonic M solves arg maxM ∈Mmono h(pM ). 5 Estimating the Distribution from Data In this section, we discuss the estimation of α for the shadow Dirichlet and compound shadow Dirichlet, and the estimation of M . 5.1 Estimating α for the Shadow Dirichlet Let matrix M be speciﬁed (for example, as described in the subsections of Section 3), and let q be a d × N matrix where the ith column qi is the ith sample pmf for i = 1 . . . N , and let (qi )j be the jth component of the ith sample pmf for j = 1, . . . , d. Then ﬁnding the maximum likelihood estimate 5 of α for the shadow Dirichlet is straightforward:  N 1 arg max log + log  p(qi |α) ≡ arg max log B(α) |det(M )| α∈Rk α∈Rk + + i=1   1 αj −1  (˜i )j q , ≡ arg max log  B(α)N i j α∈Rk +  N α −1 (M −1 qi )j j  i j (6) where q = M −1 q. Note (6) is the maximum likelihood estimation problem for the Dirichlet dis˜ tribution given the matrix q , and can be solved using the standard methods for that problem (see ˜ e.g. [11, 12]). 5.2 Estimating α for the Compound Shadow Dirichlet For many machine learning applications the given data are modeled as samples from realizations of a random pmf, and given these samples one must estimate the random pmf model’s parameters. We refer to this case as the compound shadow Dirichlet, analogous to the compound Dirichlet (also called the multivariate P´ lya distribution). Assuming one has already speciﬁed M , we ﬁrst discuss o method of moments estimation, and then describe an expectation-maximization (EM) method for computing the maximum likelihood estimate α. ˘ One can form an estimate of α by the method of moments. For the standard compound Dirichlet, one treats the samples of the realizations as normalized empirical histograms, sets the normalized α parameter equal to the empirical mean of the normalized histograms, and uses the empirical variances to determine the precision α0 . By deﬁnition, this estimate will be less likely than the maximum likelihood estimate, but may be a practical short-cut in some cases. For the compound shadow Dirichlet, we believe the method of moments estimator will be a poorer estimate in general. The problem is that if one draws samples from a pmf θ from a restricted subset SM of the simplex, ˘ then the normalized empirical histogram θ of those samples may not be in SM . For example given a monotonic pmf, the histogram of ﬁve samples drawn from it may not be monotonic. Then the empirical mean of such normalized empirical histograms may not be in SM , and so setting the shadow Dirichlet mean M α equal to the empirical mean may lead to an infeasible estimate (one that is outside SM ). A heuristic solution is to project the empirical mean into SM ﬁrst, for example, by ﬁnding the nearest pmf in SM in squared error or relative entropy. As with the compound Dirichlet, this may still be a useful approach in practice for some problems. Next we state an EM method to ﬁnd the maximum likelihood estimate α. Let s be a d × N matrix ˘ of sample histograms from different experiments, such that the ith column si is the ith histogram for i = 1, . . . , N , and (si )j is the number of times we have observed the jth event from the ith pmf vi . Then the maximum log-likelihood estimate of α solves arg max log p(s|α) for α ∈ Rk . + If the random pmfs are drawn from a Dirichlet distribution, then ﬁnding this maximum likelihood estimate requires an iterative procedure, and can be done in several ways including a gradient descent (ascent) approach. However, if the random pmfs are drawn from a shadow Dirichlet distribution, then a direct gradient descent approach is highly inconvenient as it requires taking derivatives of numerical integrals. However, it is practical to apply the expectation-maximization (EM) algorithm [13][14], as we describe in the rest of this section. Code to perform the EM estimation of α can be downloaded from idl.ee.washington.edu/publications.php. We assume that the experiments are independent and therefore p(s|α) = p({si }|α) = and hence arg maxα∈Rk log p(s|α) = arg maxα∈Rk i log p(si |α). + + i p(si |α) To apply the EM method, we consider the complete data to be the sample histograms s and the pmfs that generated them (s, v1 , v2 , . . . , vN ), whose expected log-likelihood will be maximized. Speciﬁcally, because of the assumed independence of the {vi }, the EM method requires one to repeatedly maximize the Q-function such that the estimate of α at the (m + 1)th iteration is: N α(m+1) = arg max α∈Rk + Evi |si ,α(m) [log p(vi |α)] . i=1 6 (7) Like the compound Dirichlet likelihood, the compound shadow Dirichlet likelihood is not necessarily concave. However, note that the Q-function given in (7) is concave, because log p(vi |α) = − log |det(M )| + log pD,α M −1 vi , where pD,α is the Dirichlet distribution with parameter α, and by a theorem of Ronning [11], log pD,α is a concave function, and adding a constant does not change the concavity. The Q-function is a ﬁnite integration of such concave functions and hence also concave [15]. We simplify (7) without destroying the concavity to yield the equivalent problem α(m+1) = d d arg max g(α) for α ∈ Rk , where g(α) = log Γ(α0 ) − j=1 log Γ(αj ) + j=1 βj αj , and + βj = N tij i=1 zi , 1 N where tij and zi are integrals we compute with Monte Carlo integration: d (s ) log(M −1 vi )j γi tij = SM (vi )k i k pM (vi |α(m) )dvi k=1 d zi = (vi )j k(si )k pM (vi |α(m) )dvi , γi SM k=1 where γi is the normalization constant for the multinomial with histogram si . We apply the Newton method [16] to maximize g(α), where the gradient g(α) has kth component ψ0 (α0 ) − ψ0 (α1 ) + β1 , where ψ0 denotes the digamma function. Let ψ1 denote the trigamma function, then the Hessian matrix of g(α) is: H = ψ1 (α0 )11T − diag (ψ1 (α1 ), . . . , ψ1 (αd )) . Note that because H has a very simple structure, the inversion of H required by the Newton step is greatly simpliﬁed by using the Woodbury identity [17]: H −1 = − diag(ξ1 , . . . , ξd ) − 1 1 1 [ξi ξj ]d×d , where ξ0 = ψ1 (α0 ) and ξj = ψ1 (αj ) , j = 1, . . . , d. ξ − d ξ 0 5.3 j=1 j Estimating M for the Shadow Dirichlet Thus far we have discussed how to construct M to achieve certain desired properties and how to interpret a given M ’s effect on the support. In some cases it may be useful to estimate M directly from data, for example, ﬁnding the maximum likelihood M . In general, this is a non-convex problem because the set of rank d − 1 matrices is not convex. However, we offer two approximations. First, note that as in estimating the support of a uniform distribution, the maximum likelihood M will correspond to a support that is no larger than needed to contain the convex hull of sample pmfs. Second, the mean of the empirical pmfs will be in the support, and thus a heuristic is to set the kth column of M (which corresponds to the kth vertex of the support) to be a convex combination of the kth vertex of the standard probability simplex and the empirical mean pmf. We provide code that ﬁnds the d optimal such convex combinations such that a speciﬁced percentage of the sample pmfs are within the support, which reduces the non-convex problem of ﬁnding the maximum likelihood d × d matrix M to a d-dimensional convex relaxation. 6 Demonstrations It is reasonable to believe that if the shadow Dirichlet better matches the problem’s statistics, it will perform better in practice, but an open question is how much better? To motivate the reader to investigate this question further in applications, we provide two small demonstrations. 6.1 Verifying the EM Estimation We used a broad suite of simulations to test and verify the EM estimation. Here we include a simple visual conﬁrmation that the EM estimation works: we drew 100 i.i.d. pmfs from a shadow Dirichlet with monotonic M for d = 3 and α = [3.94 2.25 2.81] (used in [18]). From each of the 100 pmfs, we drew 100 i.i.d. samples. Then we applied the EM algorithm to ﬁnd the α for both the standard compound Dirichlet, and the compound shadow Dirichlet with the correct M . Fig. 2 shows the true distribution and the two estimated distributions. 7 True Distribution (Shadow Dirichlet) Estimated Shadow Dirichlet Estimated Dirichlet Figure 2: Samples were drawn from the true distribution and the given EM method was applied to form the estimated distributions. 6.2 Estimating Proportions from Sales Manufacturers often have constrained manufacturing resources, such as equipment, inventory of raw materials, and employee time, with which to produce multiple products. The manufacturer must decide how to proportionally allocate such constrained resources across their product line based on their estimate of proportional sales. Manufacturer Artifact Puzzles gave us their past retail sales data for the 20 puzzles they sold during July 2009 through Dec 2009, which we used to predict the proportion of sales expected for each puzzle. These estimates were then tested on the next ﬁve months of sales data, for January 2010 through April 2010. The company also provided a similarity between puzzles S, where S(A, B) is the proportion of times an order during the six training months included both puzzle A and B if it included puzzle A. We compared treating each of the six training months of sales data as a sample from a compound Dirichlet versus or a compound shadow Dirichlet. For the shadow Dirichlet, we normalized each column of the similarity matrix S to sum to one so that it was left-stochastic, and used that as the M matrix; this forces puzzles that are often bought together to have closer estimated proportions. We estimated each α parameter by EM to maximize the likelihood of the past sales data, and then estimated the future sales proportions to be the mean of the estimated Dirichlet or shadow Dirichlet distribution. We also compared with treating all six months of sales data as coming from one multinomial which we estimated as the maximum likelihood multinomial, and to taking the mean of the six empirical pmfs. Table 2: Squared errors between estimates and actual proportional sales. Jan. Feb. Mar. Apr. 7 Multinomial .0129 .0185 .0231 .0240 Mean Pmf .0106 .0206 .0222 .0260 Dirichlet .0109 .0172 .0227 .0235 Shadow Dirichlet .0093 .0164 .0197 .0222 Summary In this paper we have proposed a variant of the Dirichlet distribution that naturally captures some of the dependent structure that arises often in machine learning applications. We have discussed some of its theoretical properties, and shown how to specify the distribution for regularized pmfs, bounded variation pmfs, monotonic pmfs, and for any desired convex polytopal domain. We have derived the EM method and made available code to estimate both the shadow Dirichlet and compound shadow Dirichlet from data. Experimental results demonstrate that the EM method can estimate the shadow Dirichlet effectively, and that the shadow Dirichlet may provide worthwhile advantages in practice. 8 References [1] B. Frigyik, A. Kapila, and M. R. Gupta, “Introduction to the Dirichlet distribution and related processes,” Tech. Rep., University of Washington, 2010. [2] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to information retrieval,” ACM Trans. on Information Systems, vol. 22, no. 2, pp. 179–214, 2004. [3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, “Similarity-based classiﬁcation: Concepts and algorithms,” Journal of Machine Learning Research, vol. 10, pp. 747–776, March 2009. [4] R. Nallapati, T. Minka, and S. Robertson, “The smoothed-Dirichlet distribution: a building block for generative topic models,” Tech. Rep., Microsoft Research, Cambridge, 2007. [5] Aitchison, Statistical Analysis of Compositional Data, Chapman Hall, New York, 1986. [6] R. J. Connor and J. E. Mosiman, “Concepts of independence for proportions with a generalization of the Dirichlet distibution,” Journal of the American Statistical Association, vol. 64, pp. 194–206, 1969. [7] K. Pearson, “Mathematical contributions to the theory of evolution–on a form of spurious correlation which may arise when indices are used in the measurement of organs,” Proc. Royal Society of London, vol. 60, pp. 489–498, 1897. [8] A. Ongaro, S. Migliorati, and G. S. Monti, “A new distribution on the simplex containing the Dirichlet family,” Proc. 3rd Compositional Data Analysis Workshop, 2008. [9] W. S. Rayens and C. Srinivasan, “Dependence properties of generalized Liouville distributions on the simplex,” Journal of the American Statistical Association, vol. 89, no. 428, pp. 1465– 1470, 1994. [10] Walter Rudin, Functional Analysis, McGraw-Hill, New York, 1991. [11] G. Ronning, “Maximum likelihood estimation of Dirichlet distributions,” Journal of Statistical Computation and Simulation, vol. 34, no. 4, pp. 215221, 1989. [12] T. Minka, “Estimating a Dirichlet distribution,” Tech. Rep., Microsoft Research, Cambridge, 2009. [13] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977. [14] M. R. Gupta and Y. Chen, Theory and Use of the EM Method, Foundations and Trends in Signal Processing, Hanover, MA, 2010. [15] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970. [16] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004. [17] K. B. Petersen and M. S. Pedersen, Matrix Cookbook, 2009, Available at matrixcookbook.com. [18] R. E. Madsen, D. Kauchak, and C. Elkan, “Modeling word burstiness using the Dirichlet distribution,” in Proc. Intl. Conf. Machine Learning, 2005. 9

5 0.58745891 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efﬁcient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efﬁcient method for implementing mixture forecasting strategies. 1

6 0.58729738 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

7 0.58411586 27 nips-2010-Agnostic Active Learning Without Constraints

8 0.58347374 243 nips-2010-Smoothness, Low Noise and Fast Rates

9 0.58181375 282 nips-2010-Variable margin losses for classifier design

10 0.58003944 63 nips-2010-Distributed Dual Averaging In Networks

11 0.57938671 117 nips-2010-Identifying graph-structured activation patterns in networks

12 0.57893372 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data

13 0.57862151 142 nips-2010-Learning Bounds for Importance Weighting

14 0.57819945 234 nips-2010-Segmentation as Maximum-Weight Independent Set

15 0.57810104 15 nips-2010-A Theory of Multiclass Boosting

16 0.57591242 148 nips-2010-Learning Networks of Stochastic Differential Equations

17 0.57585454 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

18 0.57550776 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

19 0.57536525 265 nips-2010-The LASSO risk: asymptotic results and real world examples

20 0.57407528 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models