27 nips-2011-Advice Refinement in Knowledge-Based SVMs

Source: pdf

Author: Gautam Kunapuli, Richard Maclin, Jude W. Shavlik

Abstract: Knowledge-based support vector machines (KBSVMs) incorporate advice from domain experts, which can improve generalization significantly. A major limitation that has not been fully addressed occurs when the expert advice is imperfect, which can lead to poorer models. We propose a model that extends KBSVMs and is able not only to learn from data and advice, but also to simultaneously improve the advice. The proposed approach is particularly effective for knowledge discovery in domains with few labeled examples. The proposed model contains bilinear constraints, and is solved using two iterative approaches: successive linear programming and a constrained concave-convex approach. Experimental results demonstrate that these algorithms yield useful refinements to expert advice, as well as improve the performance of the learning algorithm overall.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Knowledge-based support vector machines (KBSVMs) incorporate advice from domain experts, which can improve generalization significantly. [sent-10, score-0.804]

2 A major limitation that has not been fully addressed occurs when the expert advice is imperfect, which can lead to poorer models. [sent-11, score-0.771]

3 The proposed model contains bilinear constraints, and is solved using two iterative approaches: successive linear programming and a constrained concave-convex approach. [sent-14, score-0.208]

4 1 Introduction We are primarily interested in learning in domains where there is only a small amount of labeled data but advice can be provided by a domain expert. [sent-16, score-0.766]

5 Many of the rule-learning methods focus on rule reﬁnement to learn better rules, while ANNs form the rules as portions of the network which are reﬁned by backpropagation. [sent-21, score-0.121]

6 KBSVMs have been extensively studied, and in addition to linear classiﬁcation, they have been extended to incorporate kernels [5], nonlinear advice [14] and for kernel approximation [13]. [sent-25, score-0.728]

7 Extensive empirical results from this prior work establish that expert advice can be effective, especially for biomedical applications such as breast-cancer diagnosis. [sent-29, score-0.771]

8 Rather than simply ignoring or heavily penalizing inaccurate rules, the effectiveness of the advice can be improved through reﬁnement. [sent-32, score-0.767]

9 A piece of advice set 1 extends over the margin, and is penalized as the advice error. [sent-34, score-1.456]

10 , none of the rules in advice set 2 are useful as support constraints. [sent-37, score-0.836]

11 (right) SVM that refines advice in two ways: (1) advice set 1 is refined so that no part of it is on the wrong side of the optimal hyperplane, minimizing advice error, (2) advice set 2 is expanded until it touches the optimal margin, thus maximizing coverage of input space. [sent-38, score-2.968]

12 First, advice is speciﬁed as polyhedral regions in input space, whose constraints on the features are easily interpretable by non-experts. [sent-42, score-0.846]

13 Second, it is well-known that KBSVMs can learn to generalize well with small data sets [9], and can even learn from advice alone [6]. [sent-43, score-0.728]

14 Finally, owing to the simplicity of the formulation, advice-reﬁnement terms for the rules can be incorporated directly into the model. [sent-44, score-0.123]

15 We further motivate advice reﬁnement in KBSVMs with the following example. [sent-45, score-0.728]

16 As mentioned before, expert rules are speciﬁed in the KBSVM framework as polyhedral advice regions in input space. [sent-48, score-0.918]

17 They introduce a bias to focus the learner on a model that also includes the advice of the form ∀x, (x ∈ advice region i) ⇒ class(x) = 1. [sent-49, score-1.476]

18 In the KBSVM (Figure 1, center), each advice region contributes to the ﬁnal hypothesis in a KBSVM via its advice vector, u1 and u2 (as introduced in [6]; also see Section 2). [sent-51, score-1.476]

19 The individual constraints that touch or intersect the margin have non-zero ui components. [sent-52, score-0.223]

20 As a piece of advice region 1 extends beyond the margin, $u^1 \neq 0$; furthermore, analogous to data error, this overlap is penalized as the advice error. [sent-53, score-1.476]

21 As no part of advice set 2 touches the margin, u2 = 0 and none of its rules contribute anything to the ﬁnal classiﬁer. [sent-54, score-0.834]

22 Again, analogous to support vectors, rules with non-zero ui components are called support constraints [6]. [sent-55, score-0.339]

23 Consequently, in the final classifier the advice sets are incorporated with advice error (advice set 1) or are completely ignored (advice set 2). [sent-56, score-1.476]

24 However, simply penalizing advice that introduces errors can make learning difﬁcult as the user must carefully trade off between optimizing data or advice loss. [sent-58, score-1.456]

25 Now, consider an SVM that is capable of reﬁning inaccurate advice (Figure 1, right). [sent-59, score-0.767]

26 When advice is inaccurate and intersects the hyperplane, it is truncated such that it minimizes the advice error. [sent-60, score-1.495]

27 The optimal classiﬁer has now minimized the error with respect to the data and the reﬁned advice and is able to further improve upon the performance of not just the SVM but also the KBSVM. [sent-62, score-0.728]

28 Thus, the goal is to reﬁne potentially inaccurate expert advice during learning so as to learn a model with the best generalization. [sent-63, score-0.81]

29 [12], to produce a model that corrects the polyhedral advice regions of KBSVMs. [sent-65, score-0.8]

30 The resulting mathematical program is no longer a linear or quadratic program owing to bilinear correction factors in the constraints. [sent-66, score-0.141]

31 We propose two algorithmic techniques to solve the resulting bilinear program, one based on successive linear programming [12], and the other based on a concave-convex procedure [24]. [sent-67, score-0.224]

32 Before we describe advice reﬁnement, we brieﬂy introduce our notation and KBSVMs. [sent-68, score-0.728]

33 We assume that m advice sets $(D_i, d^i, z_i)_{i=1}^m$ are given in addition to the data (see Section 2), and if the i-th advice set has $k_i$ constraints, we have $D_i \in \mathbb{R}^{k_i \times n}$, $d^i \in \mathbb{R}^{k_i}$ and $z_i \in \{\pm 1\}$. [sent-71, score-2.02]

34 2 Knowledge-Based Support Vector Machines In KBSVMs, advice can be speciﬁed about every potential data point in the input space that satisﬁes certain advice constraints. [sent-74, score-1.456]

35 The National Institute for Health (NIH) provides the following guidelines to establish risk for Type-2 Diabetes1 : a person who is obese (bmi ≥ 30) with gluc ≥ 126 is at strong risk for diabetes, while a person who is at normal weight (bmi ≤ 25) with gluc ≤ 100 is unlikely to have diabetes. [sent-76, score-0.312]

36 This leads to two advice sets, one for each class: (bmi ≤ 25) ∧ (gluc ≤ 100) ⇒ ¬diabetes; (bmi ≥ 30) ∧ (gluc ≥ 126) ⇒ diabetes, (1) where ¬ is the negation operator. [sent-77, score-0.728]
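The two advice sets in (1) can be written in the polyhedral form Dx ≤ d that KBSVMs expect. The following is a minimal Python sketch, assuming the feature order x = (bmi, gluc); the names `AdviceSet` and `covers` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AdviceSet:
    """Polyhedral advice region {x : D x <= d} with class label z in {-1, +1}."""
    D: List[List[float]]
    d: List[float]
    z: int

    def covers(self, x: Sequence[float]) -> bool:
        # x lies in the region when every row constraint D_j . x <= d_j holds.
        return all(
            sum(a * xi for a, xi in zip(row, x)) <= bound
            for row, bound in zip(self.D, self.d)
        )


# Assumed feature order: x = (bmi, gluc).
# (bmi <= 25) and (gluc <= 100)  =>  not diabetes (z = -1)
low_risk = AdviceSet(D=[[1.0, 0.0], [0.0, 1.0]], d=[25.0, 100.0], z=-1)
# (bmi >= 30) and (gluc >= 126)  =>  diabetes (z = +1);
# the ">=" rows are negated to fit the D x <= d convention.
high_risk = AdviceSet(D=[[-1.0, 0.0], [0.0, -1.0]], d=[-30.0, -126.0], z=+1)
```

For instance, `high_risk.covers((32, 130))` holds, while a person with bmi 27 and gluc 110 falls in neither region, so neither rule applies to that point.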

37 , m, can be incorporated into (3) using the nonhomogeneous Farkas theorem of the alternative [6] that introduces advice vectors ui . [sent-85, score-0.917]

38 The advice vectors perform the same role as the dual multipliers α in the classical SVM. [sent-86, score-0.728]

39 Similarly, the constraints of an advice set which have non-zero ui s are called support constraints. [sent-88, score-0.959]

40 The resulting formulation is the KBSVM, which optimizes model complexity + λ data loss + µ advice loss: $\min_{w,b,(\xi,u^i,\eta^i,\zeta_i)\ge 0} \|w\|_1 + \lambda e'\xi + \mu \sum_{i=1}^m (e'\eta^i + \zeta_i)$ s.t. [sent-89, score-0.762]

41 $Y(Xw - be) + \xi \ge e$, $-\eta^i \le D_i' u^i + z_i w \le \eta^i$, $-d^{i\prime} u^i - z_i b + \zeta_i \ge 1$, $i = 1, \ldots, m$. (4) [sent-91, score-0.59]

42 In the case of inaccurate advice, the advice errors η i and ζi soften the advice constraints analogous to the data errors ξ. [sent-95, score-1.524]

43 Returning to Figure 1, for advice set 1, η 1 , ζ1 and u1 are non-zero, while for advice set 2, u2 = 0. [sent-96, score-1.456]

44 The inﬂuence of data and advice is determined by the choice of the parameters λ and µ which reﬂect the user’s trust in the data and advice respectively. [sent-97, score-1.456]

45 [12] formulated a model to reﬁne advice in KBSVMs. [sent-99, score-0.728]

46 However, their model is limited as only the terms di are reﬁned, which as we discuss below, greatly restricts the types of reﬁnements that are possible. [sent-100, score-0.156]

47 They only consider refinement terms $f^i$ for the right hand side of the i-th advice set, and attempt to refine each rule such that $D_i x \le (d^i - f^i) \Rightarrow z_i(w'x - b) \ge 1$, $i = 1, \ldots, m$. [sent-101, score-0.9]

48 (5) The resulting formulation adds reﬁnement terms into the KBSVM model (4) in the advice constraints, as well as in the objective. [sent-105, score-0.746]

49 $Y(Xw - be) + \xi \ge e$, $-\eta^i \le D_i' u^i + z_i w \le \eta^i$, $-(d^i - f^i)' u^i - z_i b + \zeta_i \ge 1$, $i = 1, \ldots, m$. [sent-109, score-0.421]

50 gov/DM/pubs/∼riskfortype2 (6) This problem is no longer an LP owing to the bilinear terms $f^{i\prime} u^i$, which make the refinement constraints non-convex. [sent-116, score-0.318]

51 solve this problem using successive linear programming (SLP) wherein linear programs arising from alternately ﬁxing either the advice terms di or the reﬁnement terms f i are solved iteratively. [sent-118, score-1.049]

52 We consider a full generalization of the RRSVM approach and develop a model where it is possible to reﬁne the entire advice region Dx ≤ d. [sent-119, score-0.748]

53 This allows for much more ﬂexibility in reﬁning the advice based on the data, while still retaining interpretability of the resulting reﬁned advice. [sent-120, score-0.728]

54 In addition to the terms $f^i$, we propose the introduction of additional refinement terms $F_i$ into the model, so that we can refine the rules in as general a manner as possible: $(D_i - F_i)x \le (d^i - f^i) \Rightarrow z_i(w'x - b) \ge 1$, $i = 1, \ldots, m$. [sent-121, score-0.201]

55 (7) Recall that for each advice set we have $D_i \in \mathbb{R}^{k_i \times n}$ and $d^i \in \mathbb{R}^{k_i}$, i. [sent-125, score-0.884]

56 The corresponding refinement terms $F_i$ and $f^i$ will have the same dimensions as $D_i$ and $d^i$, respectively. [sent-128, score-0.156]
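To make the refined region in (7) concrete, the sketch below subtracts refinement terms from an advice set and checks membership before and after. All numeric values for F and f are hypothetical, chosen only for illustration; they are not learned values and do not come from the paper.

```python
def refine(D, d, F, f):
    """Return the refined region (D - F) x <= (d - f) as (D_ref, d_ref)."""
    D_ref = [[a - b for a, b in zip(row_D, row_F)] for row_D, row_F in zip(D, F)]
    d_ref = [a - b for a, b in zip(d, f)]
    return D_ref, d_ref


def in_region(D, d, x):
    """True when every constraint D_j . x <= d_j holds."""
    return all(sum(a * xi for a, xi in zip(row, x)) <= bound
               for row, bound in zip(D, d))


# Original advice (bmi >= 30) and (gluc >= 126), written as D x <= d
# with the assumed feature order x = (bmi, gluc).
D = [[-1.0, 0.0], [0.0, -1.0]]
d = [-30.0, -126.0]

# Hypothetical refinement: F = 0 leaves the constraint directions unchanged,
# while f shifts only the right-hand side (the special case where only d is refined).
F = [[0.0, 0.0], [0.0, 0.0]]
f = [-2.0, 0.0]  # d - f = [-28, -126], i.e. the bmi threshold relaxes to bmi >= 28

D_ref, d_ref = refine(D, d, F, f)
```

A non-zero row in F would additionally tilt or rescale the corresponding constraint, which is the extra freedom the full refinement (D − F, d − f) offers over shifting d alone.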

57 $\|w\|_1 + \lambda e'\xi + \mu \sum_{i=1}^m (e'\eta^i + \zeta_i) + \nu \sum_{i=1}^m (\|F_i\|_1 + \|f^i\|_1)$ s.t. $Y(Xw - be) + \xi \ge e$, $-\eta^i \le (D_i - F_i)' u^i + z_i w \le \eta^i$, $-(d^i - f^i)' u^i - z_i b + \zeta_i \ge 1$, $i = 1, \ldots, m$. [sent-131, score-0.723]

58 (8) The objective function of (8) trades off the effect of refinement in each of the advice sets via the refinement parameter ν. [sent-135, score-0.728]

59 Second, the newly added refinement terms, $F_i' u^i$, are also bilinear, and do not make the overall problem more complex; in addition to the successive linear programming approach of [12], we also propose a concave-convex procedure that leads to an approach based on successive quadratic programming. [sent-139, score-0.477]

60 1 arkSVMs via Successive Linear Programming One approach to solving bilinear programming problems is to solve a sequence of linear programs while alternately ﬁxing the bilinear variables. [sent-142, score-0.27]

61 • (Refinement Step) When the advice-estimate terms $\{\hat u^{i,t}\}_{i=1}^m$ are fixed, the resulting LP solves for $(F_i, f^i)_{i=1}^m$ and attempts to further refine the advice regions based on estimates from data computed in the previous step. [sent-147, score-0.747]
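The alternation between fixing one block of bilinear variables and optimizing the other can be sketched generically. The Python below is only an illustration of the alternating pattern on a toy scalar objective with a bilinear term; it is not the arkSVM linear programs, and the function names (`alternate`, `toy_estimate`, `toy_refine`) are hypothetical.

```python
def alternate(estimate_step, refine_step, u0, F0, objective, tol=1e-9, max_iter=100):
    """Alternately fix one block of bilinear variables and optimize the other."""
    u, F = u0, F0
    history = [objective(u, F)]
    for _ in range(max_iter):
        u = estimate_step(F)   # refinement term fixed, solve for the advice estimate
        F = refine_step(u)     # advice estimate fixed, solve for the refinement term
        history.append(objective(u, F))
        if history[-2] - history[-1] < tol:
            break              # objective has stopped decreasing
    return u, F, history


# Toy stand-in (NOT the LPs of the paper): the product u * F is bilinear,
# but with either variable fixed the objective is a convex quadratic in the other.
def toy_objective(u, F):
    return (u * F - 2.0) ** 2 + 0.1 * (u ** 2 + F ** 2)


def toy_estimate(F):
    # Closed-form minimizer over u for fixed F: 2F(uF - 2) + 0.2u = 0.
    return 2.0 * F / (F ** 2 + 0.1)


def toy_refine(u):
    # Closed-form minimizer over F for fixed u (same form by symmetry).
    return 2.0 * u / (u ** 2 + 0.1)


u, F, history = alternate(toy_estimate, toy_refine, 1.0, 1.0, toy_objective)
```

Because each step minimizes the objective exactly over one block, the recorded objective values never increase. In this toy the two subproblems have closed forms; in the setting described above each one would instead be handed to an LP solver.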

62 Such an accumulation point satisfies the local minimum condition $(\bar w, \bar b) \in \arg\min_{w,b,(\xi,\eta^i,\zeta_i \ge 0),\, u^i \ge 0} \|w\|_1 + \lambda e'\xi + \mu \sum_{i=1}^m (e'\eta^i + \zeta_i)$ subject to $Y(Xw - be) + \xi \ge e$, $-\eta^i \le (D_i - \bar F_i)' u^i + z_i w \le \eta^i$, $-(d^i - \bar f^i)' u^i - z_i b + \zeta_i \ge 1$, $i = 1, \ldots, m$. [sent-151, score-0.759]

63 5: return failure (refinement step) solve for $(\hat F_i^{t+1}, \hat f^{i,t+1})_{i=1}^m$: $\min_{w,b,F_i,f^i,(\xi,\eta^i,\zeta_i)\ge 0} \|w\|_1 + \lambda e'\xi + \mu \sum_{i=1}^m (e'\eta^i + \zeta_i) + \nu \sum_{i=1}^m (\|F_i\|_1 + \|f^i\|_1)$ s.t. $Y(Xw - be) + \xi \ge e$, [sent-161, score-0.149]

64 $-\eta^i \le (D_i - F_i)' \hat u^{i,t+1} + z_i w \le \eta^i$, $-(d^i - f^i)' \hat u^{i,t+1} - z_i b + \zeta_i \ge 1$, $i = 1, \ldots, m$. [sent-163, score-0.252]

65 5: (termination test) if 6: (continue) t = t + 1 7: end while return failure; $\|w\|_1 + \lambda e'\xi + \mu \sum_{i=1}^m (e'\eta^i + \zeta_i) + \nu \sum_{i=1}^m (\|F_i\|_1 + \|f^i\|_1)$ s.t. $Y(Xw - be) + \xi \ge e$, eqns (10–12), $i = 1, \ldots, m$. [sent-169, score-0.133]

66 In the constraint $(D_i - F_i)' u^i + z_i w - \eta^i \le 0$, only the refinement term $F_i' u^i$ is bilinear, while the rest of the constraint is linear. [sent-177, score-0.464]

67 Thus, we have the equivalent constraint $D_{ij}' u^i + z_i w_j - \eta^i_j + \frac{1}{4}\|F_{ij} - u^i\|^2 \le \frac{1}{4}\|F_{ij} + u^i\|^2$, (9) and both sides of the constraint above are convex and quadratic. [sent-180, score-0.651]

68 We can linearize the right-hand side of (9) around some current estimate of the bilinear variables $(\hat F_{ij}^t, \hat u^{i,t})$: $D_{ij}' u^i + z_i w_j - \eta^i_j + \frac{1}{4}\|F_{ij} - u^i\|^2 \le \frac{1}{4}\|\hat F_{ij}^t + \hat u^{i,t}\|^2 + \frac{1}{2}(\hat F_{ij}^t + \hat u^{i,t})'\big((F_{ij} - \hat F_{ij}^t) + (u^i - \hat u^{i,t})\big)$. [sent-181, score-0.574]

69 (10) Similarly, the constraint $-(D_i - F_i)' u^i - z_i w - \eta^i \le 0$ can be replaced by $-D_{ij}' u^i - z_i w_j - \eta^i_j + \frac{1}{4}\|F_{ij} + u^i\|^2 \le \frac{1}{4}\|\hat F_{ij}^t - \hat u^{i,t}\|^2 + \frac{1}{2}(\hat F_{ij}^t - \hat u^{i,t})'\big((F_{ij} - \hat F_{ij}^t) - (u^i - \hat u^{i,t})\big)$, (11) Figure 2: Toy data set (Section 4. [sent-182, score-0.777]

70 while $d^{i\prime} u^i + z_i b + 1 - \zeta_i - f^{i\prime} u^i \le 0$ is replaced by $d^{i\prime} u^i + z_i b + 1 - \zeta_i + \frac{1}{4}\|f^i - u^i\|^2 \le \frac{1}{4}\|\hat f^{i,t} + \hat u^{i,t}\|^2 + \frac{1}{2}(\hat f^{i,t} + \hat u^{i,t})'\big((f^i - \hat f^{i,t}) + (u^i - \hat u^{i,t})\big)$. [sent-187, score-1.24]
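The convexification above rests on two facts: a bilinear term F'u equals a difference of two convex quadratics, ¼‖F + u‖² − ¼‖F − u‖², and a first-order expansion of a convex function never exceeds the function itself. The plain-Python sketch below checks both numerically; the vectors are arbitrary test values and the function names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_norm(a):
    return dot(a, a)


def bilinear_as_dc(F, u):
    """F'u written as a difference of two convex quadratics."""
    plus = [x + y for x, y in zip(F, u)]
    minus = [x - y for x, y in zip(F, u)]
    return 0.25 * sq_norm(plus) - 0.25 * sq_norm(minus)


def linearized_plus(F, u, F_hat, u_hat):
    """First-order expansion of (1/4)||F + u||^2 around (F_hat, u_hat)."""
    s_hat = [x + y for x, y in zip(F_hat, u_hat)]
    step = [(x - xh) + (y - yh) for x, xh, y, yh in zip(F, F_hat, u, u_hat)]
    return 0.25 * sq_norm(s_hat) + 0.5 * dot(s_hat, step)


# Arbitrary test vectors (not taken from the paper).
F, u = [1.0, -2.0, 0.5], [3.0, 0.25, -1.0]
F_hat, u_hat = [0.5, -1.0, 0.0], [2.0, 0.0, -0.5]

exact = 0.25 * sq_norm([x + y for x, y in zip(F, u)])
approx = linearized_plus(F, u, F_hat, u_hat)
```

Since the linearized right-hand side is a lower bound on the original one, any point satisfying the convexified constraint also satisfies the original bilinear constraint, so each iterate remains feasible for the original problem.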

71 Replacing the original bilinear non-convex constraints of (8) with the convexiﬁed relaxations results in a quadratically-constrained linear program (QCLP). [sent-189, score-0.121]

72 For either solution, the following proposition holds, which shows that either algorithm produces a reﬁnement of the original polyhedral advice regions. [sent-198, score-0.799]

73 Proposition 3 Let $(\bar w, \bar b, \bar u^i, \bar F_i, \bar f^i, \bar\xi, \bar\eta^i, \bar\zeta_i)$ be the local minimum solution produced by Algorithm 1 or Algorithm 2. [sent-201, score-0.169]

74 Then, the following refinement to the advice sets holds: $(D_i - \bar F_i)x \le (d^i - \bar f^i) \Rightarrow z_i(\bar w' x - \bar b) \ge -\hat\eta^{i\prime} x - \bar\zeta_i$, where $-\bar\eta^i \le \hat\eta^i \le \bar\eta^i$ such that $D_i' \bar u^i + \bar w + \hat\eta^i = 0$. [sent-202, score-1.023]

75 1 Toy Example We illustrate the behavior of advice reﬁnement algorithms discussed previously geometrically using a simple 2-dimensional example (Figure 2). [sent-206, score-0.728]

76 There are two advice sets: {S1 : (x1 , x2 ) ≥ 0 ⇒ z = +1}, {S2 : (x1 , x2 ) ≤ 0 ⇒ 2 http://www2. [sent-208, score-0.728]

77 edu/∼wcook/qsopt/ [Plot: Testing Error (%) vs. Number of Training Examples for svm, kbsvm, rrsvm, arksvm-sla, arksvm-sqp] Figure 3: Diabetes data set, Section 4. [sent-211, score-0.197]

78 In addition, the reﬁnement terms allow for sufﬁcient modiﬁcation of the advice sets Dx ≤ d so that they ﬁll the input space as much as possible, without violating the margin. [sent-218, score-0.728]

79 Comparing to RRSVMs, we see that reﬁnement is restrictive because corrections are applied only to part of the advice sets, rather than fully correcting the advice. [sent-219, score-0.728]

80 Studies [15] show that diabetes incidence among the Pima Indians is signiﬁcantly higher among subjects with bmi ≥ 30. [sent-223, score-0.425]

81 In addition, a person with impaired glucose tolerance is at a signiﬁcant risk for, or worse, has undiagnosed diabetes [8]. [sent-224, score-0.421]

82 The diabetes pedigree function was developed by Smith et al. [sent-226, score-0.349]

83 [18], and uses genetic information from family relatives to provide a measure of the expected genetic inﬂuence (heredity) on the subject’s diabetes risk. [sent-227, score-0.346]

84 A subject with high heredity who is at least 31 is at a signiﬁcantly increased risk for diabetes in the next ﬁve years: (Diabetes Rule 5) (Diabetes Rule 6) (pedf ≤ 0. [sent-229, score-0.346]

85 Figure 3 (left) shows that unreﬁned advice does help initially, especially with as few as 30 data points. [sent-232, score-0.728]

86 However, as more data points are available, the effect of the advice diminishes. [sent-233, score-0.728]

87 In contrast, the advice reﬁning methods are able to generalize much better with few data points, and eventually converge to a better solution. [sent-234, score-0.728]

88 This tree was constructed by sampling the space around reﬁned advice region uniformly, and then training a decision tree that covers as many of the sampled points as possible. [sent-236, score-0.748]

89 This naive approach to rule extraction from reﬁned advice is shown here only to illustrate that it is possible to produce very useful domain-expert-interpretable rules from reﬁnement. [sent-237, score-0.868]

90 More efﬁcient and accurate rule extraction techniques inspired by SVM-based rule extraction (for example, [7]) are currently under investigation. [sent-238, score-0.13]

91 [Plot: Testing Error (%) vs. Number of Training Examples for svm, kbsvm, rrsvm, arksvm-sla, arksvm-sqp] Figure 4: Wargus data set, Section 4. [sent-239, score-0.197]

92 Once the humans learned this task, they were asked to provide advice via a GUI-based interface based on speciﬁc examples. [sent-260, score-0.728]

93 This setting lends itself very well to reﬁnement as the advice collected from human experts represents the sum of their experiences with the domain, but is by no means perfect or exact. [sent-261, score-0.765]

94 5 Conclusions and Future Work We have presented two novel knowledge-discovery methods: arkSVM-sla and arkSVM-sqp, that allow SVM methods to not only make use of advice provided by human experts but to reﬁne that advice using labeled data to improve the advice. [sent-267, score-1.509]

95 These methods are an advance over previous knowledge-based SVM methods which either did not reﬁne advice [6] or could only reﬁne simple aspects of the advice [12]. [sent-268, score-1.456]

96 Experimental results demonstrate that our arkSVM methods can make use of inaccurate advice and revise it to better fit the data. [sent-269, score-0.767]

97 Prevalence of diabetes, impaired fasting glucose, and impaired glucose tolerance in U. [sent-348, score-0.109]

98 Reﬁning rules incorporated into knowledge-based support vector learners via successive linear programming. [sent-384, score-0.207]

99 Changing patterns of Type 2 diabetes incidence among Pima Indians. [sent-414, score-0.346]

100 Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. [sent-441, score-0.326]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('advice', 0.728), ('diabetes', 0.326), ('nement', 0.19), ('ui', 0.169), ('di', 0.156), ('gluc', 0.139), ('kbsvms', 0.139), ('fi', 0.133), ('fij', 0.132), ('zi', 0.126), ('arksvm', 0.116), ('re', 0.115), ('kbsvm', 0.104), ('wargus', 0.104), ('bilinear', 0.092), ('maclin', 0.081), ('bmi', 0.079), ('successive', 0.079), ('rules', 0.075), ('xw', 0.061), ('fjt', 0.058), ('kunapuli', 0.058), ('rrsvm', 0.058), ('pima', 0.056), ('polyhedral', 0.053), ('glucose', 0.047), ('archers', 0.046), ('arksvms', 0.046), ('indians', 0.046), ('mangasarian', 0.046), ('shavlik', 0.046), ('rule', 0.046), ('expert', 0.043), ('rki', 0.041), ('inaccurate', 0.039), ('programming', 0.037), ('fit', 0.037), ('experts', 0.037), ('svm', 0.035), ('hasmoat', 0.035), ('pedf', 0.035), ('rrsvms', 0.035), ('fung', 0.033), ('support', 0.033), ('impaired', 0.031), ('touches', 0.031), ('tower', 0.031), ('walker', 0.031), ('constraints', 0.029), ('owing', 0.028), ('bennett', 0.027), ('feasible', 0.026), ('trades', 0.025), ('nements', 0.025), ('margin', 0.025), ('age', 0.024), ('svms', 0.024), ('dj', 0.023), ('ning', 0.023), ('anns', 0.023), ('ballistas', 0.023), ('duluth', 0.023), ('eb', 0.023), ('footmen', 0.023), ('hatched', 0.023), ('knowler', 0.023), ('pedigree', 0.023), ('sla', 0.023), ('sqp', 0.023), ('domain', 0.022), ('quadratic', 0.021), ('fj', 0.021), ('dij', 0.021), ('machines', 0.021), ('heredity', 0.02), ('incidence', 0.02), ('lps', 0.02), ('plasma', 0.02), ('relatives', 0.02), ('health', 0.02), ('incorporated', 0.02), ('region', 0.02), ('regions', 0.019), ('extraction', 0.019), ('smola', 0.018), ('formulation', 0.018), ('proposition', 0.018), ('wj', 0.018), ('interpretable', 0.017), ('person', 0.017), ('toy', 0.017), ('alternately', 0.017), ('care', 0.017), ('solve', 0.016), ('programs', 0.016), ('labeled', 0.016), ('optimizes', 0.016), ('wrt', 0.016), ('cccp', 0.016), ('players', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 27 nips-2011-Advice Refinement in Knowledge-Based SVMs

Author: Gautam Kunapuli, Richard Maclin, Jude W. Shavlik

2 0.067715682 302 nips-2011-Variational Learning for Recurrent Spiking Networks

Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner

Abstract: We derive a plausible learning rule for feedforward, feedback and lateral connections in a recurrent network of spiking neurons. Operating in the context of a generative model for distributions of spike sequences, the learning mechanism is derived from variational inference principles. The synaptic plasticity rules found are interesting in that they are strongly reminiscent of experimental Spike Time Dependent Plasticity, and in that they differ for excitatory and inhibitory neurons. A simulation confirms the method’s applicability to learning both stationary and temporal spike patterns.

3 0.04676412 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

Author: Sebastian A. Kurtek, Anuj Srivastava, Wei Wu

Abstract: While signal estimation under random amplitudes, phase shifts, and additive noise is studied frequently, the problem of estimating a deterministic signal under random time-warpings has been relatively unexplored. We present a novel framework for estimating the unknown signal that utilizes the action of the warping group to form an equivalence relation between signals. First, we derive an estimator for the equivalence class of the unknown signal using the notion of Karcher mean on the quotient space of equivalence classes. This step requires the use of Fisher-Rao Riemannian metric and a square-root representation of signals to enable computations of distances and means under this metric. Then, we define a notion of the center of a class and show that the center of the estimated class is a consistent estimator of the underlying unknown signal. This estimation algorithm has many applications: (1) registration/alignment of functional data, (2) separation of phase/amplitude components of functional data, (3) joint demodulation and carrier estimation, and (4) sparse modeling of functional data. Here we demonstrate only (1) and (2): Given signals are temporally aligned using nonlinear warpings and, thus, separated into their phase and amplitude components. The proposed method for signal alignment is shown to have state of the art performance using Berkeley growth, handwritten signatures, and neuroscience spike train data.

4 0.04614231 80 nips-2011-Efficient Online Learning via Randomized Rounding

Author: Nicolò Cesa-bianchi, Ohad Shamir

Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning.

5 0.043777443 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

Author: Animashree Anandkumar, Kamalika Chaudhuri, Daniel J. Hsu, Sham M. Kakade, Le Song, Tong Zhang

Abstract: This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics.

6 0.043283179 288 nips-2011-Thinning Measurement Models and Questionnaire Design

7 0.042765792 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

8 0.042699859 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

9 0.04158707 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

10 0.041382194 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

11 0.03762348 258 nips-2011-Sparse Bayesian Multi-Task Learning

12 0.036835309 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

13 0.033856086 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions

14 0.032378491 276 nips-2011-Structured sparse coding via lateral inhibition

15 0.03188248 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

16 0.031775765 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

17 0.031397674 220 nips-2011-Prediction strategies without loss

18 0.031213513 227 nips-2011-Pylon Model for Semantic Segmentation

19 0.031168398 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

20 0.030619174 240 nips-2011-Robust Multi-Class Gaussian Process Classification

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.095), (1, -0.008), (2, -0.0), (3, -0.004), (4, 0.004), (5, -0.0), (6, -0.017), (7, -0.0), (8, -0.047), (9, -0.004), (10, 0.001), (11, 0.001), (12, 0.026), (13, -0.013), (14, -0.001), (15, 0.001), (16, -0.027), (17, 0.024), (18, 0.042), (19, 0.006), (20, 0.009), (21, -0.034), (22, 0.028), (23, 0.023), (24, 0.056), (25, 0.001), (26, -0.035), (27, -0.016), (28, -0.002), (29, 0.053), (30, -0.038), (31, -0.107), (32, 0.008), (33, -0.05), (34, -0.009), (35, -0.042), (36, 0.012), (37, 0.015), (38, 0.062), (39, 0.047), (40, -0.094), (41, -0.095), (42, 0.032), (43, -0.003), (44, -0.017), (45, -0.034), (46, 0.055), (47, -0.009), (48, 0.008), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90676373 27 nips-2011-Advice Refinement in Knowledge-Based SVMs

Author: Gautam Kunapuli, Richard Maclin, Jude W. Shavlik

2 0.53798306 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

Author: Omar Z. Khan, Pascal Poupart, John-mark M. Agosta

Abstract: In this paper, we derive a method to reﬁne a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test’s relative diagnostic value. We demonstrate that consistency with an expert’s test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to ﬁnd a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario. 1 INTRODUCTION The problem of learning-by-example has the promise to create strong models from a restricted number of cases; certainly humans show the ability to generalize from limited experience. Machine Learning has seen numerous approaches to learning task performance by imitation, going back to some of the approaches to inductive learning from examples [14]. Of particular interest are problemsolving tasks that use a model to infer the source, or cause of a problem from a sequence of investigatory steps or tests. The speciﬁc example we adopt is a diagnostic task such as appears in medicine, electro-mechanical fault isolation, customer support and network diagnostics, among others. We deﬁne a diagnostic sequence as consisting of the assignment of values to a subset of tests. The diagnostic process embodies the choice of the best next test to execute at each step in the sequence, by measuring the diagnostic value among the set of available tests at each step, that is, the ability of a test to distinguish among the possible causes. One possible implementation with which to carry out this process, the one we apply, is a Bayes network [9]. 
As with all model-based approaches, provisioning an adequate model can be daunting, resulting in a “knowledge elicitation bottleneck.” A recent approach for easing the bottleneck grew out of the realization that the best time to gain an expert’s insight into the model structure is during the diagnostic process. Recent work in “QueryBased Diagnostics” [1] demonstrated a way to improve model quality by merging model use and model building into a single process. More precisely the expert can take steps to modify the network structure to add or remove nodes or links, interspersed within the diagnostic sequence. In this paper we show how to extend this variety of learning-by-example to include also reﬁnement of model parameters based on the expert’s choice of test, from which we determine constraints. The nature of these constraints, as shown herein, is derived from the value of the tests to distinguish causes, a value referred to informally as value of information [10]. It is the effect of these novel constraints on network parameter learning that is elucidated in this paper. ∗ J. M. Agosta is no longer afﬁliated with Intel Corporation 1 Conventional statistical learning approaches are not suited to this problem, since the number of cases available from diagnostic sessions is small, and the data from any case is sparse. (Only a fraction of the tests are taken.) But more relevant is that one diagnostic sequence from an expert user represents the true behavior expected of the model, rather than a noisy realization of a case generated by the true model. We adopt a Bayesian approach, which offers a principled way to incorporate knowledge (constraints and data, when available) and also consider weakening the constraints, by applying a likelihood to them, so that possibly conﬂicting constraints can be incorporated consistently. Sec. 2 reviews related work and Sec. 3 provides some background on diagnostic networks and model consistency. Then, Sec. 
4 describes an augmented Bayesian network that incorporates constraints implied by an expert’s choice of tests. Some sampling techniques are proposed to ﬁnd the Maximum a posterior setting of the parameters given the constraints (and any data available). The approach is evaluated in Sec. 5 on synthetic data and a real world manufacturing diagnostic scenario. Finally, Sec. 6 discusses some future work. 2 RELATED WORK Parameter learning for Bayesian networks can be viewed as searching in a high-dimensional space. Adopting constraints on the parameters based on some domain knowledge is a way of pruning this search space and learning the parameters more efﬁciently, both in terms of data needed and time required. Qualitative probabilistic networks [17] allow qualitative constraints on the parameter space to be speciﬁed by experts. For instance, the inﬂuence of one variable on another, or the combined inﬂuence of multiple variables on another variable [5] leads to linear inequalities on the parameters. Wittig and Jameson [18] explain how to transform the likelihood of violating qualitative constraints into a penalty term to adjust maximum likelihood, which allows gradient ascent and Expectation Maximization (EM) to take into account linear qualitative constraints. Other examples of qualitative constraints include some parameters being larger than others, bounded in a range, within ϵ of each other, etc. Various proposals have been made that exploit such constraints. Altendorf et al. [2] provide an approximate technique based on constrained convex optimization for parameter learning. Niculescu et al. [15] also provide a technique based on constrained optimization with closed form solutions for different classes of constraints. Feelders [6] provides an alternate method based on isotonic regression while Liao and Ji [12] combine gradient descent with EM. 
de Campos and Ji [4] also use constrained convex optimization; however, they use Dirichlet priors on the parameters to incorporate any additional knowledge. Mao and Lebanon [13] also use Dirichlet priors, but they use probabilistic constraints to allow inaccuracies in the specification of the constraints. A major difference between our technique and previous work lies in the type of constraints. Our constraints do not need to be explicitly specified by an expert. Instead, we passively observe the expert and learn from what choices are made and not made [16]. Furthermore, as we shall show later, our constraints are non-convex, preventing the direct application of existing techniques that assume linear or convex functions. We use Beta priors on the parameters, which can easily be extended to Dirichlet priors as in previous work. We incorporate constraints in an augmented Bayesian network, similar to Liang et al. [11], though their constraints are on model predictions as opposed to ours, which are on the parameters of the network. Finally, we also use the notion of probabilistic constraints to handle potential mistakes made by experts.
3 BACKGROUND
3.1 DIAGNOSTIC BAYES NETWORKS
We consider the class of bipartite Bayes networks that are widely used as diagnostic models, though our approach can be used for networks with any structure. The network forms a sparse, directed, causal graph, where arcs go from causes to observable node variables. We use upper case to denote random variables: C for causes, and T for observables (tests). Lower case letters denote values in the domain of a variable, e.g. $c \in \mathrm{dom}(C) = \{c, \bar{c}\}$, and bold letters denote sets of variables. A set of marginally independent binary-valued node variables $\mathbf{C}$ with distributions Pr(C) represent unobserved causes, and condition the remaining conditionally independent binary-valued test variable nodes $\mathbf{T}$.
Each cause conditions one or more tests; likewise each test is conditioned by one or more causes, resulting in a graph with one or more possibly multiply-connected components. The test variable distributions Pr(T |C) incorporate the further modeling assumption of Independence of Causal Influence, the most familiar example being the Noisy-Or model [8]. To keep the exposition simple, we assume that all variables are binary and that conditional distributions are parametrized by the Noisy-Or; however, the algorithms described in the rest of the paper generalize to any discrete non-binary variable models. Conventionally, unobserved tests are ranked in a diagnostic Bayes network by their Value Of Information (VOI) conditioned on tests already observed. To be precise, VOI is the expected gain in utility if the test were to be observed. The complete computation requires a model equivalent to a partially observable Markov decision process. Instead, VOI is commonly approximated by a greedy computation of the Mutual Information between a test and the set of causes [3]. In this case, it is easy to show that Mutual Information is in turn well approximated to second order by the Gini impurity [7] as shown in Equation 1.
$$GI(C|T) = \sum_{t} \Pr(T = t) \Big[ \sum_{c} \Pr(C = c \mid T = t)\,\big(1 - \Pr(C = c \mid T = t)\big) \Big] \quad (1)$$
We will use the Gini measure as a surrogate for VOI, as a way to rank the best next test in the diagnostic sequence.
3.2 MODEL CONSISTENCY
A model that is consistent with an expert would generate Gini impurity rankings consistent with the expert’s diagnostic sequence. We interpret the expert’s test choices as implying constraints on Gini impurity rankings between tests. To that effect, [1] defines the notions of Cause Consistency and Test Consistency, which indicate whether the cause and test orderings induced by the posterior distribution over causes and the VOI of each test agree with an expert’s observed choice.
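To make the ranking criterion of Equation 1 concrete, the following is a minimal sketch of the expected Gini impurity computed from a joint table over cause configurations and test outcomes. The function name `gini_impurity`, the table layout, and all numbers are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

def gini_impurity(joint):
    """Expected Gini impurity GI(C|T) from a table joint[c, t] = Pr(C=c, T=t)."""
    joint = np.asarray(joint, dtype=float)
    pr_t = joint.sum(axis=0)                 # marginal Pr(T = t)
    total = 0.0
    for t, p_t in enumerate(pr_t):
        if p_t == 0.0:                       # outcome never occurs, contributes nothing
            continue
        post = joint[:, t] / p_t             # posterior Pr(C = c | T = t)
        total += p_t * np.sum(post * (1.0 - post))
    return total

# Hypothetical numbers: one binary cause and two candidate tests.
prior = np.array([0.7, 0.3])                      # Pr(C = false), Pr(C = true)
sharp_test = np.array([[0.9, 0.1], [0.2, 0.8]])   # Pr(T = t | C = c), rows index c
vague_test = np.array([[0.5, 0.5], [0.5, 0.5]])   # outcome independent of C

gi_sharp = gini_impurity(prior[:, None] * sharp_test)
gi_vague = gini_impurity(prior[:, None] * vague_test)
print(gi_sharp, gi_vague)
```

Under these assumed numbers the uninformative test leaves the impurity at its prior value, while the informative test lowers it, so a greedy ranking would prefer the informative test.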
Assuming that the expert greedily chooses the most informative test $T^*$ (i.e., the test that yields the lowest Gini impurity) at each step, then the model is consistent with the expert’s choices when the following constraints are satisfied:
$$GI(C|T^*) \le GI(C|T_i) \quad \forall i \quad (2)$$
We demonstrate next how to exploit these constraints to refine the Bayes network.
4 MODEL REFINEMENT
Consider a simple diagnosis example with two possible causes C1 and C2 and two tests T1 and T2 as shown in Figure 1. To keep the exposition simple, suppose that the priors for each cause are known (generally separate data is available to estimate these), but the conditional distribution of each test is unknown. Using the Noisy-Or parameterization for the conditional distributions, the number of parameters is linear in the number of parents instead of exponential.
$$\Pr(T_i = \mathrm{true} \mid C) = 1 - (1 - \theta^i_0) \prod_{j \,\mid\, C_j = \mathrm{true}} (1 - \theta^i_j) \quad (3)$$
Here, $\theta^i_0 = \Pr(T_i = \mathrm{true} \mid C_j = \mathrm{false}\ \forall j)$ is the leak probability that $T_i$ will be true when none of the causes are true and $\theta^i_j = \Pr(T_i = \mathrm{true} \mid C_j = \mathrm{true}, C_k = \mathrm{false}\ \forall k \ne j)$ is the link reliability, which indicates the independent contribution of cause $C_j$ to the probability that test $T_i$ will be true. In the rest of this section, we describe how to learn the θ parameters while respecting the constraints implied by test consistency.
[Figure 1: Network with 2 causes and 2 tests. Figure 2: Augmented network with parameters and constraints. Figure 3: Augmented network extended to handle inaccurate feedback.]
4.1 TEST CONSISTENCY CONSTRAINTS
Suppose that an expert chooses test T1 instead of test T2 during the diagnostic process. This ordering by the expert implies that the current model (parametrized by the θ’s) must be consistent with the constraint GI(C|T2) − GI(C|T1) ≥ 0. Using the definition of Gini impurity in Eq. 1, we can rewrite the constraint for the network shown in Fig.
1 as follows:
$$\sum_{t_2} \Big( \Pr(t_2) - \sum_{c_1,c_2} \frac{\big(\Pr(t_2 \mid c_1, c_2)\Pr(c_1)\Pr(c_2)\big)^2}{\Pr(t_2)} \Big) - \sum_{t_1} \Big( \Pr(t_1) - \sum_{c_1,c_2} \frac{\big(\Pr(t_1 \mid c_1, c_2)\Pr(c_1)\Pr(c_2)\big)^2}{\Pr(t_1)} \Big) \ge 0 \quad (4)$$
The first sum is GI(C|T2) and the second is GI(C|T1), so Eq. 4 is the constraint GI(C|T2) − GI(C|T1) ≥ 0 written out term by term. Furthermore, using the Noisy-Or encoding from Eq. 3, we can rewrite the constraint as a polynomial in the θ’s. This polynomial is non-linear, and in general, not concave. The feasible space may consist of disconnected regions. Fig. 4 shows the surface corresponding to the polynomial for the case where $\theta^i_0 = 0$ and $\theta^i_1 = 0.5$ for each test i, which leaves $\theta^1_2$ and $\theta^2_2$ as the only free variables. The parameters’ feasible space, satisfying the constraint, consists of the two disconnected regions where the surface is positive.
4.2 AUGMENTED BAYES NETWORK
Our objective is to learn the θ parameters of diagnostic Bayes networks given test constraints of the form described in Eq. 4. To deal with non-convex constraints and disconnected feasible regions, we pursue a Bayesian approach whereby we explicitly model the parameters and constraints as random variables in an augmented Bayes network (see Fig. 2). This allows us to frame the problem of learning the parameters as an inference problem in a hybrid Bayes network of discrete (T, C, V) and continuous (Θ) variables. As we will see shortly, this augmented Bayes network provides a unifying framework to simultaneously learn from constraints and data, to deal with possibly inconsistent constraints, and to express preferences over the degree of satisfaction of the constraints. We encode the constraint derived from the expert feedback as a binary random variable V in the Bayes network. If V is true the constraint is satisfied, otherwise it is violated. Thus, if V is true then Θ lies in the positive region of Fig. 4, and if V is false then Θ lies in the negative region. We model the CPT for V as Pr(V = true | Θ) = max(0, π), where π = GI(C|T2) − GI(C|T1) is the left-hand side of the constraint. Note that the value of GI(C|T) lies in the interval [0,1], so π never exceeds 1 and max(0, π) is always a valid probability.
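The constraint of Section 4.1 can be evaluated numerically for a small network. The sketch below assumes the Noisy-Or form of Eq. 3 and the expanded Gini terms of Eq. 4; the helper names (`noisy_or`, `gini_given_test`, `constraint_margin`) and the parameter values are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def noisy_or(leak, links, causes):
    """Pr(T = true | causes) under the Noisy-Or model (Eq. 3)."""
    p_false = 1.0 - leak
    for theta_j, c_j in zip(links, causes):
        if c_j:                              # only active causes contribute
            p_false *= 1.0 - theta_j
    return 1.0 - p_false

def gini_given_test(priors, leak, links):
    """GI(C|T) for one test, summing over joint cause configurations."""
    configs = list(itertools.product([False, True], repeat=len(priors)))
    joint = np.zeros((len(configs), 2))      # columns: T = false, T = true
    for k, cfg in enumerate(configs):
        p_c = np.prod([p if c else 1.0 - p for p, c in zip(priors, cfg)])
        p_t = noisy_or(leak, links, cfg)
        joint[k] = [p_c * (1.0 - p_t), p_c * p_t]
    pr_t = joint.sum(axis=0)
    # sum_t ( Pr(t) - sum_c Pr(c, t)^2 / Pr(t) ), the form of each term in Eq. 4
    return float(sum(p - np.sum(joint[:, t] ** 2) / p
                     for t, p in enumerate(pr_t) if p > 0))

def constraint_margin(priors, test1, test2):
    """GI(C|T2) - GI(C|T1): non-negative when the model agrees T1 is preferred."""
    return gini_given_test(priors, *test2) - gini_given_test(priors, *test1)

# Hypothetical parameters: each test is (leak, [link reliabilities]).
priors = [0.5, 0.5]
informative = (0.0, [0.5, 0.9])
silent = (0.0, [0.0, 0.0])                   # never fires, so it reveals nothing
print(constraint_margin(priors, informative, silent))
```

With these assumed values the margin is positive, meaning a model with these parameters would be consistent with an expert who picks the informative test first; swapping the two arguments flips the sign.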
The intuition behind this definition of the CPT for V is that a constraint is more likely to be satisfied if the parameters lie in the interior of the constraint region. We place a Beta prior over each Θ parameter. Since the test variables are conditioned on the Θ parameters that are now part of the network, their conditional distributions become known. For instance, the conditional distribution for $T_i$ (given in Eq. 3) is fully defined given the Noisy-Or parameters $\theta^i_j$. Hence the problem of learning the parameters becomes an inference problem to compute posteriors over the parameters given that the constraint is satisfied (and any data). In practice, it is more convenient to obtain a single value for the parameters instead of a posterior distribution since it is easier to make diagnostic predictions based on one Bayes network. We estimate the parameters by computing a maximum a posteriori (MAP) hypothesis given that the constraint is satisfied (and any data): $\Theta^* = \arg\max_{\Theta} \Pr(\Theta \mid V = \mathrm{true})$.
Algorithm 1 Pseudo Code for Gibbs Sampling, Stochastic Hill Climbing and Greedy Search
1 Fix observed variables, let V = true and randomly sample feasible starting state S
2 for i = 1 to #samples
3   for j = 1 to #hiddenVariables
4     acceptSample = false; k = 0
5     repeat
6       Sample s′ from conditional of j-th hidden variable Sj
7       S′ = S; S′j = s′
8       if Sj is cause or test, then acceptSample = true
9       elseif S′ obeys constraints V∗
10        if algo == Gibbs
11          Sample u from uniform distribution, U(0,1)
12          if u < p(S′) / (M q(S′)), where p and q are the true and proposal distributions and M > 1
13            acceptSample = true
14        elseif algo == StochasticHillClimbing
15          if likelihood(S′) > likelihood(S), then acceptSample = true
16        elseif algo == Greedy, then acceptSample = true
17      elseif algo == Greedy
18        k = k + 1
19        if k == maxIterations, then s′ = Sj; acceptSample = true
20    until acceptSample == true
21    Sj = s′
4.3 MAP ESTIMATION
Previous approaches for parameter learning with domain knowledge include
modified versions of EM or some other optimization techniques that account for linear/convex constraints on the parameters. Since our constraints are non-convex, we propose a new approach based on Gibbs sampling to approximate the posterior distribution, from which we compute the MAP estimate. Although the technique converges to the MAP in the limit, it may require excessive time. Hence, we modify Gibbs sampling to obtain more efficient stochastic hill climbing and greedy search algorithms with anytime properties. The pseudo code for our Gibbs sampler is provided in Algorithm 1. The two key steps are sampling the conditional distributions of each variable (line 6) and rejection sampling to ensure that the constraints are satisfied (lines 9 and 12). We sample each variable given the rest according to the following distributions:
$$t_i \sim \Pr(T_i \mid c, \theta^i) \quad \forall i \quad (5)$$
$$c_j \sim \Pr(C_j \mid c - c_j, t, \theta) \propto \prod_{j} \Pr(C_j) \prod_{i} \Pr(t_i \mid c, \theta^i) \quad \forall j \quad (6)$$
$$\theta^i_j \sim \Pr(\Theta^i_j \mid \Theta - \Theta^i_j, t, c, v) \propto \Pr(v \mid t, \Theta) \prod_{i} \Pr(t_i \mid c_j, \theta^i) \quad \forall i, j \quad (7)$$
The tests and causes are easily sampled from the multinomials as described in the equations above. However, sampling the θ’s is more difficult due to the factor Pr(v|Θ, t) = max(0, π), which is a truncated mixture of Betas. So, instead of sampling θ from its true conditional, we sample it from a proposal distribution that replaces max(0, π) by an un-truncated mixture of Betas equal to π + a, where a is a constant that ensures that π + a is always positive. This is equivalent to ignoring the constraints. Then we ensure that the constraints are satisfied by rejecting the samples that violate the constraints. Once Gibbs sampling has been performed, we obtain a sample that approximates the posterior distribution over the parameters given the constraints (and any data). We return a single setting of the parameters by selecting the sampled instance with the highest posterior probability (i.e., MAP estimate).
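The propose-then-reject idea described above can be sketched in a few lines. This is a simplified illustration, not the authors’ Algorithm 1: it searches only over the θ parameters (tests and causes are assumed unobserved and are not sampled), it takes the constraint as an arbitrary user-supplied `margin` function standing in for π, and the names `stochastic_hill_climb`, `log_beta_pdf`, the Beta(2, 2) prior, and the toy constraint are all assumptions made for the example.

```python
import math
import random

def log_beta_pdf(x, a, b):
    """Log density of Beta(a, b) at x, clamped away from 0 and 1."""
    x = min(max(x, 1e-12), 1.0 - 1e-12)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)

def stochastic_hill_climb(margin, n_params, a=2.0, b=2.0,
                          sweeps=200, max_tries=50, seed=0):
    """Coordinate-wise search for a high-posterior theta with margin(theta) > 0."""
    rng = random.Random(seed)

    def log_post(theta):
        m = margin(theta)
        if m <= 0.0:                         # constraint violated: reject
            return -math.inf
        # unnormalized log posterior: log max(0, pi) + log Beta prior
        return math.log(m) + sum(log_beta_pdf(t, a, b) for t in theta)

    # Feasible starting state by rejection from the prior
    # (loops forever if the feasible region is empty).
    theta = [rng.betavariate(a, b) for _ in range(n_params)]
    while margin(theta) <= 0.0:
        theta = [rng.betavariate(a, b) for _ in range(n_params)]
    best = log_post(theta)

    for _sweep in range(sweeps):
        for j in range(n_params):
            for _attempt in range(max_tries):    # cap on rejected proposals
                cand = list(theta)
                cand[j] = rng.betavariate(a, b)  # proposal ignores the constraint
                score = log_post(cand)
                if score > best:                 # accept feasible improvements only
                    theta, best = cand, score
                    break
    return theta, best

# Toy stand-in for the Gini-difference constraint: require theta[0] > theta[1].
theta_map, score = stochastic_hill_climb(lambda th: th[0] - th[1], n_params=2)
print(theta_map, score)
```

The cap `max_tries` plays the role of the greedy variant’s limit on rejected samples: when it is reached, the current value of that parameter is kept and the search moves on to the next one.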
Since we will only return the MAP estimate, it is possible to speed up the search by modifying Gibbs sampling. In particular, we obtain a stochastic hill climbing algorithm by accepting a new sample only if its posterior probability improves upon that of the previous sample (line 15). Thus, each iteration of the stochastic hill climber requires more time, but always improves the solution. As the number of constraints grows and the feasibility region shrinks, the Gibbs sampler and stochastic hill climber will reject most samples. We can mitigate this by using a Greedy sampler that caps the number of rejected samples, after which it abandons the sampling for the current variable to move on to the next variable (line 19). Even though the feasibility region is small overall, it may still be large in some dimensions, so it makes sense to try sampling another variable (that may have a larger range of feasible values) when it is taking too long to find a new feasible value for the current variable.
[Figure 4: Difference in Gini impurity for the network in Fig. 1 when $\theta^1_2$ and $\theta^2_2$ are the only parameters allowed to vary. Figure 5: Posterior over parameters computed through calculation after discretization. Figure 6: Posterior over parameters calculated through sampling. Axis labels recovered from the figures: Posterior Probability; Difference in Gini Impurity; Link Reliability of Test 2 and Cause 1; Link Reliability of Test 2 and Cause 2.]
4.4 MODEL REFINEMENT WITH INCONSISTENT CONSTRAINTS
So far, we have assumed that the expert’s actions generate a feasible region as a consequence of consistent constraints. We handle inconsistencies by further extending our augmented diagnostic Bayes network. We treat the observed constraint variable, V, as a probabilistic indicator of the true constraint $V^*$ as shown in Figure 3.
We can easily extend our techniques for computing the MAP to cater for this new constraint node by sampling an extra variable.
5 EVALUATION AND EXPERIMENTS
5.1 EVALUATION CRITERIA
Formally, for $M^*$, the true model that we aim to learn, the diagnostic process determines the choice of best next test as the one with the smallest Gini impurity. If the correct choice for the next test is known (such as demonstrated by an expert), we can use this information to include a constraint on the model. We denote by $\mathbf{V}^+$ the set of observed constraints and by $\mathbf{V}^*$ the set of all possible constraints that hold for $M^*$. Having only observed $\mathbf{V}^+$, our technique will consider any $M^+ \in \mathbf{M}^+$ as a possible true model, where $\mathbf{M}^+$ is the set of all models that obey $\mathbf{V}^+$. We denote by $\mathbf{M}^*$ the set of all models that are diagnostically equivalent to $M^*$ (i.e., obey $\mathbf{V}^*$ and would recommend the same steps as $M^*$) and by $M^{\mathrm{MAP}}_{\mathbf{V}^+}$ the particular model obtained by MAP estimation based on the constraints $\mathbf{V}^+$. Similarly, when a dataset $D$ is available, we denote by $M^{\mathrm{MAP}}_{D}$ the model obtained by MAP estimation based on $D$ and by $M^{\mathrm{MAP}}_{D\mathbf{V}^+}$ the model based on $D$ and $\mathbf{V}^+$. Ideally we would like to find the true underlying model $M^*$, hence we will report the KL divergence between the models found and $M^*$. However, other diagnostically equivalent models may recommend the same tests as $M^*$ and thus have similar constraints, so we also report test consistency with $M^*$ (i.e., the number of recommended tests that are the same).
5.2 CORRECTNESS OF MODEL REFINEMENT
Given $\mathbf{V}^*$, our technique for model adjustment is guaranteed to choose a model $M^{\mathrm{MAP}} \in \mathbf{M}^*$ by construction.
If any constraint $V^* \in \mathbf{V}^*$ is violated, the rejection sampling step of our technique would reject that set of parameters. To illustrate this, consider the network in Fig. 2. There are six parameters (four link reliabilities and two leak parameters). Let us fix the leak parameters and the link reliability from the first cause to each test. Now we can compute the posterior surface over the two variable parameters after discretizing each parameter in small steps and then calculating the posterior probability at each step as shown in Fig. 5. We can compare this surface with that obtained after Gibbs sampling using our technique as shown in Fig. 6. We can see that our technique recovers the posterior surface from which we can compute the MAP. We obtain the same MAP estimate with the stochastic hill climbing and greedy search algorithms.
[Figure 7: Mean KL-divergence and one standard deviation for a 3 cause 3 test network on learning with data, constraints and data+constraints. Figure 8: Test Consistency for a 3 cause 3 test network on learning with data, constraints and data+constraints. Figure 9: Convergence rate comparison. Labels recovered from the figures: Number of constraints used; Percentage of tests correctly predicted; Negative Log Likelihood of MAP Estimate; Elapsed Time (plotted on log scale from 0 to 1500 seconds); legends Data Only, Constraints Only, Data+Constraints and Gibbs Sampling, Stochastic Hill Climbing, Greedy Sampling.]
5.3 EXPERIMENTAL RESULTS ON SYNTHETIC PROBLEMS
We start by presenting our results on a 3-cause by 3-test fully-connected bipartite Bayes network. We assume that there exists some $M^* \in \mathbf{M}^*$ that we want to learn given $\mathbf{V}^+$. We use our technique to find $M^{\mathrm{MAP}}$. To evaluate $M^{\mathrm{MAP}}$, we first compute the constraints $\mathbf{V}^*$ for $M^*$ to get the feasible region associated with the true model.
Next, we sample 100 other models from this feasible region that are diagnostically equivalent. We compare these models with $M^{\mathrm{MAP}}$ (after collecting 200 samples with non-informative priors for the parameters). We compute the KL-divergence of $M^{\mathrm{MAP}}$ with respect to each sampled model. We expect KL-divergence to decrease as the number of constraints in $\mathbf{V}^+$ increases since the feasible region becomes smaller. Figure 7 confirms this trend and shows that $M^{\mathrm{MAP}}_{D\mathbf{V}^+}$ has lower mean KL-divergence than $M^{\mathrm{MAP}}_{\mathbf{V}^+}$, which has lower mean KL-divergence than $M^{\mathrm{MAP}}_{D}$. The data points in $D$ are limited to the results of the diagnostic sessions needed to obtain $\mathbf{V}^+$. As constraints increase, more data is available and so the results for the data-only approach also improve with increasing constraints. We also compare the test consistency when learning from data only, constraints only or both. Given a fixed number of constraints, we enumerate the unobserved trajectories, and then compute the highest ranked test using the learnt model and the sampled true models, for each trajectory. The test consistency is reported as a percentage, with 100% consistency indicating that the learned and true models had the same highest ranked tests on every trajectory. Figure 8 presents these percentages for the greedy sampling technique (the results are similar for the other techniques). It again appears that learning parameters with both constraints and data is better than learning with only constraints, which is most of the time better than learning with only data. Figure 9 compares the convergence rate of each technique to find the MAP estimate. As expected, Stochastic Hill Climbing and Greedy Sampling take less time than Gibbs sampling to find parameter settings with high posterior probability.
5.4 EXPERIMENTAL RESULTS ON REAL-WORLD PROBLEMS
We evaluate our technique on a real-world diagnostic network collected and reported by Agosta et al.
[1], where the authors collected detailed session logs over a period of seven weeks in which the entire diagnostic sequence was recorded. The sequences intermingle model building and querying phases. The model network structure was inferred from an expert’s sequence of positing causes and tests. Test-ranking constraints were deduced from the expert’s test query sequences once the network structure was established. The 157 sessions captured over the seven weeks resulted in a Bayes network with 115 tests, 82 root causes and 188 arcs. The network consists of several disconnected sub-networks, each identified with a symptom represented by the first test in the sequence, and all subsequent tests applied within the same subnet. There were 20 sessions from which we were able to observe trajectories with at least two tests, resulting in a total of 32 test constraints. We pruned our diagnostic network to remove the sub-networks with no constraints to get a Bayes network with 54 tests, 30 root causes, and 67 parameters divided in 7 sub-networks, as shown in Figure 10, on which we apply our model refinement technique to learn the parameters for each sub-network separately. Since we don’t have the true underlying network and the full set of constraints (more constraints could be observed in future diagnostic sessions), we treated the 32 constraints as if they were $\mathbf{V}^*$ and the corresponding feasible region $\mathbf{M}^*$ as if it contained models diagnostically equivalent to the unknown true model.
[Figure 10: Diagnostic Bayesian network collected from user trials and pruned to retain sub-networks with at least one constraint. Figure 11: KL divergence comparison as the number of constraints increases for the real world problem. Labels recovered from Figure 11: KL-divergence when computing joint over all tests; Number of constraints used; legend Data Only, Constraints Only, Data+Constraints.]
Figure 11 reports the KL divergence between the models found by our algorithms and sampled models from $\mathbf{M}^*$ as we increase the number of constraints. With such limited constraints and consequently large feasible regions, it is not surprising that the variation in KL divergence is large. Again, the MAP estimate based on both the constraints and the data has lower KL divergence than constraints only and data only.
6 CONCLUSION AND FUTURE WORK
In summary, we presented an approach that can learn the parameters of a Bayes network based on constraints implied by test consistency and any data available. While several approaches exist to incorporate qualitative constraints in learning procedures, our work makes two important contributions. First, this is the first approach that exploits implicit constraints based on value of information assessments. Second, it is the first approach that can handle non-convex constraints. We demonstrated the approach on synthetic data and on a real-world manufacturing diagnostic problem. Since data is generally sparse in diagnostics, this work makes an important advance to mitigate the model acquisition bottleneck, which has prevented the widespread application of diagnostic networks so far. In the future, it would be interesting to generalize this work to reinforcement learning in applications where data is sparse, but constraints may be inferred from expert interactions.
Acknowledgments
This work was supported by a grant from Intel Corporation.
References
[1] John Mark Agosta, Omar Zia Khan, and Pascal Poupart. Evaluation results for a query-based diagnostics application. In The Fifth European Workshop on Probabilistic Graphical Models (PGM 10), Helsinki, Finland, September 13–15 2010.
[2] Eric E. Altendorf, Angelo C. Restificar, and Thomas G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of Twenty First Conference on Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 2005.
[3] Brigham S. Anderson and Andrew W. Moore. Fast information value for graphical models. In Proceedings of Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 51–58, Vancouver, BC, Canada, December 2005.
[4] Cassio P. de Campos and Qiang Ji. Improving Bayesian network parameter learning using constraints. In International Conference in Pattern Recognition (ICPR), Tampa, FL, USA, 2008.
[5] Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks: combining qualitative and quantitative information. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 141–148, Montreal, QC, Canada, 1995.
[6] Ad J. Feelders. A new parameter learning method for Bayesian networks with qualitative influences. In Proceedings of Twenty Third International Conference on Uncertainty in Artificial Intelligence (UAI), Vancouver, BC, July 2007.
[7] María Angeles Gil and Pedro Gil. A procedure to test the suitability of a factor for stratification in estimating diversity. Applied Mathematics and Computation, 43(3):221–229, 1991.
[8] David Heckerman and John S. Breese. Causal independence for probability assessment and inference using Bayesian networks. IEEE Systems, Man, and Cybernetics, 26(6):826–831, November 1996.
[9] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–56, 1995.
[10] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, August 1966.
[11] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential families. In Proceedings of Twenty Sixth Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009.
[12] Wenhui Liao and Qiang Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42:3046–3056, 2009.
[13] Yi Mao and Guy Lebanon. Domain knowledge uncertainty and probabilistic parameter constraints. In Proceedings of Twenty Fifth Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, QC, Canada, 2009.
[14] Ryszard S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111–116, 1984.
[15] Radu Stefan Niculescu, Tom M. Mitchell, and R. Bharat Rao. Bayesian network learning with parameter constraints. Journal of Machine Learning Research, 7:1357–1383, 2006.
[16] Mark A. Peot and Ross D. Shachter. Learning from what you don’t observe. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 439–446, Madison, WI, July 1998.
[17] Michael P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44(3):257–303, August 1990.
[18] Frank Wittig and Anthony Jameson. Exploiting qualitative knowledge in the learning of conditional probabilities of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, July 2000.

3 0.47952616 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

Author: Jacob D. Abernethy, Rafael M. Frongillo

Abstract: Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of “crowdsourcing” prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively “learn” a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set.

4 0.46169078 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama

Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach.

5 0.4615224 288 nips-2011-Thinning Measurement Models and Questionnaire Design

Author: Ricardo Silva

Abstract: Inferring key unobservable features of individuals is an important task in the applied sciences. In particular, an important source of data in ﬁelds such as marketing, social sciences and medicine is questionnaires: answers in such questionnaires are noisy measures of target unobserved features. While comprehensive surveys help to better estimate the latent variables of interest, aiming at a high number of questions comes at a price: refusal to participate in surveys can go up, as well as the rate of missing data; quality of answers can decline; costs associated with applying such questionnaires can also increase. In this paper, we cast the problem of reﬁning existing models for questionnaire data as follows: solve a constrained optimization problem of preserving the maximum amount of information found in a latent variable model using only a subset of existing questions. The goal is to ﬁnd an optimal subset of a given size. For that, we ﬁrst deﬁne an information theoretical measure for quantifying the quality of a reduced questionnaire. Three different approximate inference methods are introduced to solve this problem. Comparisons against a simple but powerful heuristic are presented. 1 Contribution A common goal in the applied sciences is to measure concepts of interest that are not directly observable (Bartholomew et al., 2008). Such is the case in the social sciences, medicine, economics and other ﬁelds, where quantifying key attributes such as “consumer satisfaction,” “anxiety” and “recession” requires the development of indicators: observable variables that are postulated to measure the target latent variables up to some measurement error (Bollen, 1989; Carroll et al., 1995). In a probabilistic framework, this often boils down to a latent variable model (Bishop, 1998). One common setup is to assume each observed indicator Yi as being generated independently given the set of latent variables X. 
Conditioning on any given observed data point Y gives information about the distribution of the latent vector X, which can then be used for ranking, clustering, visualization or smoothing, among other tasks. Figure 1 provides an illustration. Questionnaires from large surveys are sometimes used to provide such indicators, each Yi recording an answer that typically corresponds to a Bernoulli or ordinal variable. For instance, experts can be given questions concerning whether there is freedom of press in a particular nation, as a way of measuring its democratization level (Bollen, 1989; Palomo et al., 2007). Nations can then be clustered or ranked within an interpretable latent space. Long questionnaires nevertheless have drawbacks, as summarized by Stanton et al. (2002) in the context of psychometric studies: Longer surveys take more time to complete, tend to have more missing data, and have higher refusal rates than short surveys. Arguably, then, techniques for reducing the length of scales while maintaining psychometric quality are worthwhile.
Figure 1: (a) A graphical representation of a latent variable model, with latent variables X1 (Industrialization) and X2 (Democratization) and indicators Y1 to Y5. Notice that in general latent variables will be dependent. Here, the question is how to quantify democratization and industrialization levels of nations given observed indicators Y such as freedom of press and gross national product, among others (Bollen, 1989; Palomo et al., 2007). (b) An example of a result implied by the model (adapted from Palomo et al. (2007)), titled “Factor scores: countries in the latent space”: barplots of the conditional distribution of democratization levels given the observed indicators at two time points (Dem1960, Dem1965), ordered by the posterior mean industrialization level.
The distribution of the latent variables given the observations is the basis of the analysis. Our contribution is a methodology for choosing which indicators to preserve (e.g., which items to keep in a questionnaire) given: i.) a latent variable model speciﬁcation of the domain of interest; ii.) a target number of indicators that should be preserved. To accomplish this, we provide: i.) a target objective function that quantiﬁes the amount of information preserved by a choice of a subset of indicators, with respect to the full set; ii.) algorithms for optimizing this choice of subset with respect to the objective function. The general idea is to start with a target posterior distribution of latent variables, deﬁned by some latent variable measurement model M (i.e., PM (X | Y)). We want to choose a subset Yz ⊂ Y so that the resulting conditional distribution PM (X | Yz ) is as close as possible to the original one according to some metric. Model M is provided either by expertise or by numerous standard approaches that can be applied to learn it from data (e.g., methods in Bishop, 2009). We call this task measurement model thinning. Notice that the size of Yz is a domain-dependent choice. Assuming M is a good model for the data, choosing a subset of indicators will incur some information loss. It is up to the analyst to choose a trade-off between loss of information and the design of simpler, cheaper ways of measuring latent variables. Even if a shorter questionnaire is not to be deployed, the outcome of measurement model thinning provides a formal sensitivity analysis of the target latent distribution with respect to the available indicators. The result is useful to generate different insights into the domain. This paper is organized as follows: Section 2 deﬁnes a formal criterion to quantify how appropriate a subset Yz is. Section 3 describes different approaches in which this criterion can be optimized. Related work is brieﬂy discussed in Section 4. 
Experiments with synthetic and real data are discussed in Section 5, followed by the conclusion.

2 An Information-Theoretical Criterion

Our focus is on domains where latent variables are not a by-product of a dimensionality reduction technique, but the target of the analysis, as in the example of Figure 1. That is, measurement error problems where the variables to be recorded are designed specifically to obtain information concerning such unknowns (Carroll et al., 1995; Bartholomew et al., 2008). As such, we postulate that the outcome of any analysis should be a functional of $P_M(X \mid Y)$, the conditional distribution of unobservables X given observables Y within a model M. It is assumed that M specifies the joint $P_M(X, Y)$. We further assume that observed variables are conditionally independent given X, i.e.

$$P_M(X, Y) = P_M(X) \prod_{i=1}^{p} P_M(Y_i \mid X),$$

with p being the number of observed indicators.

If $z \equiv (z_1, z_2, \ldots, z_p)$ is a binary vector of the same dimensionality as Y, and $Y_z$ is the subset of Y corresponding to the non-zero entries of z, we can assess z by the KL divergence

$$KL(P_M(X \mid Y) \,\|\, P_M(X \mid Y_z)) \equiv \int P_M(X \mid Y) \log \frac{P_M(X \mid Y)}{P_M(X \mid Y_z)} \, dX.$$

This is well-defined, since both distributions lie in the same sample space despite the difference of dimensionality between Y and $Y_z$. Moreover, since Y is itself a random vector, our criterion becomes the expected KL divergence

$$\left\langle KL(P_M(X \mid Y) \,\|\, P_M(X \mid Y_z)) \right\rangle_{P_M(Y)},$$

where $\langle \cdot \rangle$ denotes expectation. Our goal is to minimize this function with respect to z. Rearranging this expression to drop all constants that do not depend on z, and multiplying it by −1 to get a maximization problem, we obtain the problem of finding $z^\star$ such that

$$z^\star = \arg\max_z \; \langle \log P_M(Y_z \mid X) \rangle_{P_M(X, Y_z)} - \langle \log P_M(Y_z) \rangle_{P_M(Y_z)}$$
$$\phantom{z^\star} = \arg\max_z \; \sum_{i=1}^{p} z_i \langle \log P_M(Y_i \mid X) \rangle_{P_M(X, Y_i)} + H_M(Y_z)$$
$$\phantom{z^\star} \equiv \arg\max_z \; F_M(z)$$

subject to $\sum_{i=1}^{p} z_i = K$ for a choice of K, and $z_i \in \{0, 1\}$. $H_M(\cdot)$ denotes here the entropy of a distribution parameterized by M.
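To make the criterion concrete, the sketch below evaluates $F_M(z)$ by brute-force enumeration and picks the best subset of a given size. It is only an illustration under stated assumptions: the toy model (one binary latent variable and three conditionally independent binary indicators), its parameter values, and all function names are hypothetical and are not taken from the paper, whose models have continuous latent variables.

```python
import itertools
import math

# Hypothetical toy model (an assumption for illustration only):
# binary latent X and three binary indicators, independent given X.
P_X = [0.5, 0.5]              # P(X = x)
P_Y1_GIVEN_X = [              # P(Y_i = 1 | X = x), one row per indicator
    [0.1, 0.9],               # strongly informative indicator
    [0.3, 0.7],               # moderately informative indicator
    [0.5, 0.5],               # uninformative indicator
]


def p_yi_given_x(i, y, x):
    p1 = P_Y1_GIVEN_X[i][x]
    return p1 if y == 1 else 1.0 - p1


def expected_loglik(i):
    """<log P(Y_i | X)> taken under the joint P(X, Y_i)."""
    total = 0.0
    for x, px in enumerate(P_X):
        for y in (0, 1):
            p = p_yi_given_x(i, y, x)
            if p > 0.0:
                total += px * p * math.log(p)
    return total


def joint_entropy(subset):
    """H(Y_z) for the indicators whose indices are listed in `subset`."""
    h = 0.0
    for ys in itertools.product((0, 1), repeat=len(subset)):
        p = sum(
            px * math.prod(p_yi_given_x(i, y, x) for i, y in zip(subset, ys))
            for x, px in enumerate(P_X)
        )
        if p > 0.0:
            h -= p * math.log(p)
    return h


def objective(subset):
    """F_M(z) = sum_i z_i <log P(Y_i | X)> + H(Y_z)."""
    return sum(expected_loglik(i) for i in subset) + joint_entropy(subset)


def best_subset(k):
    """Exhaustive search over all subsets of size k (feasible only for tiny p)."""
    p = len(P_Y1_GIVEN_X)
    return max(itertools.combinations(range(p), k), key=objective)
```

In this toy model the first term equals $-H(Y_z \mid X)$, so the objective reduces to the mutual information between X and $Y_z$; adding the uninformative indicator therefore leaves the objective unchanged, and exhaustive search with K = 2 keeps the two informative indicators.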
Notice we used the assumption that indicators are mutually independent given X. There is an intuitive appeal of having a joint entropy term to reward not only marginal relationships between indicators and latent variables, but also selections that are jointly diverse. Notice that optimizing this objective function turns out to be equivalent to minimizing the conditional entropy of latent variables given $Y_z$. Motivating conditional entropy from a more fundamental principle illustrates that other functions can be obtained by changing the divergence.

3 Approaches for Approximate Optimization

The problem of optimizing $F_M(z)$ subject to the constraints $\sum_{i=1}^{p} z_i = K$, $z_i \in \{0, 1\}$, is hard not only for its combinatorial nature, but due to the entropy term. This needs to be approximated, and the nature of the approximation should depend on the form taken by M. We will assume that it is possible to efficiently compute any marginals of $P_M(Y)$ of modest dimensionality (say, 10 dimensions). This is the case, for instance, in the probit model for binary data:

$$X \sim N(0, \Sigma), \qquad Y_i^\star \sim N(\Lambda_i^T X + \lambda_{i;0}, 1), \qquad Y_i = 1 \text{ if } Y_i^\star > 0, \text{ and } 0 \text{ otherwise,}$$

where N(m, S) is the multivariate Gaussian distribution with mean m and covariance matrix S. The probit model is one of the most common latent variable models for questionnaire data (Bartholomew et al., 2008), with a straightforward extension to ordinal data. In this model, marginals for a few dozen variables can be obtained efficiently since this corresponds to calculating multivariate Gaussian probabilities (Genz, 1992). Parameters can be fit by a variety of methods (Hahn et al., 2010). We also assume that M allows for the computation of $\langle \log P_M(Y_i \mid X) \rangle_{P_M(X, Y_i)}$ at little cost. Again, in the binary probit model this is simple, since this requires integrating away a single binary variable $Y_i$ and a univariate Gaussian $\Lambda_i^T X$.
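The one-dimensional integration mentioned above can be sketched as follows. In the binary probit model, $P(Y_i = 1 \mid X) = \Phi(\Lambda_i^T X + \lambda_{i;0})$ and $u = \Lambda_i^T X$ is a zero-mean Gaussian with variance $\Lambda_i^T \Sigma \Lambda_i$, so the expectation reduces to an integral over u. The midpoint integration rule, the grid settings, and all function names are assumptions made for this sketch; the paper does not say which numerical scheme was used.

```python
import math


def std_normal_cdf(t):
    """Standard Gaussian CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def loading_variance(lam_i, sigma):
    """Variance of u = Lambda_i^T X when X ~ N(0, Sigma)."""
    n = len(lam_i)
    return sum(lam_i[a] * sigma[a][b] * lam_i[b] for a in range(n) for b in range(n))


def neg_bernoulli_entropy(p):
    """p log p + (1 - p) log(1 - p), taken as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log(p) + (1.0 - p) * math.log(1.0 - p)


def expected_loglik(lam_i, intercept, sigma, n_grid=4001, width=8.0):
    """<log P(Y_i | X)> under P(X, Y_i) for one probit indicator.

    Y_i is summed out analytically; u is integrated with a midpoint rule
    on [-width * sd, width * sd] (an illustrative choice, not the paper's).
    """
    var = loading_variance(lam_i, sigma)
    if var == 0.0:
        # The indicator does not depend on X at all.
        return neg_bernoulli_entropy(std_normal_cdf(intercept))
    sd = math.sqrt(var)
    step = 2.0 * width * sd / n_grid
    total = 0.0
    for k in range(n_grid):
        u = -width * sd + (k + 0.5) * step
        density = math.exp(-0.5 * u * u / var) / math.sqrt(2.0 * math.pi * var)
        total += density * neg_bernoulli_entropy(std_normal_cdf(u + intercept)) * step
    return total
```

With a zero intercept, indicators with larger loading variance give values closer to zero, which matches the intuition that they are less noisy measurements of the latent variables.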
3.1 Gaussian Entropy

One approximation to $F_M(z)$ is to replace its entropy term by the corresponding entropy from some Gaussian distribution $P_N(Y_z)$. The entropy of a Gaussian distribution is, up to an additive constant, half the logarithm of the determinant of its covariance matrix, and hence can be computed in $O(p^3)$ steps. This Gaussian can be chosen as the one closest to $P_M(Y_z)$ in a $KL(P_M \,\|\, P_N)$ sense: that is, the one with the same first and second moments as $P_M(Y_z)$. In our case, computing these moments can be done deterministically (up to numerical error) using standard bivariate quadrature methods. No expectation-propagation (Minka, 2001) is necessary. The corresponding objective function is

$$F_{M;N}(z) \equiv \sum_{i=1}^{p} z_i \langle \log P_M(Y_i \mid X) \rangle_{P_M(X, Y_i)} + 0.5 \log |\Sigma_z|,$$

where $\Sigma_z$ is the covariance matrix of $Y_z$, which for binary and ordinal data has a sensible interpretation. This function is also an upper bound on the exact function, $F_M(z)$, since the Gaussian is the distribution with the largest entropy for a given mean vector and covariance matrix. The resulting function is non-linear in z. In our experiments, we optimize for z using a greedy scheme: for all possible pairs (i, j) such that $z_i = 1$ and $z_j = 0$, we swap their values (so that $\sum_i z_i$ is always K). We choose the pair with the highest increase in $F_{M;N}(z)$ and repeat the process until convergence.

3.2 Entropy with Bounded Neighborhoods

An alternative bound can be derived from a standard fact in information theory: $H(Y \mid S) \le H(Y \mid S')$ for $S' \subseteq S$, where $H(\cdot \mid \cdot)$ denotes conditional entropy. This was exploited by Globerson and Jaakkola (2007) to define an upper bound on the entropy of a distribution as follows: consider a permutation e of the set {1, 2, . . . , p}, with e(i) being the i-th element of e. Denote by e(1 : i) the first i elements of this permutation (an empty set if i < 1). Moreover, let N(e, i) be a subset of e(1 : i − 1). For a given set of variables $Y = \{Y_1, Y_2, \ldots, Y_p\}$ the following bound holds:

$$H(Y_1, Y_2, \ldots, Y_p) = \sum_{i=1}^{p} H(Y_{e(i)} \mid Y_{e(1:i-1)}) \le \sum_{i=1}^{p} H(Y_{e(i)} \mid Y_{N(e,i)}) \qquad (1)$$

If each set N(e, i) is no larger than some constant D, then this bound can be computed in $O(p \cdot 2^D)$ steps for binary probit models. The bound holds for any choice of e, but we want it to be as tight as possible so that it gets weighted in a reasonable way against the other terms in $F_M(\cdot)$. Since the entropy function is decomposable as a sum of functions that depend on i and N(e, i) only, one can minimize this bound with respect to e by using permutation optimization methods such as those of Jaakkola et al. (2010). In our implementation, we use a method similar to Teyssier and Koller (2005) that shuffles neighboring entries of e to generate candidates, chooses the optimal N(e, i) for each i given the candidate e, and picks as the next permutation the candidate e with the greatest decrease in the bound.

Notice that a permutation choice e and neighborhood choices N(e, i) define a Bayesian network where N(e, i) are the parents of $Y_{e(i)}$. Therefore, if this Bayesian network model provides a good approximation to $P_M(Y)$, the bound will be reasonably tight.

Given e, we will further relax this bound with the goal of obtaining an integer programming formulation for the problem of optimizing an upper bound to $F_M(z)$. For any given z, we define the local term $H_L(z, i)$ as

$$H_L(z, i) \equiv H_M(Y_{e(i)} \mid Y_z \cap N(e, i)) = \sum_{S \in P(N(e,i))} \Big( \prod_{j \in S} z_j \Big) \Big( \prod_{k \in N(e,i) \setminus S} (1 - z_k) \Big) H_M(Y_{e(i)} \mid S) \qquad (2)$$

where $P(\cdot)$ denotes the power set of a set. The new approximate objective function becomes

$$F_{M;D}(z) \equiv \sum_{i=1}^{p} z_i \langle \log P_M(Y_i \mid X) \rangle_{P_M(X, Y_i)} + \sum_{i=1}^{p} z_{e(i)} H_L(z, i) \qquad (3)$$

Notice that $H_L(z, i)$ is still an upper bound on $H_M(Y_{e(i)} \mid Y_{e(1:i-1)})$. The intuition is that we are bounding $H_M(Y_z)$ by the entropy of a Bayesian network where a vertex $Y_{e(i)}$ is included if $z_{e(i)} = 1$, with corresponding parents given by $Y_z \cap N(e, i)$. This is a well-defined Bayesian network for any choice of z.
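Bound (1) can be checked numerically on a small discrete distribution. The sketch below computes the exact joint entropy and the sum of conditional entropies with restricted conditioning sets by enumeration. The three-variable binary chain used as the example, its flip probability, and all function names are assumptions introduced for illustration; they are not part of the paper.

```python
import itertools
import math


def marginal(joint, idx):
    """Marginal over the variable positions in `idx` of a dict-based joint."""
    out = {}
    for ys, p in joint.items():
        key = tuple(ys[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def cond_entropy(joint, target, given):
    """H(Y_target | Y_given) = H(Y_target, Y_given) - H(Y_given)."""
    given = tuple(sorted(given))
    return entropy(marginal(joint, (target,) + given)) - entropy(marginal(joint, given))


def neighborhood_bound(joint, order, neighborhoods):
    """Right-hand side of (1): sum over i of H(Y_e(i) | Y_N(e,i))."""
    total = 0.0
    for pos, var in enumerate(order):
        parents = neighborhoods[var]
        # Each neighborhood must only contain variables earlier in the ordering.
        assert set(parents) <= set(order[:pos])
        total += cond_entropy(joint, var, parents)
    return total


def chain_joint(flip=0.2):
    """Hypothetical example: binary chain Y0 -> Y1 -> Y2 with flip noise."""
    joint = {}
    for y0, y1, y2 in itertools.product((0, 1), repeat=3):
        p = 0.5
        p *= (1.0 - flip) if y1 == y0 else flip
        p *= (1.0 - flip) if y2 == y1 else flip
        joint[(y0, y1, y2)] = p
    return joint
```

For the chain example, neighborhoods that match the chain structure give a bound equal to the exact entropy, while conditioning $Y_2$ on $Y_0$ instead of $Y_1$, or using empty neighborhoods, gives strictly larger values. This is the sense in which the bound is tight when the implied Bayesian network approximates the distribution well.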
The shortcoming is that ideally we would like this Bayesian network to be the actual marginal of the model given by e and N(e, i). It is not: if the network implied by e and N(e, i) was, for instance, $Y_1 \to Y_2 \to Y_3$, the choice of z = (1, 0, 1) would result in the entropy of the disconnected graph $\{Y_1, Y_3\}$, while the true marginal would correspond instead to the graph $Y_1 \to Y_3$. However, our simplified marginalization has the advantage of avoiding an intractable problem. Moreover, it allows us to redefine the problem as an integer linear program (ILP). Each product $z_{e(i)} \prod_j z_j \prod_k (1 - z_k)$ appearing in (3) results in a sum of $O(2^D)$ terms, each of which has (up to a sign) the form $q_M \equiv \prod_{m \in M} z_m$ for some set M. It is still the case that $q_M \in \{0, 1\}$. Therefore, objective function (3) can be interpreted as being linear on a set of binary variables {{z}, {q}}. We need further to enforce the constraints coming from

$$q_M = 1 \Rightarrow \{\forall m \in M, z_m = 1\}; \qquad q_M = 0 \Rightarrow \{\exists m \in M \text{ s.t. } z_m = 0\}.$$

It is well-known (Glover and Woolsey, 1974) that this corresponds to the linear constraints

$$q_M = 1 \Rightarrow \{\forall m \in M, z_m = 1\} \;\Leftrightarrow\; \forall m \in M, \; q_M - z_m \le 0$$
$$q_M = 0 \Rightarrow \{\exists m \in M \text{ s.t. } z_m = 0\} \;\Leftrightarrow\; \sum_{m \in M} z_m - q_M \le |M| - 1$$

which combined with the linear constraint $\sum_{i=1}^{p} z_i = K$ implies that optimizing $F_{M;D}(z)$ is an ILP with $O(p \cdot 2^D)$ variables and $O(p^2 \cdot 2^D)$ constraints. In our experiments in Section 5, we were able to solve essentially all of such ILPs exactly using linear programming relaxations with branch-and-bound.

3.3 Entropy with Tree-Structured Bounds

The previous bound simplifies marginalization, which might badly overestimate entropies where the corresponding $Y_z$ are uniformly spread out in permutation e. We now propose a different type of bound which treats different marginalizations on an equal footing.
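Before moving on, the product linearization used for the ILP of Section 3.2 can be sanity-checked by brute force. The sketch below enumerates every binary assignment of a small set of z variables and confirms that the two families of linear constraints leave exactly one feasible value for q, namely the product of the z entries. The function names are hypothetical, and the check is an illustration rather than part of any solver pipeline described in the paper.

```python
import itertools


def satisfies_linearization(q, z_subset):
    """Linear constraints tying the binary q to the product of z_subset."""
    # q - z_m <= 0 for every m: q = 1 forces every z_m to be 1.
    upper = all(q - z_m <= 0 for z_m in z_subset)
    # sum_m z_m - q <= |M| - 1: if every z_m is 1, q must be 1.
    lower = sum(z_subset) - q <= len(z_subset) - 1
    return upper and lower


def check_all(size):
    """True if, for every binary z of this size, the only feasible q is prod(z)."""
    for z_subset in itertools.product((0, 1), repeat=size):
        feasible = [q for q in (0, 1) if satisfies_linearization(q, z_subset)]
        product = int(all(z_subset))
        if feasible != [product]:
            return False
    return True
```

Because each product term gets its own auxiliary variable and its own small set of constraints, the check for one set M is enough to justify the construction for all of them.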
It comes from the following observation: since $H(Y_{e(i)} \mid Y_{e(1:i-1)})$ is less than or equal to any conditional entropy $H(Y_{e(i)} \mid Y_j)$ for $j \in e(1 : i-1)$, we have that the tightest bound given by singleton conditioning sets is

$$H(Y_{e(i)} \mid Y_{e(1:i-1)}) \le \min_{j \in e(1:i-1)} H_M(Y_{e(i)} \mid Y_j),$$

resulting in the objective function

$$F_{M;tree}(z) \equiv \sum_{i=1}^{p} z_i \langle \log P_M(Y_i \mid X) \rangle_{P_M(X, Y_i)} + \sum_{i=1}^{p} z_{e(i)} \cdot \min_{Y_j \in Y_{e(1:i-1)} \cap Y_z} H(Y_{e(i)} \mid Y_j) \qquad (4)$$

where $\min_{Y_j \in Y_{e(1:i-1)} \cap Y_z} H(Y_{e(i)} \mid Y_j) \equiv H(Y_{e(i)})$ if $Y_{e(1:i-1)} \cap Y_z = \emptyset$. The intuition is that we are bounding the exact entropy using the entropy of a directed tree rooted at $Y_{e_z(1)}$, the first element of $Y_z$ according to e. That is, all variables are marginally dependent in the approximation regardless of what z is, and for a fixed z the tree is, by construction, the one obtained by the usual greedy algorithm of adding edges corresponding to the next legal pair of vertices with maximum mutual information (following an ordering, in this case).

It turns out we can also write (4) as a linear objective function of a polynomial number of 0/1 variables and constraints. Let $\bar{z}_i \equiv 1 - z_i$. Let $H_i^{(1)}, H_i^{(2)}, \ldots, H_i^{(i-1)}$ be the values of the set $\{H_M(Y_{e(i)} \mid Y_{e(1)}), \ldots, H_M(Y_{e(i)} \mid Y_{e(i-1)})\}$ sorted in ascending order, with $z_i^{(1)}, \ldots, z_i^{(i-1)}$ being the corresponding permutation of $\{z_{e(1)}, \ldots, z_{e(i-1)}\}$. We have

$$\min_{Y_j \in Y_{e(1:i-1)} \cap Y_z} H(Y_{e(i)} \mid Y_j) = z_i^{(1)} H_i^{(1)} + \bar{z}_i^{(1)} z_i^{(2)} H_i^{(2)} + \bar{z}_i^{(1)} \bar{z}_i^{(2)} z_i^{(3)} H_i^{(3)} + \cdots + \bar{z}_i^{(1)} \cdots \bar{z}_i^{(i-2)} z_i^{(i-1)} H_i^{(i-1)} + \Big( \prod_{j=1}^{i-1} \bar{z}_i^{(j)} \Big) H_M(Y_{e(i)})$$
$$\equiv \sum_{j=1}^{i-1} q_i^{(j)} H_i^{(j)} + q_i^{(i)} H_M(Y_{e(i)})$$

where $q_i^{(j)} \equiv z_i^{(j)} \prod_{k=1}^{j-1} \bar{z}_i^{(k)}$ for j < i, $q_i^{(i)} \equiv \prod_{k=1}^{i-1} \bar{z}_i^{(k)}$, and each $q_i^{(j)}$ is also a binary 0/1 variable. Plugging this expression into (4) gives a linear objective function in this extended variable space. The corresponding constraints are

$$q_i^{(j)} = 1 \Rightarrow \{\forall z_m \in \{\bar{z}_i^{(1)}, \ldots, \bar{z}_i^{(j-1)}, z_i^{(j)}\}, \; z_m = 1\}$$
$$q_i^{(j)} = 0 \Rightarrow \{\exists z_m \in \{\bar{z}_i^{(1)}, \ldots, \bar{z}_i^{(j-1)}, z_i^{(j)}\} \text{ s.t. } z_m = 0\}$$

which, as shown in the previous section, can be written as linear constraints (substituting each $\bar{z}_i$ by $1 - z_i$). The total number of constraints is however $O(p^3)$, which can be expensive, and often a linear relaxation procedure with branch-and-bound fails to provide guarantees of optimality.

3.4 The Reliability Score

Finally, it is important to design cheap, effective criteria whose maxima correlate with the maxima of $F_M(\cdot)$. Empirically, we have found high quality selections in binary probit models using the solution to the problem

$$\text{maximize } F_{M;R}(z) = \sum_{i=1}^{p} w_i z_i, \quad \text{subject to } z_i \in \{0, 1\}, \; \sum_{i=1}^{p} z_i = K,$$

where $w_i = \Lambda_i^T \Sigma \Lambda_i$. This can be solved by picking the indicators with the K highest weights $w_i$. Assuming a probit model where the measurement error for each $Y_i^\star$ has the same variance of 1, this score is related to the "reliability" of an indicator. Simply put, the reliability $R_i$ of an indicator is the proportion of its variance that is due to the latent variables (Bollen, 1989, Chapter 6): $R_i = w_i / (w_i + 1)$ for each $Y_i^\star$. There is no current theory linking this solution to the problem of maximizing $F_M(\cdot)$: since there is no entropy term, we can set an adversarial problem to easily defeat this method. For instance, this happens in a model where the K indicators of highest reliability all measure the same latent variable $X_i$ and nothing else: much information about $X_i$ would be preserved, but little about other variables. In any case, we found this criterion to be fairly competitive even if at times it produces extreme failures. An honest account of more sophisticated selection mechanisms cannot be performed without including it, as we do in Section 5.

4 Related Work

The literature on survey analysis, in the context of latent variable models, contains several examples of guidelines on how to simplify questionnaires (sometimes described as providing "shortened versions" of scales).
Much of the literature, however, consists of describing general guidelines and rules-of-thumb to accomplish this task (e.g., Richins, 2004; Stanton et al., 2002). One possible exception is Leite et al. (2008), which uses different model fitness criteria with respect to a given dataset to score candidate solutions, along with an expensive combinatorial optimization method. This conflates model selection and questionnaire thinning, and there is no theory linking the score functions to the amount of information preserved.

In the machine learning and statistics literature, there is a large body of research in active learning, which is related to our task. One of the closest approaches is the one by Liang et al. (2009), which casts the classical problem of measurement selection within a Bayesian graphical model perspective. In that work, one has to choose which measurements to add. This is done sequentially, partially motivated by problems where collecting new measurements can be done relatively fast and cheaply (say, by paying graduate students to annotate text data), and so the choice of next measurement can make use of fresh data. In our case, it might not be realistic to expect that we can perform a large number of iterations of data collection, and as such the task of reducing the number of measurements from a large initial collection might be more relevant in practice. Liang et al. also focus on (multivariate) supervised learning instead of purely unsupervised learning.

In statistics there is also a considerable body of literature on sufficient dimension reduction and its sparse variants (e.g., Chen et al., 2010). Such techniques create a bottleneck between two sets of variables in a regression problem (say, the mapping from Y to X) while eliminating some of the input variables. In principle one might want to adapt such models to take a latent variable model M as the target mapping.
Besides some loss of interpretability, the computational implications might be problematic, though. Moreover, this framework has another free parameter corresponding to the dimensionality of the bottleneck that has to be set. It is not clear how this parameter, along with a choice of sparsity level, would interact with a fixed choice K of indicators to be kept.

5 Experiments

In this section, we first describe some synthetic experiments to provide insights about the different methods, followed by one brief description of a case study. In all of the experiments, the target models M are binary probit. We set the neighborhood size bound D for $F_{M;D}(\cdot)$ to 9. The ordering e for the tree-structured method is obtained by the same greedy search of Section 3.2, where now the score is the average of all $H(Y_i \mid Y_j)$ for all $Y_j$ preceding $Y_i$. Finally, all ordering optimization methods were initialized by sorting indicators in a descending order according to their reliability scores, and the initial solution for all entropy-based optimization methods was given by the reliability score solution of Section 3.4. The integer program solver Gurobi 4.02 was used in all experiments.

5.1 Synthetic studies

We start with a batch of synthetic experiments. We generated 80 models with 40 indicators and 10 latent variables (see footnote 1). We further preprocess such models into two groups: in 40 of them, we select a target reliability $r_i$ for each indicator $Y_i$, uniformly at random from the interval [0.4, 0.7]. We then rescale coefficients $\Lambda_i$ such that the reliability (defined in Section 3.4) of the respective $Y_i^\star$ becomes $r_i$. For the remaining 40 models, we sample $r_i$ uniformly at random from the interval [0.2, 0.4]. We perform two choices of subsets: sets $Y_z$ of size 20 and 32 (50% and 80% of the total number of indicators).

Our evaluation is as follows: since the expected value is perhaps the most common functional of the posterior distribution $P_M(X \mid Y)$, we calculate the expected value of the latent variables for a sample $\{y^{(1)}, y^{(2)}, \ldots, y^{(1000)}\}$ of size 1000 taken from the respective synthetic models. This is done for the full set of 40 indicators, and for each set chosen by our four criteria: for each data point i and each objective function F, we evaluate the average distance

$$d_F^{(i)} \equiv \sum_{j=1}^{10} \left| \hat{x}_j^{(i)} - \hat{x}_{j;F}^{(i)} \right| / 10.$$

In this case, $\hat{x}_j^{(i)}$ is the expected value of $X_j$ obtained by conditioning on all indicators, while $\hat{x}_{j;F}^{(i)}$ is the one obtained with the subset selected by optimizing F. We denote by $m_F$ the average of $\{d_F^{(1)}, d_F^{(2)}, \ldots, d_F^{(1000)}\}$. Finally, we compare the three main methods with respect to the reliability score method using the improvement ratio statistic $s_F = 1 - m_F / m_{F_{M;R}}$, the proportion of average error decrease with respect to the reliability score. In order to provide a sense of scale on the difficulty of each problem, we compute the same ratios with respect to a random selection, obtained by choosing K = 20 and K = 32 indicators uniformly at random.

Figure 2 provides a summary of the results. In Figure 2(a), each boxplot shows the distribution over the 40 probit models where reliabilities were sampled from [0.4, 0.7] (the "high signal" models). The first three boxplots show the scores $s_F$ of the bounded neighborhood, tree-structured and Gaussian methods, respectively, compared against random selections. The last three boxplots are comparisons against the reliability heuristic. The tree-based method easily beats the Gaussian method, with about 75% of its outcomes being better than the median Gaussian outcome. The Gaussian approach is also less reliable, with results showing a long lower tail. Although the reliability score is on average a good approach, in only a handful of cases was it better than the tree-based method, and by considerably smaller magnitudes compared to the upper tails in the tree-based outcome distribution.

A separate panel (Figure 2(b)) is shown for the 40 models with lower reliabilities. In this case, all methods show stronger improvements over the reliability score, although now there is a less clear difference between the tree method and the Gaussian one. Finally, in panels (c) and (d) we present scatterplots for the average deviances $m_F$ of the tree-based method against the reliability score. The two clouds correspond to the solutions with 20 and 32 indicators. Notice that in the vast majority of the cases the tree-based method does better.

[Figure 2. Panel titles: (a) "Improvement ratio: high signal", (b) "Improvement ratio: low signal", (c) "Mean error: high signal", (d) "Mean error: low signal". Panels (a) and (b): boxplots labeled N/R, T/R, G/R, N/S, T/S, G/S. Panels (c) and (d): horizontal axis "Tree bound", vertical axis "Reliability score".]

Figure 2: (a) A comparison of the bounded neighborhood (N), tree-based (T) and Gaussian (G) methods with respect to a random solution (R) and the reliability score (S). (b) A similar comparison for models where indicators are more weakly correlated to the latent variables than in (a). (c) and (d) Scatterplots of the average absolute deviance for the tree-based method (horizontal axis) against the reliability method (vertical axis). The bottom-left clouds correspond to the K = 32 trials.

Footnote 1. Details on the model generation: we generate 40 models by sampling the latent covariance matrix from an inverse Wishart distribution with 10 degrees of freedom and scale matrix 10I, I being the identity matrix. We then rescale the matrix to make all variances equal to 1. We also generate 40 models using as the inverse Wishart scale matrix the correlation matrix with all off-diagonal entries set to 0.5. Coefficients linking indicators to latent variables were set to zero with probability 0.8, and sampled from a standard Gaussian otherwise. If some latent variable ends up with no child, or an indicator ends up with no parent, we uniformly choose one child/parent to be linked to it. Code to fully replicate the synthetic experiments is available at http://www.homepages.ucl.ac.uk/~ucgtrbd/.

5.2 Case study

The National Health Service (NHS) is the public health system of the United Kingdom. In 2009, a major survey called the National Health Service National Staff Survey was deployed with the goal of "collect(ing) staff views about working in their local NHS trust" (Care Quality Commission and Aston University, 2010). A questionnaire of 206 items was filled in by 156,951 respondents. We designed a measurement model based on the structure of the questionnaire and fit it by the posterior expected value estimator. Gaussian and inverse Wishart priors were used, along with Gibbs sampling and a random subset of 50,000 respondents. See the Supplementary Material for more details. Several items in this survey asked for the NHS staff member to provide degrees of agreement in a Likert scale (Bartholomew et al., 2008) to questions such as

• . . . have you ever come to work despite not feeling well enough to perform . . . ?
• Have you felt pressure from your manager to come to work?
• Have you felt pressure from colleagues to come to work?
• Have you put yourself under pressure to come to work?

as different probes into an unobservable self-assessed level of work pressure. We preprocessed and binarized the data to first narrow it down to 63 questions. We compare selections of 32 (50%) and 50 (80%) items using the same statistics of the previous section.
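The evaluation statistics reused here (the per-point distance $d_F^{(i)}$, the mean error $m_F$, and the improvement ratio $s_F$ from Section 5.1) amount to a few lines of arithmetic. The sketch below is a minimal illustration: the function names are hypothetical, and the posterior means are assumed to be supplied by whatever inference routine is used, since their computation is outside the scope of this example. The distance divides by the number of latent variables, which is 10 in the synthetic study.

```python
def average_distance(full_means, subset_means):
    """d_F^(i): mean absolute difference between two posterior-mean vectors."""
    if len(full_means) != len(subset_means):
        raise ValueError("vectors must cover the same latent variables")
    diffs = [abs(a - b) for a, b in zip(full_means, subset_means)]
    return sum(diffs) / len(diffs)


def mean_error(full_means_by_point, subset_means_by_point):
    """m_F: average of d_F^(i) over all sampled data points."""
    distances = [
        average_distance(full, sub)
        for full, sub in zip(full_means_by_point, subset_means_by_point)
    ]
    return sum(distances) / len(distances)


def improvement_ratio(m_f, m_reference):
    """s_F = 1 - m_F / m_reference, the relative error decrease."""
    return 1.0 - m_f / m_reference
```

A positive ratio means the method under test reconstructs the full-questionnaire posterior means more closely than the reference selection; a negative ratio means it does worse.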
          s_{F;D}    s_{F;tree}    s_{F;N}    s_{F;random}    m_{F;tree}    m_{F;R}
K = 32     7.8%        6.3%        0.01%       −16.0%          0.238        0.255
K = 50    10.5%       11.9%        7.6%        −0.05%          0.123        0.140

Although gains were relatively small (as measured by the difference between reconstruction errors $m_{F;tree} - m_{F;R}$ and the good performance of a random selection), we showed that: i.) we do improve results over a popular measure of indicator quality; ii.) we do provide some guarantees about the diversity of the selected items via an information-theoretical measure with an entropy term, with theoretically sound approximations to such a measure. For more details on the preprocessing, and more insights into the different selections, please refer to the Supplementary Material.

6 Conclusion

There are problems where one posits that the relevant information is encoded in the posterior distribution of a set of latent variables. Questionnaires (and other instruments) can be used as evidence to generate this posterior, but there is a cost associated with complex questionnaires. One problem is how to simplify such instruments of measurement. To the best of our knowledge, we provide the first formal account on how to solve it. Nevertheless, we would like to stress there is no substitute for common sense. While the tools we provide here can be used for a variety of analyses, from deploying simpler questionnaires to sensitivity analysis, the value and cost of keeping particular indicators can go much beyond the information contained in the latent posterior distribution. How to combine this criterion with other domain-dependent criteria is a matter of future research. Another problem of importance is how to deal with model specification and transportability across studies. A measurement model built for a very specific population of respondents might transfer poorly to another group, and therefore taking into account model uncertainty will be important. The Bayesian setup discussed by Liang et al. (2009) might provide some directions on this issue.
Also, there is further structure in real-world questionnaires we are not exploiting in the current work. Namely, it is not uncommon to have questionnaires with branching questions and other dynamic behaviour more commonly associated with Web-based surveys and/or longitudinal studies. Finally, hybrid approaches combining the bounded neighborhood and the tree-structured methods, along with more sophisticated ordering optimization procedures and the use of other divergence measures and determinant-based criteria (e.g. Kulesza and Taskar, 2011), will also be studied in the future.

Acknowledgments

The author would like to thank James Cussens and Simon Lacoste-Julien for helpful discussions, as well as the anonymous reviewers for further comments.

References

D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. Analysis of Multivariate Social Science Data, 2nd edition. Chapman & Hall, 2008.
C. Bishop. Latent variable models. In M. Jordan (editor), Learning in Graphical Models, pages 371–403, 1998.
C. Bishop. Pattern Recognition and Machine Learning. Springer, 2009.
K. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989.
R. Carroll, D. Ruppert, and L. Stefanski. Measurement Error in Nonlinear Models. Chapman & Hall, 1995.
X. Chen, C. Zou, and R. Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. Annals of Statistics, 38:3696–3723, 2010.
Care Quality Commission and Aston University. Aston Business School, National Health Service National Staff Survey, 2009 [computer file]. Colchester, Essex: UK Data Archive [distributor], October 2010. Available at https://www.esds.ac.uk, SN: 6570, 2010.
A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992.
A. Globerson and T. Jaakkola. Approximate inference using conditional entropy decompositions. Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), pages 141–149, 2007.
F. Glover and E. Woolsey. Converting the 0-1 polynomial programming problem to a 0-1 linear program. Operations Research, 22:180–182, 1974.
P. Hahn, J. Scott, and C. Carvalho. A sparse factor-analytic probit model for congressional voting patterns. Duke University Department of Statistical Science, Discussion Paper 2009-22, 2010.
T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian network structure using LP relaxations. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 366–373, 2010.
A. Kulesza and B. Taskar. k-DPPs: fixed-size determinantal point processes. Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1193–1200, 2011.
W. Leite, I-C. Huang, and G. Marcoulides. Item selection for the development of short forms of scales using an ant colony optimization algorithm. Multivariate Behavioral Research, 43:411–431, 2008.
P. Liang, M. Jordan, and D. Klein. Learning from measurements in exponential families. Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), 2009.
T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, Massachusetts Institute of Technology, 2001.
J. Palomo, D. Dunson, and K. Bollen. Bayesian structural equation modeling. In Sik-Yum Lee (ed.), Handbook of Latent Variable and Related Models, pages 163–188, 2007.
M. Richins. The material values scale: Measurement properties and development of a short form. The Journal of Consumer Research, 31:209–219, 2004.
J. Stanton, E. Sinar, W. Balzer, and P. Smith. Issues and strategies for reducing the length of self-reported scales. Personnel Psychology, 55:167–194, 2002.
M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Proceedings of the Twenty-first Conference on Uncertainty in AI (UAI '05), pages 584–590, 2005.

6 0.43561694 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions

7 0.40775415 271 nips-2011-Statistical Tests for Optimization Efficiency

8 0.40601221 181 nips-2011-Multiple Instance Learning on Structured Data

9 0.40432259 232 nips-2011-Ranking annotators for crowdsourced labeling tasks

10 0.39845785 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

11 0.39064243 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

12 0.39008054 276 nips-2011-Structured sparse coding via lateral inhibition

13 0.38041183 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

14 0.37377572 277 nips-2011-Submodular Multi-Label Learning

15 0.37114856 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers

16 0.37070164 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

17 0.36879644 240 nips-2011-Robust Multi-Class Gaussian Process Classification

18 0.36753449 111 nips-2011-Hashing Algorithms for Large-Scale Learning

19 0.36324477 33 nips-2011-An Exact Algorithm for F-Measure Maximization

20 0.36066055 180 nips-2011-Multiple Instance Filtering

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (4, 0.058), (17, 0.358), (20, 0.027), (26, 0.016), (31, 0.056), (33, 0.02), (43, 0.064), (45, 0.089), (57, 0.023), (65, 0.014), (74, 0.04), (83, 0.069), (99, 0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74250036 27 nips-2011-Advice Refinement in Knowledge-Based SVMs

Author: Gautam Kunapuli, Richard Maclin, Jude W. Shavlik

2 0.63095182 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Gilles Blanchard

Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^(−α/(1+α))), where α is the minimum eigenvalue decay rate of the individual kernels.

3 0.57763022 276 nips-2011-Structured sparse coding via lateral inhibition

Author: Arthur D. Szlam, Karol Gregor, Yann LeCun

Abstract: This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified.

4 0.40664539 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

Author: Kamiar R. Rad, Liam Paninski

Abstract: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design.

5 0.40570155 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results.

6 0.40547672 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

7 0.4027811 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

8 0.40240937 258 nips-2011-Sparse Bayesian Multi-Task Learning

9 0.40178829 102 nips-2011-Generalised Coupled Tensor Factorisation

10 0.40043849 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

11 0.39986721 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data

12 0.39884526 5 nips-2011-A Denoising View of Matrix Completion

13 0.3983534 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

14 0.39779598 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

15 0.39743507 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

16 0.39716521 150 nips-2011-Learning a Distance Metric from a Network

17 0.39698729 227 nips-2011-Pylon Model for Semantic Segmentation

18 0.39660504 29 nips-2011-Algorithms and hardness results for parallel large margin learning

19 0.39650798 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

20 0.39649409 219 nips-2011-Predicting response time and error rates in visual search