nips nips2011 nips2011-60 knowledge-graph by maker-knowledge-mining

60 nips-2011-Confidence Sets for Network Structure


Source: pdf

Author: David S. Choi, Patrick J. Wolfe, Edoardo M. Airoldi

Abstract: Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of residual network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximum-likelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. [sent-9, score-0.266]

2 We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. [sent-11, score-0.698]

3 We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. [sent-12, score-0.549]
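
The exact parameterization of this logistic model is not reproduced on this page; purely as an illustrative sketch (all names hypothetical), the function below assumes one plausible form: a logistic link combining a block affinity θ[z_i, z_j], node-specific effects α_i + α_j, and a dyadic covariate term β·x_ij. Setting K = 1 (a single block) drops the block term and gives a covariates-plus-degree baseline of the kind discussed further below.

    import numpy as np

    def edge_probability(i, j, z, theta, alpha, beta, x_ij):
        """Hypothetical logistic edge model combining block, node, and covariate effects.

        z     : latent block labels, one per node
        theta : K x K matrix of block affinities (K = 1 gives a no-block baseline)
        alpha : node-specific effects capturing degree heterogeneity
        beta  : coefficients for the dyadic covariates
        x_ij  : covariate vector for the dyad (i, j), e.g. same-grade/same-race indicators
        """
        eta = theta[z[i], z[j]] + alpha[i] + alpha[j] + float(np.dot(beta, x_ij))
        return 1.0 / (1.0 + np.exp(-eta))  # P(A_ij = 1) under this sketch model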

4 1 Introduction Network datasets comprising edge measurements Aij ∈ {0, 1} of a binary, symmetric, and antireflexive relation on a set of n nodes, 1 ≤ i < j ≤ n, are fast becoming of paramount interest in the statistical analysis and data mining literatures [1]. [sent-14, score-0.212]

5 A common aim of many models for such data is to test for and explain the presence of network structure, primary examples being communities and blocks of nodes that are equivalent in some formal sense. [sent-15, score-0.315]

6 Algorithmic formulations of this problem take varied forms and span many literatures, touching on subjects such as statistical physics [2, 3], theoretical computer science [4], economics [5], and social network analysis [6]. [sent-16, score-0.446]

7 One popular modeling assumption for network data is to assume dyadic independence of the edge measurements when conditioned on a set of latent variables [7, 8, 9, 10]. [sent-17, score-0.38]

8 The number of latent parameters in such models generally increases with the size of the graph, however, meaning that computationally intensive fitting algorithms may be required and standard consistency results may not always hold. [sent-18, score-0.112]

9 As a result, it can often be difficult to assess statistical significance or quantify the uncertainty associated with parameter estimates. [sent-19, score-0.099]

10 Confidence sets are a standard statistical tool for uncertainty quantification, but they are not yet well developed for network data. [sent-21, score-0.292]

11 In this paper, we propose a family of confidence sets for network structure that apply under the assumption of a Bernoulli product likelihood. [sent-22, score-0.33]

12 The form of these sets stems from a stochastic blockmodel formulation which reflects the notion of latent nodal classes, and they provide a new tool for the analysis of estimated or algorithmically determined network structure. [sent-23, score-0.737]

13 We demonstrate usage of the confidence sets by analyzing a sample of 26 adolescent friendship networks from the National Longitudinal Survey of Adolescent Health (available at http://www. [sent-24, score-0.668]

14 edu/addhealth), using a baseline model that only includes explanatory covariates and heterogeneity in the nodal degrees. [sent-27, score-0.468]

15 We employ these confidence sets to validate departures from this baseline model taking the form of residual community structure. [sent-28, score-0.528]

16 Though the confidence sets we employ are conservative, we show that they are effective in identifying putative residual structure in these friendship network data. [sent-29, score-0.729]

17 2 Model Specification and Inference We represent network data via a sociomatrix A ∈ {0, 1}^{N×N} that reflects the adjacency structure of a simple, undirected graph on N nodes. [sent-30, score-0.325]
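
As a minimal, self-contained illustration of the Bernoulli product likelihood over the dyads of such a sociomatrix (the inputs A and P below are hypothetical), the following sketch evaluates the log-likelihood of a symmetric binary adjacency matrix under dyad-wise independent edge probabilities:

    import numpy as np

    def bernoulli_product_loglik(A, P, eps=1e-12):
        """Log-likelihood of a simple undirected graph A under independent
        Bernoulli edge probabilities P; both are N x N symmetric arrays."""
        A = np.asarray(A, dtype=float)
        P = np.clip(np.asarray(P, dtype=float), eps, 1.0 - eps)
        iu = np.triu_indices(A.shape[0], k=1)  # each dyad i < j counted once
        a, p = A[iu], P[iu]
        return float(np.sum(a * np.log(p) + (1.0 - a) * np.log(1.0 - p)))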

18 We observe that in each partition, the number of apparently visible communities exceeds K, and they are comprised of small numbers of students. [sent-32, score-0.086]

19 This effect is due to the intersection of grade and z-induced clustering. [sent-33, score-0.204]

20 We take as our definition of nominal value the quantity Φ^(z) computed under the baseline model, which we denote by Φ̄^(z). [sent-34, score-0.067]

21 Table 2 lists normalized divergence terms N⁻¹ ∑_{a≤b} n_ab D(Φ̂_ab^(z) || Φ_ab^(z)), Bonferroni-corrected 95% confidence bounds, and measures of alignment between the corresponding partitions z and the explanatory variables. [sent-35, score-0.393]
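
To make the divergence statistic concrete, the rough sketch below (illustrative names only, not the authors' code) computes blockwise empirical edge frequencies Φ̂_ab for a partition z, the Bernoulli Kullback-Leibler divergence, and the normalized sum N⁻¹ ∑_{a≤b} n_ab D(Φ̂_ab || Φ_ab). The reference values Φ_ab, for example the nominal values from the baseline model, are assumed to be supplied as an input.

    import numpy as np

    def bernoulli_kl(p, q, eps=1e-12):
        """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
        p = np.clip(p, eps, 1.0 - eps)
        q = np.clip(q, eps, 1.0 - eps)
        return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

    def normalized_divergence(A, z, phi_ref):
        """N^{-1} * sum_{a <= b} n_ab * D(phi_hat_ab || phi_ref_ab) for partition z.

        A       : N x N symmetric binary adjacency matrix
        z       : length-N array of block labels in {0, ..., K-1}
        phi_ref : K x K matrix of reference (e.g. baseline/nominal) edge probabilities
        """
        A, z = np.asarray(A, dtype=float), np.asarray(z)
        N, K = A.shape[0], int(z.max()) + 1
        total = 0.0
        for a in range(K):
            for b in range(a, K):
                ia, ib = np.where(z == a)[0], np.where(z == b)[0]
                if a == b:
                    pairs = [(i, j) for idx, i in enumerate(ia) for j in ia[idx + 1:]]
                else:
                    pairs = [(i, j) for i in ia for j in ib]
                if not pairs:
                    continue
                phi_hat = float(np.mean([A[i, j] for i, j in pairs]))
                total += len(pairs) * bernoulli_kl(phi_hat, phi_ref[a, b])
        return total / N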

22 The Jaccard similarity coefficient is defined as |A ∩ B|/|A ∪ B|, where A, B ⊂ N are the student pairings sharing the same latent class or the same covariate value, respectively. [sent-38, score-0.297]
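
A minimal sketch of this pair-set Jaccard computation, assuming two label vectors over the same students (for example, latent classes versus grade); the function names are illustrative:

    from itertools import combinations

    def same_label_pairs(labels):
        """Set of unordered node pairs whose two members share the same label."""
        return {(i, j) for i, j in combinations(range(len(labels)), 2)
                if labels[i] == labels[j]}

    def jaccard_pair_similarity(labels_a, labels_b):
        """|A ∩ B| / |A ∪ B| over the pair sets induced by two labelings."""
        A, B = same_label_pairs(labels_a), same_label_pairs(labels_b)
        union = A | B
        return len(A & B) / len(union) if union else 1.0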

23 Variance ratio denotes the within-class degree variance divided by the total variance, averaged over all classes. [sent-40, score-0.053]
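
The variance-ratio diagnostic admits a short sketch under one plausible reading of this definition (the paper may weight classes differently):

    import numpy as np

    def degree_variance_ratio(A, z):
        """Within-class degree variance over total degree variance, averaged over classes."""
        degrees = np.asarray(A).sum(axis=1)
        z = np.asarray(z)
        total_var = degrees.var()
        if total_var == 0:
            return float("nan")
        within = [degrees[z == c].var() for c in np.unique(z)]
        return float(np.mean(within) / total_var)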

24 Table 2: Block structure assessments corresponding to Fig. [sent-126, score-0.079]

25 Small Jaccard coefficient values (for gender, race, and grade) and variance ratios approaching 1 for degree indicate a lack of alignment with covariates and hence the identification of residual structure in the corresponding partition. [sent-128, score-0.554]

26 4 Concluding Remarks In this article we have developed confidence sets for assessing inferred network structure, by leveraging our result derived in [14]. [sent-130, score-0.297]

27 We explored the use of these confidence sets with an application to the analysis of Adolescent Health survey data comprising friendship networks from 26 schools. [sent-131, score-0.519]

28 In lieu of a parametric model, we assume dyadic independence with Bernoulli parameters {Pji }. [sent-133, score-0.122]

29 We introduced a baseline model (K = 1) that incorporates degree and covariate effects, without block structure. [sent-134, score-0.318]

30 Algorithm 1 was then used to find highly assortative partitions of students which are also far from partitions induced by the explanatory covariates in the baseline model. [sent-135, score-0.673]

31 Differences in assortativity were quantified by an empirical divergence statistic, which was compared to an upper bound computed from Eq. [sent-136, score-0.048]

32 (3) to check for significance and to generate confidence sets for {Pij }. [sent-137, score-0.064]

33 Although the bound of Eq. (3) is known to be loose, simulation results in Figure 1 suggest that the slack is moderate, leading to useful confidence sets in practice. [sent-139, score-0.064]

34 In our procedure, we cannot quantify the uncertainty associated with the estimated baseline model, since the parameter estimates lack consistency. [sent-140, score-0.142]

35 However, for a baseline model where the MLE is known to be consistent, we conjecture that such a hypothesis test should be possible by incorporating the confidence set associated with the MLE. [sent-142, score-0.101]

36 Despite concerns regarding estimator consistency in this and other latent variable models, we were able to show that the notion of confidence sets may instead be used to provide a (conservative) measure of residual block structure. [sent-143, score-0.422]

37 We note that many open questions remain, and are hopeful that this analysis may help to shed light on some important current issues facing practitioners and theorists alike in statistical network analysis. [sent-144, score-0.261]

38 2, with nodes ordered by grade (solid lines) and corresponding latent classes (dotted lines). [sent-146, score-0.358]

39 Airoldi, “A survey of statistical network models”, Foundations and Trends in Machine Learning, vol. [sent-154, score-0.26]

40 Newman, “The structure and function of complex networks”, SIAM Review, vol. [sent-168, score-0.079]

41 Nowicki, “Estimation and prediction for stochastic blockmodels for graphs with latent block structure”, J. [sent-191, score-0.448]

42 Hoff, “Multiplicative latent factor models for description and prediction of social networks”, Computational Math. [sent-228, score-0.24]

43 Porter, “Comparing community structure to characteristics in online collegiate social networks”, SIAM Rev. [sent-253, score-0.334]

44 Chen, “A nonparametric view of network models and Newman-Girvan and other modularities”, Proc. [sent-258, score-0.187]

45 Airoldi, “Stochastic blockmodels with growing numbers of classes”, Biometrika, 2011, to appear. [sent-274, score-0.139]

46 Yu, “Spectral clustering and the high-dimensional stochastic blockmodel”, Ann. [sent-278, score-0.068]

47 Pierre, “Consistency of maximum-likelihood and variational estimators in the stochastic block model”, Arxiv preprint 1105. [sent-285, score-0.202]

48 Levina, and MEJ Newman, “Robustness of community structure in networks”, Phys. [sent-289, score-0.206]

49 Doye, “Thermodynamics of community structure”, Arxiv preprint cond-mat/0610077, 2006. [sent-300, score-0.17]

50 Kirman, “Identifying community structures from network data via maximum likelihood methods”, B. [sent-305, score-0.314]

51 Weaver, “Statistical analysis of binary relational data: parameter estimation”, J. [sent-334, score-0.045]

52 Hoff, “Modeling homophily and stochastic equivalence in symmetric relational data”, in Adv. [sent-343, score-0.154]

53 Morris, “Birds of a feather, or friend of a friend? [sent-352, score-0.07]

54 using exponential random graph models to investigate adolescent social networks”, Demography, vol. [sent-353, score-0.36]

55 Vicsek, “Community structure and ethnic preferences in school friendship networks”, Physica A. [sent-363, score-0.583]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('school', 0.3), ('adolescent', 0.232), ('dence', 0.23), ('grade', 0.204), ('airoldi', 0.204), ('friendship', 0.204), ('network', 0.187), ('covariates', 0.175), ('residual', 0.155), ('blockmodel', 0.15), ('blockmodels', 0.139), ('jacaard', 0.139), ('social', 0.128), ('community', 0.127), ('networks', 0.119), ('con', 0.115), ('literatures', 0.112), ('latent', 0.112), ('explanatory', 0.111), ('baseline', 0.101), ('race', 0.1), ('newman', 0.096), ('harvard', 0.096), ('partitions', 0.094), ('leinhardt', 0.093), ('alignment', 0.092), ('block', 0.091), ('bernoulli', 0.088), ('gender', 0.086), ('communities', 0.086), ('dyadic', 0.081), ('fienberg', 0.081), ('hoff', 0.081), ('nodal', 0.081), ('structure', 0.079), ('health', 0.079), ('conservative', 0.075), ('student', 0.075), ('wolfe', 0.075), ('algorithmically', 0.075), ('natl', 0.075), ('covariate', 0.073), ('survey', 0.073), ('holland', 0.07), ('friend', 0.07), ('stochastic', 0.068), ('nominal', 0.067), ('wasserman', 0.067), ('longitudinal', 0.067), ('sets', 0.064), ('jackson', 0.064), ('pij', 0.059), ('adjacency', 0.059), ('comprising', 0.059), ('patrick', 0.059), ('cance', 0.059), ('assess', 0.058), ('students', 0.057), ('quanti', 0.056), ('choi', 0.055), ('june', 0.053), ('degree', 0.053), ('economics', 0.051), ('aij', 0.05), ('usage', 0.049), ('ab', 0.048), ('divergence', 0.048), ('article', 0.046), ('relational', 0.045), ('preprint', 0.043), ('nodes', 0.042), ('uncertainty', 0.041), ('zheng', 0.041), ('haberman', 0.041), ('weaver', 0.041), ('goldenberg', 0.041), ('assortative', 0.041), ('chatterjee', 0.041), ('cooper', 0.041), ('gonz', 0.041), ('homophily', 0.041), ('karrer', 0.041), ('lieu', 0.041), ('mucha', 0.041), ('paramount', 0.041), ('porter', 0.041), ('raftery', 0.041), ('rohe', 0.041), ('signifying', 0.041), ('touching', 0.041), ('traud', 0.041), ('validate', 0.041), ('employ', 0.04), ('physics', 0.039), ('cambridge', 0.039), ('graphs', 0.038), ('pairings', 0.037), ('jaccard', 0.037), ('alike', 0.037), ('facing', 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 60 nips-2011-Confidence Sets for Network Structure

Author: David S. Choi, Patrick J. Wolfe, Edoardo M. Airoldi

Abstract: Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of residual network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximum-likelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure.

2 0.092617184 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

Author: Jun Zhu, Ning Chen, Eric P. Xing

Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.

3 0.08547423 217 nips-2011-Practical Variational Inference for Neural Networks

Author: Alex Graves

Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However, the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus.

4 0.085453957 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks

Author: Duy Q. Vu, David Hunter, Padhraic Smyth, Arthur U. Asuncion

Abstract: The development of statistical models for continuous-time longitudinal network data is of increasing interest in machine learning and social science. Leveraging ideas from survival and event history analysis, we introduce a continuous-time regression modeling framework for network event data that can incorporate both time-dependent network statistics and time-varying regression coefficients. We also develop an efficient inference scheme that allows our approach to scale to large networks. On synthetic and real-world data, empirical results demonstrate that the proposed inference approach can accurately estimate the coefficients of the regression model, which is useful for interpreting the evolution of the network; furthermore, the learned model has systematically better predictive performance compared to standard baseline methods.

5 0.078597091 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales.

6 0.072935641 150 nips-2011-Learning a Distance Metric from a Network

7 0.071961969 285 nips-2011-The Kernel Beta Process

8 0.069387257 302 nips-2011-Variational Learning for Recurrent Spiking Networks

9 0.068971999 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices

10 0.068526715 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

11 0.065302357 250 nips-2011-Shallow vs. Deep Sum-Product Networks

12 0.064498089 258 nips-2011-Sparse Bayesian Multi-Task Learning

13 0.058396041 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

14 0.057179593 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

15 0.054830402 132 nips-2011-Inferring Interaction Networks using the IBP applied to microRNA Target Prediction

16 0.052001458 186 nips-2011-Noise Thresholds for Spectral Clustering

17 0.051159363 306 nips-2011-t-divergence Based Approximate Inference

18 0.050679702 288 nips-2011-Thinning Measurement Models and Questionnaire Design

19 0.049387366 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions

20 0.048111949 301 nips-2011-Variational Gaussian Process Dynamical Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.15), (1, 0.014), (2, 0.024), (3, -0.028), (4, -0.025), (5, -0.104), (6, -0.005), (7, -0.02), (8, 0.055), (9, -0.027), (10, -0.038), (11, -0.038), (12, -0.048), (13, -0.036), (14, -0.001), (15, -0.005), (16, -0.043), (17, -0.022), (18, 0.063), (19, -0.028), (20, 0.016), (21, 0.0), (22, -0.082), (23, 0.065), (24, 0.009), (25, -0.047), (26, 0.023), (27, -0.012), (28, 0.012), (29, 0.11), (30, 0.126), (31, -0.035), (32, -0.019), (33, 0.081), (34, 0.007), (35, -0.015), (36, -0.022), (37, 0.067), (38, -0.008), (39, -0.088), (40, -0.002), (41, 0.05), (42, 0.129), (43, 0.048), (44, -0.02), (45, 0.068), (46, -0.013), (47, 0.02), (48, 0.0), (49, 0.114)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96450806 60 nips-2011-Confidence Sets for Network Structure

Author: David S. Choi, Patrick J. Wolfe, Edoardo M. Airoldi

Abstract: Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of residual network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximum-likelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure.

2 0.66056508 132 nips-2011-Inferring Interaction Networks using the IBP applied to microRNA Target Prediction

Author: Hai-son P. Le, Ziv Bar-joseph

Abstract: Determining interactions between entities and the overall organization and clustering of nodes in networks is a major challenge when analyzing biological and social network data. Here we extend the Indian Buffet Process (IBP), a nonparametric Bayesian model, to integrate noisy interaction scores with properties of individual entities for inferring interaction networks and clustering nodes within these networks. We present an application of this method to study how microRNAs regulate mRNAs in cells. Analysis of synthetic and real data indicates that the method improves upon prior methods, correctly recovers interactions and clusters, and provides accurate biological predictions. 1

3 0.65184355 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks

Author: Duy Q. Vu, David Hunter, Padhraic Smyth, Arthur U. Asuncion

Abstract: The development of statistical models for continuous-time longitudinal network data is of increasing interest in machine learning and social science. Leveraging ideas from survival and event history analysis, we introduce a continuous-time regression modeling framework for network event data that can incorporate both time-dependent network statistics and time-varying regression coefficients. We also develop an efficient inference scheme that allows our approach to scale to large networks. On synthetic and real-world data, empirical results demonstrate that the proposed inference approach can accurately estimate the coefficients of the regression model, which is useful for interpreting the evolution of the network; furthermore, the learned model has systematically better predictive performance compared to standard baseline methods.

4 0.61549443 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

Author: Jun Zhu, Ning Chen, Eric P. Xing

Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.

5 0.53544974 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

Author: Omar Z. Khan, Pascal Poupart, John-mark M. Agosta

Abstract: In this paper, we derive a method to refine a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test’s relative diagnostic value. We demonstrate that consistency with an expert’s test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to find a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario. 1 INTRODUCTION The problem of learning-by-example has the promise to create strong models from a restricted number of cases; certainly humans show the ability to generalize from limited experience. Machine Learning has seen numerous approaches to learning task performance by imitation, going back to some of the approaches to inductive learning from examples [14]. Of particular interest are problemsolving tasks that use a model to infer the source, or cause of a problem from a sequence of investigatory steps or tests. The specific example we adopt is a diagnostic task such as appears in medicine, electro-mechanical fault isolation, customer support and network diagnostics, among others. We define a diagnostic sequence as consisting of the assignment of values to a subset of tests. The diagnostic process embodies the choice of the best next test to execute at each step in the sequence, by measuring the diagnostic value among the set of available tests at each step, that is, the ability of a test to distinguish among the possible causes. One possible implementation with which to carry out this process, the one we apply, is a Bayes network [9]. As with all model-based approaches, provisioning an adequate model can be daunting, resulting in a “knowledge elicitation bottleneck.” A recent approach for easing the bottleneck grew out of the realization that the best time to gain an expert’s insight into the model structure is during the diagnostic process. Recent work in “QueryBased Diagnostics” [1] demonstrated a way to improve model quality by merging model use and model building into a single process. More precisely the expert can take steps to modify the network structure to add or remove nodes or links, interspersed within the diagnostic sequence. In this paper we show how to extend this variety of learning-by-example to include also refinement of model parameters based on the expert’s choice of test, from which we determine constraints. The nature of these constraints, as shown herein, is derived from the value of the tests to distinguish causes, a value referred to informally as value of information [10]. It is the effect of these novel constraints on network parameter learning that is elucidated in this paper. ∗ J. M. Agosta is no longer affiliated with Intel Corporation 1 Conventional statistical learning approaches are not suited to this problem, since the number of cases available from diagnostic sessions is small, and the data from any case is sparse. (Only a fraction of the tests are taken.) But more relevant is that one diagnostic sequence from an expert user represents the true behavior expected of the model, rather than a noisy realization of a case generated by the true model. 
We adopt a Bayesian approach, which offers a principled way to incorporate knowledge (constraints and data, when available) and also consider weakening the constraints, by applying a likelihood to them, so that possibly conflicting constraints can be incorporated consistently. Sec. 2 reviews related work and Sec. 3 provides some background on diagnostic networks and model consistency. Then, Sec. 4 describes an augmented Bayesian network that incorporates constraints implied by an expert’s choice of tests. Some sampling techniques are proposed to find the Maximum a posterior setting of the parameters given the constraints (and any data available). The approach is evaluated in Sec. 5 on synthetic data and a real world manufacturing diagnostic scenario. Finally, Sec. 6 discusses some future work. 2 RELATED WORK Parameter learning for Bayesian networks can be viewed as searching in a high-dimensional space. Adopting constraints on the parameters based on some domain knowledge is a way of pruning this search space and learning the parameters more efficiently, both in terms of data needed and time required. Qualitative probabilistic networks [17] allow qualitative constraints on the parameter space to be specified by experts. For instance, the influence of one variable on another, or the combined influence of multiple variables on another variable [5] leads to linear inequalities on the parameters. Wittig and Jameson [18] explain how to transform the likelihood of violating qualitative constraints into a penalty term to adjust maximum likelihood, which allows gradient ascent and Expectation Maximization (EM) to take into account linear qualitative constraints. Other examples of qualitative constraints include some parameters being larger than others, bounded in a range, within ϵ of each other, etc. Various proposals have been made that exploit such constraints. Altendorf et al. [2] provide an approximate technique based on constrained convex optimization for parameter learning. Niculescu et al. [15] also provide a technique based on constrained optimization with closed form solutions for different classes of constraints. Feelders [6] provides an alternate method based on isotonic regression while Liao and Ji [12] combine gradient descent with EM. de Campos and Ji [4] also use constrained convex optimization, however, they use Dirichlet priors on the parameters to incorporate any additional knowledge. Mao and Lebanon [13] also use Dirichlet priors, but they use probabilistic constraints to allow inaccuracies in the specification of the constraints. A major difference between our technique and previous work is on the type of constraints. Our constraints do not need to be explicitly specified by an expert. Instead, we passively observe the expert and learn from what choices are made and not made [16]. Furthermore, as we shall show later, our constraints are non-convex, preventing the direct application of existing techniques that assume linear or convex functions. We use Beta priors on the parameters, which can easily be extended to Dirichlet priors like previous work. We incorporate constraints in an augmented Bayesian network, similar to Liang et al. [11], though their constraints are on model predictions as opposed to ours which are on the parameters of the network. Finally, we also use the notion of probabilistic constraints to handle potential mistakes made by experts. 
3 3.1 BACKGROUND DIAGNOSTIC BAYES NETWORKS We consider the class of bipartite Bayes networks that are widely used as diagnostic models, though our approach can be used for networks with any structure. The network forms a sparse, directed, causal graph, where arcs go from causes to observable node variables. We use upper case to denote random variables; C for causes, and T for observables (tests). Lower case letters denote values in the domain of a variable, e.g. c ∈ dom(C) = {c, c}, and bold letters denote sets of variables. A ¯ set of marginally independent binary-valued node variables C with distributions Pr(C) represent unobserved causes, and condition the remaining conditionally independent binary-valued test vari2 able nodes T. Each cause conditions one or more tests; likewise each test is conditioned by one or more causes, resulting in a graph with one or more possibly multiply-connected components. The test variable distributions Pr(T |C) incorporate the further modeling assumption of Independence of Causal Influence, the most familiar example being the Noisy-Or model [8]. To keep the exposition simple, we assume that all variables are binary and that conditional distributions are parametrized by the Noisy-Or; however, the algorithms described in the rest of the paper generalize to any discrete non-binary variable models. Conventionally, unobserved tests are ranked in a diagnostic Bayes network by their Value Of Information (VOI) conditioned on tests already observed. To be precise, VOI is the expected gain in utility if the test were to be observed. The complete computation requires a model equivalent to a partially observable Markov decision process. Instead, VOI is commonly approximated by a greedy computation of the Mutual Information between a test and the set of causes [3]. In this case, it is easy to show that Mutual Information is in turn well approximated to second order by the Gini impurity [7] as shown in Equation 1. ] [∑ ∑ GI(C|T ) = Pr(T = t) Pr(C = c|T = t)(1 − Pr(C = c|T = t)) (1) t c We will use the Gini measure as a surrogate for VOI, as a way to rank the best next test in the diagnostic sequence. 3.2 MODEL CONSISTENCY A model that is consistent with an expert would generate Gini impurity rankings consistent with the expert’s diagnostic sequence. We interpret the expert’s test choices as implying constraints on Gini impurity rankings between tests. To that effect, [1] defines the notion of Cause Consistency and Test Consistency, which indicate whether the cause and test orderings induced by the posterior distribution over causes and the VOI of each test agree with an expert’s observed choice. Assuming that the expert greedily chooses the most informative test T ∗ (i.e., test that yields the lowest Gini impurity) at each step, then the model is consistent with the expert’s choices when the following constraints are satisfied: GI(C|T ∗ ) ≤ GI(C|Ti ) ∀i (2) We demonstrate next how to exploit these constraints to refine the Bayes network. 4 MODEL REFINEMENT Consider a simple diagnosis example with two possible causes C1 and C2 and two tests T1 and T2 as shown in Figure 1. To keep the exposition simple, suppose that the priors for each cause are known (generally separate data is available to estimate these), but the conditional distribution of each test is unknown. Using the Noisy-OR parameterizations for the conditional distributions, the number of parameters are linear in the number of parents instead of exponential. 
∏ i i Pr(Ti = true|C) = 1 − (1 − θ0 ) (1 − θj ) (3) j|Cj =true i Here, θ0 = Pr(Ti = true|Cj = f alse ∀j) is the leak probability that Ti will be true when none of i the causes are true and θj = Pr(Ti = true|Cj = true, Ck = f alse ∀k ̸= j) is the link reliability, which indicates the independent contribution of cause Cj to the probability that test Ti will be true. In the rest of this section, we describe how to learn the θ parameters while respecting the constraints implied by test consistency. 4.1 TEST CONSISTENCY CONSTRAINTS Suppose that an expert chooses test T1 instead of test T2 during the diagnostic process. This ordering by the expert implies that the current model (parametrized by the θ’s) must be consistent with the constraint GI(C|T2 ) − GI(C|T1 ) ≥ 0. Using the definition of Gini impurity in Eq. 1, we can rewrite 3 Figure 1: Network with 2 causes and 2 tests Figure 2: Augmented network with parameters and constraints Figure 3: Augmented network extended to handle inaccurate feedback the constraint for the network shown in Fig. 1 as follows: ∑ t1 ( ∑ (Pr(t1 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 Pr(t1 ) − Pr(t1 ) c ,c 1 2 ) ( ) ∑ ∑ (Pr(t2 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 − Pr(t2 ) − ≥0 Pr(t2 ) t c ,c 2 1 2 (4) Furthermore, using the Noisy-Or encoding from Eq. 3, we can rewrite the constraint as a polynomial in the θ’s. This polynomial is non-linear, and in general, not concave. The feasible space may consist of disconnected regions. Fig. 4 shows the surface corresponding to the polynomial for the 2 1 i i case where θ0 = 0 and θ1 = 0.5 for each test i, which leaves θ2 and θ2 as the only free variables. The parameters’ feasible space, satisfying the constraint consists of the two disconnected regions where the surface is positive. 4.2 AUGMENTED BAYES NETWORK Our objective is to learn the θ parameters of diagnostic Bayes networks given test constraints of the form described in Eq. 4. To deal with non-convex constraints and disconnected feasible regions, we pursue a Bayesian approach whereby we explicitly model the parameters and constraints as random variables in an augmented Bayes network (see Fig. 2). This allows us to frame the problem of learning the parameters as an inference problem in a hybrid Bayes network of discrete (T, C, V ) and continuous (Θ) variables. As we will see shortly, this augmented Bayes network provides a unifying framework to simultaneously learn from constraints and data, to deal with possibly inconsistent constraints, and to express preferences over the degree of satisfaction of the constraints. We encode the constraint derived from the expert feedback as a binary random variable V in the Bayes network. If V is true the constraint is satisfied, otherwise it is violated. Thus, if V is true then Θ lies in the positive region of Fig. 4, and if V is f alse then Θ lies in the negative region. We model the CPT for V as Pr(V |Θ) = max(0, π), where π = GI(C|T1 ) − GI(C|T2 ). Note that the value of GI(C|T ) lies in the interval [0,1], so the probability π will always be normalized. The intuition behind this definition of the CPT for V is that a constraint is more likely to be satisfied if the parameters lie in the interior of the constraint region. We place a Beta prior over each Θ parameter. Since the test variables are conditioned on the Θ parameters that are now part of the network, their conditional distributions become known. For instance, the conditional distribution for Ti (given in Eq. 3) is fully defined given the noisy-or parami eters θj . 
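
(Illustrative sketch, not the authors' implementation: the functions below implement the noisy-OR parameterization of Eq. (3) and the Gini-impurity surrogate of Eq. (1) described above by enumerating all cause configurations, so they are only suitable for toy-sized networks; all names are hypothetical.)

    import numpy as np
    from itertools import product

    def noisy_or(theta0, theta, cause_values):
        """P(T = true | causes) under the noisy-OR model of Eq. (3)."""
        p_false = 1.0 - theta0  # start from the leak term
        for reliability, active in zip(theta, cause_values):
            if active:
                p_false *= 1.0 - reliability
        return 1.0 - p_false

    def gini_impurity(prior, theta0, theta):
        """GI(C | T) of Eq. (1) for one test with a noisy-OR CPT and
        marginally independent binary causes with the given priors."""
        n = len(prior)
        gi = 0.0
        for t_val in (True, False):
            joint = []
            for config in product((True, False), repeat=n):
                p_c = np.prod([p if c else 1.0 - p for p, c in zip(prior, config)])
                p_t = noisy_or(theta0, theta, config)
                joint.append(p_c * (p_t if t_val else 1.0 - p_t))
            p_test = sum(joint)
            if p_test > 0:
                posterior = np.array(joint) / p_test
                gi += p_test * float(np.sum(posterior * (1.0 - posterior)))
        return gi

The test-consistency constraint of Eq. (2) then amounts to checking that the Gini impurity of the expert's chosen test does not exceed that of any alternative test.
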
Hence the problem of learning the parameters becomes an inference problem to compute posteriors over the parameters given that the constraint is satisfied (and any data). In practice, it is more convenient to obtain a single value for the parameters instead of a posterior distribution since it is easier to make diagnostic predictions based on one Bayes network. We estimate the parameters by computing a maximum a posteriori (MAP) hypothesis given that the constraint is satisfied (and any data): Θ∗ = arg maxΘ Pr(Θ|V = true). 4 Algorithm 1 Pseudo Code for Gibbs Sampling, Stochastic Hill Climbing and Greedy Search 1 Fix observed variables, let V = true and randomly sample feasible starting state S 2 for i = 1 to #samples 3 for j = 1 to #hiddenV ariables 4 acceptSample = f alse; k = 0 5 repeat 6 Sample s′ from conditional of j th hidden variable Sj 7 S′ = S; Sj = s′ 8 if Sj is cause or test, then acceptSample = true 9 elseif S′ obeys constraints V∗ 10 if algo == Gibbs 11 Sample u from uniform distribution, U(0,1) p(S′ 12 if u < M q(S)′ ) where p and q are the true and proposal distributions and M > 1 13 acceptSample = true 14 elseif algo = = StochasticHillClimbing 15 if likelihood(S′ ) > likelihood(S), then acceptSample = true 16 elseif algo = = Greedy, then acceptSample = true 17 elseif algo = = Greedy 18 k = k+1 19 if k = = maxIterations, then s′ = Sj ; acceptSample = true 20 until acceptSample = = true 21 Sj = s′ 4.3 MAP ESTIMATION Previous approaches for parameter learning with domain knowledge include modified versions of EM or some other optimization techniques that account for linear/convex constraints on the parameters. Since our constraints are non-convex, we propose a new approach based on Gibbs sampling to approximate the posterior distribution, from which we compute the MAP estimate. Although the technique converges to the MAP in the limit, it may require excessive time. Hence, we modify Gibbs sampling to obtain more efficient stochastic hill climbing and greedy search algorithms with anytime properties. The pseudo code for our Gibbs sampler is provided in Algorithm 1. The two key steps are sampling the conditional distributions of each variable (line 6) and rejection sampling to ensure that the constraints are satisfied (lines 9 and 12). We sample each variable given the rest according to the following distributions: ti ∼ Pr(Ti |c, θi ) ∀i cj ∼ Pr(Cj |c − cj , t, θ) ∝ ∏ Pr(Cj ) j ∏ (5) Pr(ti |c, θi ) ∀j i i θj ∼ Pr(Θi |Θ − Θi , t, c, v) ∝ Pr(v|t, Θ) j j ∏ Pr(ti |cj , θi ) ∀i, j (6) (7) i The tests and causes are easily sampled from the multinomials as described in the equations above. However, sampling the θ’s is more difficult due to the factor Pr(v|Θ, t) = max(0, π), which is a truncated mixture of Betas. So, instead of sampling θ from its true conditional, we sample it from a proposal distribution that replaces max(0, π) by an un-truncated mixture of Betas equal to π + a where a is a constant that ensures that π + a is always positive. This is equivalent to ignoring the constraints. Then we ensure that the constraints are satisfied by rejecting the samples that violate the constraints. Once Gibbs sampling has been performed, we obtain a sample that approximates the posterior distribution over the parameters given the constraints (and any data). We return a single setting of the parameters by selecting the sampled instance with the highest posterior probability (i.e., MAP estimate). 
Since we will only return the MAP estimate, it is possible to speed up the search by modifying Gibbs sampling. In particular, we obtain a stochastic hill climbing algorithm by accepting a new sample only if its posterior probability improves upon that of the previous sample 5 Posterior Probability 0.1 0.08 Difference in Gini Impurity 0.1 0.05 0 −0.05 0.06 0.04 0.02 −0.1 1 0 1 1 0.8 0.5 0.6 0.8 0.4 Link Reliability of Test 2 and Cause 1 0 0.6 0.2 0 0.4 Link Reliability of Test 2 and Cause 2 Figure 4: Difference in Gini impurity for the network in 1 2 Fig. 1 when θ2 and θ2 are the only parameters allowed to vary. 0.2 Link Reliability of Test 2 and Cause 1 0 0 0.2 0.4 0.6 0.8 1 Link Reliability of Test 2 and Cause 1 Figure 5: Posterior over parameters computed through calculation after discretization. Figure 6: Posterior over parameters calculated through Sampling. (line 15). Thus, each iteration of the stochastic hill climber requires more time, but always improves the solution. As the number of constraints grows and the feasibility region shrinks, the Gibbs sampler and stochastic hill climber will reject most samples. We can mitigate this by using a Greedy sampler that caps the number of rejected samples, after which it abandons the sampling for the current variable to move on to the next variable (line 19). Even though the feasibility region is small overall, it may still be large in some dimensions, so it makes sense to try sampling another variable (that may have a larger range of feasible values) when it is taking too long to find a new feasible value for the current variable. 4.4 MODEL REFINEMENT WITH INCONSISTENT CONSTRAINTS So far, we have assumed that the expert’s actions generate a feasible region as a consequence of consistent constraints. We handle inconsistencies by further extending our augmented diagnostic Bayes network. We treat the observed constraint variable, V , as a probabilistic indicator of the true constraint V ∗ as shown in Figure 3. We can easily extend our techniques for computing the MAP to cater for this new constraint node by sampling an extra variable. 5 EVALUATION AND EXPERIMENTS 5.1 EVALUATION CRITERIA Formally, for M ∗ , the true model that we aim to learn, the diagnostic process determines the choice of best next test as the one with the smallest Gini impurity. If the correct choice for the next test is known (such as demonstrated by an expert), we can use this information to include a constraint on the model. We denote by V+ the set of observed constraints and by V∗ the set of all possible constraints that hold for M ∗ . Having only observed V+ , our technique will consider any M + ∈ M+ as a possible true model, where M+ is the set of all models that obey V + . We denote by M∗ the set of all models that are diagnostically equivalent to M ∗ (i.e., obey V ∗ and would recommend the MAP same steps as M ∗ ) and by MV+ the particular model obtained by MAP estimation based on the MAP constraints V+ . Similarly, when a dataset D is available, we denote by MD the model obtained MAP by MAP estimation based on D and by MDV+ , the model based on D and V+ . Ideally we would like to find the true underlying model M ∗ , hence we will report the KL divergence between the models found and M ∗ . However, other diagnostically equivalent M ∗ may recommend the same tests as M ∗ and thus have similar constraints, so we also report test consistency with M ∗ (i.e., # of recommended tests that are the same). 
5.2 CORRECTNESS OF MODEL REFINEMENT Given V∗ , our technique for model adjustment is guaranteed to choose a model M MAP ∈ M∗ by construction. If any constraint V ∗ ∈ V∗ is violated, the rejection sampling step of our technique 6 100 Comparing convergence of Different Techniques 80 70 60 50 40 30 Data Only Constraints Only Data+Constraints 20 10 0 1 2 3 4 5 Number of constraints used 6 −10 −12 −14 −16 −18 7 −20 Figure 7: Mean KLdivergence and one standard deviation for a 3 cause 3 test network on learning with data, constraints and data+constraints. Gibbs Sampling Stochastic Hill Climbing Greedy Sampling −8 Negative Log Likelihood of MAP Estimate Percentage of tests correctly predicted 90 0 1 2 3 10 10 10 10 Elapsed Time (plotted on log scale from 0 to 1500 seconds) Figure 8: Test Consistency for a 3 cause 3 test network on learning with data, constraints and data+constraints. Figure 9: Convergence rate comparison. would reject that set of parameters. To illustrate this, consider the network in Fig. 2. There are six parameters (four link reliabilities and two leak parameters). Let us fix the leak parameters and the link reliability from the first cause to each test. Now we can compute the posterior surface over the two variable parameters after discretizing each parameter in small steps and then calculating the posterior probability at each step as shown in Fig. 5. We can compare this surface with that obtained after Gibbs sampling using our technique as shown in Fig. 6. We can see that our technique recovers the posterior surface from which we can compute the MAP. We obtain the same MAP estimate with the stochastic hill climbing and greedy search algorithms. 5.3 EXPERIMENTAL RESULTS ON SYNTHETIC PROBLEMS We start by presenting our results on a 3-cause by 3-test fully-connected bipartite Bayes network. We assume that there exists some M ∗ ∈ M∗ that we want to learn given V+ . We use our technique to find M MAP . To evaluate M MAP , we first compute the constraints, V∗ for M ∗ to get the feasible region associated with the true model. Next, we sample 100 other models from this feasible region that are diagnostically equivalent. We compare these models with M MAP (after collecting 200 samples with non-informative priors for the parameters). We compute the KL-divergence of M MAP with respect to each sampled model. We expect KLdivergence to decrease as the number of constraints in V+ increases since the feasible region beMAP comes smaller. Figure 7 confirms this trend and shows that MDV+ has lower mean KL-divergence MAP MAP than MV+ , which has lower mean KL-divergence than MD . The data points in D are limited to the results of the diagnostic sessions needed to obtain V+ . As constraints increase, more data is available and so the results for the data-only approach also improve with increasing constraints. We also compare the test consistency when learning from data only, constraints only or both. Given a fixed number of constraints, we enumerate the unobserved trajectories, and then compute the highest ranked test using the learnt model and the sampled true models, for each trajectory. The test consistency is reported as a percentage, with 100% consistency indicating that the learned and true models had the same highest ranked tests on every trajectory. Figure 8 presents these percentatges for the greedy sampling technique (the results are similar for the other techniques). 
It again appears that learning parameters with both constraints and data is better than learning with only constraints, which is most of the times better than learning with only data. Figure 9 compares the convergence rate of each technique to find the MAP estimate. As expected, Stochastic Hill Climbing and Greedy Sampling take less time than Gibbs sampling to find parameter settings with high posterior probability. 5.4 EXPERIMENTAL RESULTS ON REAL-WORLD PROBLEMS We evaluate our technique on a real-world diagnostic network collected and reported by Agosta et al. [1], where the authors collected detailed session logs over a period of seven weeks in which the 7 KL−divergence of when computing joint over all tests 8 Figure 10: Diagnostic Bayesian network collected from user trials and pruned to retain sub-networks with at least one constraint Data Only Constraints Only Data+Constraints 7 6 5 4 3 2 1 6 8 10 12 14 16 Number of constraints used 18 20 22 Figure 11: KL divergence comparison as the number of constraints increases for the real world problem. entire diagnostic sequence was recorded. The sequences intermingle model building and querying phases. The model network structure was inferred from an expert’s sequence of positing causes and tests. Test-ranking constraints were deduced from the expert’s test query sequences once the network structure is established. The 157 sessions captured over the seven weeks resulted in a Bayes network with 115 tests, 82 root causes and 188 arcs. The network consists of several disconnected sub-networks, each identified with a symptom represented by the first test in the sequence, and all subsequent tests applied within the same subnet. There were 20 sessions from which we were able to observe trajectories with at least two tests, resulting in a total of 32 test constraints. We pruned our diagnostic network to remove the sub-networks with no constraints to get a Bayes network with 54 tests, 30 root causes, and 67 parameters divided in 7 sub-networks, as shown in Figure 10, on which we apply our model refinement technique to learn the parameters for each sub-network separately. Since we don’t have the true underlying network and the full set of constraints (more constraints could be observed in future diagnostic sessions), we treated the 32 constraints as if they were V∗ and the corresponding feasible region M∗ as if it contained models diagnostically equivalent to the unknown true model. Figure 11 reports the KL divergence between the models found by our algorithms and sampled models from M∗ as we increase the number of constraints. With such limited constraints and consequently large feasible regions, it is not surprising that the variation in KL divergence is large. Again, the MAP estimate based on both the constraints and the data has lower KL divergence than constraints only and data only. 6 CONCLUSION AND FUTURE WORK In summary, we presented an approach that can learn the parameters of a Bayes network based on constraints implied by test consistency and any data available. While several approaches exist to incorporate qualitative constraints in learning procedures, our work makes two important contributions: First, this is the first approach that exploits implicit constraints based on value of information assessments. Secondly it is the first approach that can handle non-convex constraints. We demonstrated the approach on synthetic data and on a real-world manufacturing diagnostic problem. 
Since data is generally sparse in diagnostics, this work makes an important advance to mitigate the model acquisition bottleneck, which has prevented the widespread application of diagnostic networks so far. In the future, it would be interesting to generalize this work to reinforcement learning in applications where data is sparse, but constraints may be inferred from expert interactions. Acknowledgments This work was supported by a grant from Intel Corporation. 8 References [1] John Mark Agosta, Omar Zia Khan, and Pascal Poupart. Evaluation results for a query-based diagnostics application. In The Fifth European Workshop on Probabilistic Graphical Models (PGM 10), Helsinki, Finland, September 13–15 2010. [2] Eric E. Altendorf, Angelo C. Restificar, and Thomas G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of Twenty First Conference on Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 2005. [3] Brigham S. Anderson and Andrew W. Moore. Fast information value for graphical models. In Proceedings of Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 51–58, Vancouver, BC, Canada, December 2005. [4] Cassio P. de Campos and Qiang Ji. Improving Bayesian network parameter learning using constraints. In International Conference in Pattern Recognition (ICPR), Tampa, FL, USA, 2008. [5] Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks: combining qualitative and quantitative information. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 141–148, Montreal, QC, Canada, 1995. [6] Ad J. Feelders. A new parameter learning method for Bayesian networks with qualitative influences. In Proceedings of Twenty Third International Conference on Uncertainty in Artificial Intelligence (UAI), Vancouver, BC, July 2007. [7] Mara Angeles Gil and Pedro Gil. A procedure to test the suitability of a factor for stratification in estimating diversity. Applied Mathematics and Computation, 43(3):221 – 229, 1991. [8] David Heckerman and John S. Breese. Causal independence for probability assessment and inference using bayesian networks. IEEE Systems, Man, and Cybernetics, 26(6):826–831, November 1996. [9] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–56, 1995. [10] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, August 1966. [11] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential families. In Proceedings of Twenty Sixth Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009. [12] Wenhui Liao and Qiang Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42:3046–3056, 2009. [13] Yi Mao and Guy Lebanon. Domain knowledge uncertainty and probabilistic parameter constraints. In Proceedings of Twenty Fifth Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, QC, Canada, 2009. [14] Ryszard S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111–116, 1984. [15] Radu Stefan Niculescu, Tom M. Mitchell, and R. Bharat Rao. Bayesian network learning with parameter constraints. Journal of Machine Learning Research, 7:1357–1383, 2006. [16] Mark A. Peot and Ross D. Shachter. Learning from what you dont observe. 
[17] Michael P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44(3):257–303, August 1990.
[18] Frank Wittig and Anthony Jameson. Exploiting qualitative knowledge in the learning of conditional probabilities of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, July 2000.

6 0.48552322 288 nips-2011-Thinning Measurement Models and Questionnaire Design

7 0.46846816 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing

8 0.455971 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty

9 0.45575291 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

10 0.44834504 285 nips-2011-The Kernel Beta Process

11 0.44806582 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression

12 0.44578442 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors

13 0.44321373 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise

14 0.43948412 217 nips-2011-Practical Variational Inference for Neural Networks

15 0.43755412 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

16 0.43336725 250 nips-2011-Shallow vs. Deep Sum-Product Networks

17 0.42728207 258 nips-2011-Sparse Bayesian Multi-Task Learning

18 0.42660168 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions

19 0.41914397 150 nips-2011-Learning a Distance Metric from a Network

20 0.41103789 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.012), (4, 0.034), (20, 0.018), (26, 0.029), (31, 0.081), (33, 0.033), (43, 0.081), (45, 0.098), (56, 0.356), (57, 0.048), (65, 0.013), (74, 0.044), (83, 0.046), (84, 0.01), (99, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81032413 34 nips-2011-An Unsupervised Decontamination Procedure For Improving The Reliability Of Human Judgments

Author: Michael C. Mozer, Benjamin Link, Harold Pashler

Abstract: Psychologists have long been struck by individuals’ limitations in expressing their internal sensations, impressions, and evaluations via rating scales. Instead of using an absolute scale, individuals rely on reference points from recent experience. This relativity of judgment limits the informativeness of responses on surveys, questionnaires, and evaluation forms. Fortunately, the cognitive processes that map stimuli to responses are not simply noisy, but rather are influenced by recent experience in a lawful manner. We explore techniques to remove sequential dependencies, and thereby decontaminate a series of ratings to obtain more meaningful human judgments. In our formulation, the problem is to infer latent (subjective) impressions from a sequence of stimulus labels (e.g., movie names) and responses. We describe an unsupervised approach that simultaneously recovers the impressions and parameters of a contamination model that predicts how recent judgments affect the current response. We test our iterated impression inference, or I3, algorithm in three domains: rating the gap between dots, the desirability of a movie based on an advertisement, and the morality of an action. We demonstrate significant objective improvements in the quality of the recovered impressions. 1

same-paper 2 0.74716526 60 nips-2011-Confidence Sets for Network Structure

Author: David S. Choi, Patrick J. Wolfe, Edoardo M. Airoldi

Abstract: Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of residual network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximumlikelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure. 1

3 0.65626574 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude. 1

4 0.61091244 67 nips-2011-Data Skeletonization via Reeb Graphs

Author: Xiaoyin Ge, Issam I. Safa, Mikhail Belkin, Yusu Wang

Abstract: Recovering hidden structure from complex and noisy non-linear data is one of the most fundamental problems in machine learning and statistical inference. While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional. Furthermore, there are many scenarios where the underlying structure is graph-like, e.g., river/road networks or various trajectories. In this paper, we develop a framework to extract, as well as to simplify, a one-dimensional “skeleton” from unorganized data using the Reeb graph. Our algorithm is very simple, does not require complex optimizations and can be easily applied to unorganized high-dimensional data such as point clouds or proximity graphs. It can also represent arbitrary graph structures in the data. We also give theoretical results to justify our method. We provide a number of experiments to demonstrate the effectiveness and generality of our algorithm, including comparisons to existing methods, such as principal curves. We believe that the simplicity and practicality of our algorithm will help to promote skeleton graphs as a data analysis tool for a broad range of applications.

5 0.47529697 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension

Author: Faisal Khan, Bilge Mutlu, Xiaojin Zhu

Abstract: We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot. Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner’s expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning. 1

6 0.46522722 280 nips-2011-Testing a Bayesian Measure of Representativeness Using a Large Image Database

7 0.45970669 15 nips-2011-A rational model of causal inference with continuous causes

8 0.44757769 35 nips-2011-An ideal observer model for identifying the reference frame of objects

9 0.44665292 258 nips-2011-Sparse Bayesian Multi-Task Learning

10 0.44214746 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

11 0.4420191 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

12 0.4416461 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

13 0.44150981 24 nips-2011-Active learning of neural response functions with Gaussian processes

14 0.44117785 219 nips-2011-Predicting response time and error rates in visual search

15 0.43983901 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

16 0.43859565 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

17 0.43840873 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation

18 0.43784156 276 nips-2011-Structured sparse coding via lateral inhibition

19 0.43686175 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

20 0.43661898 180 nips-2011-Multiple Instance Filtering