nips nips2011 nips2011-62 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Duy Q. Vu, David Hunter, Padhraic Smyth, Arthur U. Asuncion
Abstract: The development of statistical models for continuous-time longitudinal network data is of increasing interest in machine learning and social science. Leveraging ideas from survival and event history analysis, we introduce a continuous-time regression modeling framework for network event data that can incorporate both time-dependent network statistics and time-varying regression coefficients. We also develop an efficient inference scheme that allows our approach to scale to large networks. On synthetic and real-world data, empirical results demonstrate that the proposed inference approach can accurately estimate the coefficients of the regression model, which is useful for interpreting the evolution of the network; furthermore, the learned model has systematically better predictive performance compared to standard baseline methods.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract The development of statistical models for continuous-time longitudinal network data is of increasing interest in machine learning and social science. [sent-12, score-0.371]
2 Leveraging ideas from survival and event history analysis, we introduce a continuous-time regression modeling framework for network event data that can incorporate both time-dependent network statistics and time-varying regression coefficients. [sent-13, score-0.585]
3 1 Introduction The analysis of the structure and evolution of network data is an increasingly important task in a variety of disciplines, including biology and engineering. [sent-16, score-0.149]
4 The emergence and growth of largescale online social networks also provides motivation for the development of longitudinal models for networks over time. [sent-17, score-0.363]
5 While in many cases the data for an evolving network are recorded on a continuous time scale, a common approach is to analyze “snapshot” data (also known as collapsed panel data), where multiple cross-sectional snapshots of the network are recorded at discrete time points. [sent-18, score-0.314]
6 Various statistical frameworks have been previously proposed for discrete snapshot data, including dynamic versions of exponential random graph models [1, 2, 3] as well as dynamic block models and matrix factorization methods [4, 5]. [sent-19, score-0.126]
7 In contrast, there is relatively little work to date on continuous-time models for large-scale longitudinal networks. [sent-20, score-0.172]
8 In this paper, we propose a general regression-based modeling framework for continuous-time network event data. [sent-21, score-0.193]
9 Our methods are inspired by survival and event history analysis [6, 7]; specifically, we employ multivariate counting processes to model the edge dynamics of the network. [sent-22, score-0.293]
10 Building on recent work in this context [8, 9], we use both multiplicative and additive intensity functions that allow for the incorporation of arbitrary time-dependent network statistics; furthermore, we consider ∗ current affiliation: Google Inc. [sent-23, score-0.249]
11 1 time-varying regression coefficients for the additive approach. [sent-24, score-0.111]
12 The additive form in particular enables us to develop an efficient online inference scheme for estimating the time-varying coefficients of the model, allowing the approach to scale to large networks. [sent-25, score-0.122]
13 On synthetic and real-world data, we show that the proposed scheme accurately estimates these coefficients and that the learned model is useful for both interpreting the evolution of the network and predicting future network events. [sent-26, score-0.329]
14 The next section introduces the general regression framework and the associated inference scheme is described in detail in Section 3. [sent-28, score-0.09]
15 2 Regression models for continuous-time network data Below we introduce multiplicative and additive regression models for the edge formation process in a longitudinal network. [sent-31, score-0.626]
16 We also describe non-recurrent event models and give examples of timedependent statistics in this context. [sent-32, score-0.129]
17 1 General framework Assume in our network that nodes arrive according to some stochastic process and directed edges among these nodes are created over time. [sent-34, score-0.309]
18 Given the ordered pair (i, j) of nodes in the network at time t, let Nij (t) be a counting process denoting the number of edges from i to j up to time t. [sent-35, score-0.362]
19 Combining the individual counting processes of all potential edges gives a multivariate counting process N(t) = (Nij (t) : i, j ∈ {1, . [sent-37, score-0.26]
20 n}, i = j); we make no assumption about the independence of individual edge counting processes. [sent-40, score-0.149]
21 ) We do not consider an edge dissolution process in this paper, although in theory it is possible to do so by placing a second counting process on each edge for dissolution events. [sent-42, score-0.345]
22 ) As proposed in [9], we model the multivariate counting process via the Doob-Meyer decomposition [7], t N(t) = λ(s) ds + M(t), (1) 0 where essentially λ(t) and M(t) may be viewed as the (deterministic) signal and (martingale) noise, respectively. [sent-44, score-0.111]
23 In equations (2) and (3), s(i, j, t) is a vector of p statistics for directed edge (i, j) constructed based on Ht− ; examples of these statistics are given in Section 2. [sent-49, score-0.154]
24 In each of the two models, the intensity process depends on a linear combination of the coefficients β, which can be time-varying in the additive Aalen formulation. [sent-51, score-0.129]
25 2 Non-recurrent event models for network formation processes If tarr and tarr are the arrival times of nodes i and j, then the risk indicator of equations (2) and (3) i j is Yij (t) = I max(tarr , tarr ) < t ≤ teij . [sent-60, score-0.524]
26 The time teij of directed edge (i, j) is taken to be +∞ j i if the edge is never formed during the observation time. [sent-61, score-0.224]
27 The reason for the upper bound teij is that the counting process is non-recurrent; i. [sent-62, score-0.151]
28 , formation of an edge means that it can never occur again. [sent-64, score-0.151]
29 The network statistics s(i, j, t) of equations (2) and (3), corresponding to the ordered pair (i, j), can be time-invariant (such as gender match) or time-dependent (such as the number of two-paths from i to j just before time t). [sent-65, score-0.158]
30 Since it has been found empirically that most new edges in social networks are created between nodes separated by two hops [11], we limit our statistics to the following: 1. [sent-66, score-0.251]
31 Out-degree of sender i: s1 (i, j, t) = h∈V,h=i Nih (t− ) 2. [sent-67, score-0.093]
32 In-degree of sender i: s2 (i, j, t) = h∈V,h=i Nhi (t− ) 3. [sent-68, score-0.093]
33 Shared contacters: s9 (i, j, t) = h∈V,h=i,j Nhi (t− )Nhj (t− ) Here Nji (t− ) denotes the value of the counting process (i, j) right before time t. [sent-75, score-0.131]
34 Such models are useful for data where interaction edges occur multiple times (e. [sent-77, score-0.084]
35 3 Inference techniques In this section, we describe algorithms for estimating the coefficients of the multiplicative Cox and additive Aalen models. [sent-80, score-0.091]
36 We also discuss an efficient online inference technique for the Aalen model. [sent-81, score-0.066]
37 1 Estimation for the Cox model Recent work has posited Cox models similar to (2) with the goal of estimating general network effects [8, 12] or citation network effects [9]. [sent-83, score-0.277]
38 We will illustrate this method through the computation of the likelihood, where the most expensive computation is for the denominator n Yij (te ) exp β ⊤ s(i, j, te ) . [sent-88, score-0.697]
39 A na¨ve calculation of log L(β) needs O(mpn2 ) operations ı (where m is the number of edge events), which is costly since m and n may be large. [sent-91, score-0.064]
40 3 Alternatively, as in [9], we may simply write κ(te ) = κ(te−1 ) + ∆κ(te ), where ∆κ(te ) entails all of the possible changes that occur during the time interval [te−1 , te ). [sent-93, score-0.717]
41 Since we assume in this paper that edges do not dissolve, it is necessary to keep track only of the group of edges whose covariates change during this interval, which we call Ue−1 , and those that first become at risk during this interval, which we call Ce−1 . [sent-94, score-0.128]
42 These groups of edges may be cached in memory during an initialization step; then, subsequent calculations of ∆κ(te ) are simple functions of the values of s(i, j, te−1 ) and s(i, j, te ) for (i, j) in these two groups (for Ce−1 , only the time te statistic is relevant). [sent-95, score-1.498]
43 The number of edges cached at each time step tends to be small, generally O(n) because our network statistics s are limited to those based on node degrees and two-paths. [sent-96, score-0.242]
44 The most expensive computation is to update the (p + 1) × (p + 1) matrix A(te ) at every event time te ; inverting A(te ) is not expensive since p is relatively small. [sent-113, score-0.799]
45 2, if n is the current number of nodes, the cost of na¨vely calculating Akl (te ) by iterating through all “at-risk” edges is nearly n2 . [sent-117, score-0.064]
46 In other cases, there may be restrictions on the set of edges at risk at a particular time. [sent-120, score-0.064]
47 Our online inference algorithm during the time interval [te−1 , te ) may be summarized as follows: 1. [sent-122, score-0.783]
48 Compute and cache the network statistics changed by the event e − 1, then initialize Ue−1 with a list of those at-risk edges whose network statistics are changed by this event. [sent-127, score-0.442]
49 Compute and cache all values of network statistics changed during the time interval [te−1 , te ). [sent-129, score-0.875]
50 Define Ce−1 as the set of edges that switch to at-risk during this interval. [sent-130, score-0.064]
51 Before considering the event e: (a) Compute look-ahead summations at time te−1 indexed by Ue−1 . [sent-132, score-0.123]
52 (c) Compute forward summations at time te indexed by Ue−1 and Ce−1 . [sent-134, score-0.738]
53 Assuming that the number n of nodes stays roughly the same over each of the m edge events, the overall computational complexity of this online inference algorithm ˆ is thus O(p2 n2 + m(p2 n + p3 )). [sent-136, score-0.166]
54 4 Experimental analysis In this section, we empirically analyze the ability of our inference methods to estimate the regression coefficients as well as the predictive power of the learned models. [sent-138, score-0.125]
55 In particular, we simulate a network formation process starting from time unit 0 until time 1200, where nodes arrive in the network at a constant rate λ0 = 10 (i. [sent-141, score-0.411]
56 , on average, 10 nodes join the network at each time unit); the resulting simulated networks have 11,997 nodes. [sent-143, score-0.203]
57 The edge formation process is simulated via Otaga’s modified thinning algorithm [14] with an additive conditional intensity function. [sent-144, score-0.28]
58 For SIM-2, between times 1000 and 1150, we increase the coefficients for transitivity and shared contacters to β6 = β9 = 4 × 10−5, and after 1150, the coefficients return to their original values; in this case, 127,590 edges are created. [sent-147, score-0.327]
59 IRVINE is a longitudinal data set derived from an online social network of students at UC Irvine [15]. [sent-149, score-0.382]
60 This dataset has 1,899 users and 20,296 directed contact edges between users, with timestamps for each node arrival and edge creation event. [sent-150, score-0.238]
61 This longitudinal network spans April to October of 2004. [sent-151, score-0.263]
62 This dataset has 51,362 users and 76,791 directed contact edges between users. [sent-153, score-0.155]
63 Note that both data sets are non-recurrent in that the creation of an edge between two nodes only occurs at most once. [sent-155, score-0.1]
64 These plots suggest that there are two distinct phases of network evolution, consistent with an independent analysis of these data [15]. [sent-163, score-0.131]
65 Here, the network effects continuously change during the observation time. [sent-165, score-0.111]
66 1 Recovering the time-varying regression coefficients This section focuses on the ability of our additive Aalen modeling approach to estimate the timevarying coefficients, given an observed longitudinal network. [sent-167, score-0.263]
67 3 and use an Epanechnikov smoothing kernel (with a bandwidth of 10 time units) to obtain smoothed coefficients. [sent-170, score-0.066]
68 On SIM-1, Figures 1(a,b) show the estimated coefficients associated with the transitivity and shared contacters statistics, as well as the ground-truth coefficients. [sent-171, score-0.285]
69 On the IRVINE and METAFILTER data, we also learn time-varying coefficients which are useful for interpreting network evolution. [sent-177, score-0.141]
70 In the first phase of network formation, the network grows at an accelerated rate. [sent-180, score-0.222]
71 Positive coefficients for sender outdegree, reciprocity, and transitivity in these plots imply that users with a high numbers of friends tend to make more friends, tend to reciprocate their relations, and tend to make friends with their friends’ friends, respectively. [sent-181, score-0.366]
72 However, these coefficients decrease towards zero (the blue line) and enter a second phase where the network is structurally stable. [sent-182, score-0.111]
73 Interestingly, the coefficients suggest that there is a marked change in the edge formation process around 7/10/10. [sent-185, score-0.177]
74 2 Predicting future links We perform rolling prediction experiments over the real-world data sets to evaluate the predictive power of the learned regression models. [sent-190, score-0.115]
75 Following the evaluation methodology of [9], we split each longitudinal data set into three periods: a statistics-building period, a training period, and a test period (Table 1). [sent-191, score-0.199]
76 The statistics-building period is used solely to build up the network statistics, while the training period is used to learn the coefficients and the test period is used to make predictions. [sent-192, score-0.252]
77 Furthermore, for the additive Aalen model, we use the online inference technique from Section 3. [sent-194, score-0.122]
78 When we predict an event in the test period, all the previous events from the test period are used as training data as well. [sent-196, score-0.159]
79 The baseline that we consider is logistic regression (LR) with the same time-dependent statistics used in the Aalen and Cox models. [sent-201, score-0.111]
80 Note that logistic regression is a competitive baseline that has been used in previous link prediction studies (e. [sent-202, score-0.123]
81 We also use case control sampling to address the imbalance between positive and negative cases (since at each “positive” edge event there are order of n2 “negative” training cases). [sent-206, score-0.146]
82 To make predictions using the additive Aalen model, one would need to extrapolate the time-varying coefficients to future time points. [sent-209, score-0.076]
83 We evaluate the predictive performance of the Aalen model (with smoothing windows of 1 and 10), the Cox model, and the LR baseline (with case control ratios 1:10 and 1:50). [sent-213, score-0.084]
84 5 Related Work and Conclusions Evolving networks have been descriptively analyzed in exploratory fashion in a variety of domains, including email data [16], citation graphs [17], and online social networks [18]. [sent-231, score-0.247]
85 Such methods operate on cross-sectional snapshot data, while our framework models continuous-time network event data. [sent-233, score-0.243]
86 It is worth noting that continuous-time Markov process models for longitudinal networks have been proposed previously [21]; however, these approaches have only been applied to very small networks, while our regression-based approach can scale to large networks. [sent-234, score-0.234]
87 Recently, there has also been work on inferring unobserved time-varying networks from evolving nodal attributes which are observed [22, 23, 24]. [sent-235, score-0.123]
88 More recently, survival and event history models based on the Cox model have been applied to network data [8, 12, 9]. [sent-237, score-0.275]
89 A significant difference between our previous work [9] and this paper is that scalability is achieved in our earlier work by restricting the approach to “egocentric” modeling, in which counting processes are placed only on nodes. [sent-238, score-0.085]
90 In contrast, here we formulate scalable inference techniques for the general “relational” setting where counting processes are placed on edges. [sent-239, score-0.12]
91 Prior work also assumed static regression coefficients, while here we develop a framework for time-varying coefficients for the additive Aalen model. [sent-240, score-0.111]
92 Regression models with varying coefficients have been previously proposed in other contexts [25], including a time-varying version of the Cox model [26], although to the best of our knowledge such models have not been developed or fitted on longitudinal networks. [sent-241, score-0.192]
93 While our focus is not on feature engineering, we note that arbitrary network and nodal features such as those developed for link prediction can be incorporated into our continuous-time regression framework. [sent-246, score-0.24]
94 In summary, we have developed multiplicative and additive regression models for large-scale continuous-time longitudinal networks. [sent-249, score-0.318]
95 On simulated and real-world data, we have shown that the proposed inference approach can accurately estimate regression coefficients and that the learned model can be used for interpreting network evolution and predicting future network events. [sent-250, score-0.419]
96 Discovering long range properties of social networks with multivalued time-inhomogeneous models. [sent-266, score-0.124]
97 A dynamic relational infinite feature model for longitudinal social networks. [sent-297, score-0.287]
98 Supervised random walks: Predicting and recommending links in social networks. [sent-347, score-0.113]
99 Recovering time-varying networks of dependencies in social and biological studies. [sent-448, score-0.124]
100 Predicting positive and negative links in online social networks. [sent-503, score-0.144]
wordName wordTfidf (topN-words)
[('te', 0.697), ('aalen', 0.348), ('cox', 0.208), ('cients', 0.177), ('coef', 0.156), ('longitudinal', 0.152), ('coefficient', 0.139), ('metafilter', 0.119), ('transitivity', 0.119), ('irvine', 0.118), ('network', 0.111), ('contacters', 0.106), ('lr', 0.098), ('sender', 0.093), ('social', 0.088), ('formation', 0.087), ('counting', 0.085), ('event', 0.082), ('ue', 0.07), ('reciprocity', 0.066), ('edges', 0.064), ('edge', 0.064), ('friends', 0.06), ('yij', 0.059), ('akl', 0.058), ('additive', 0.056), ('regression', 0.055), ('evolving', 0.052), ('period', 0.047), ('intensity', 0.047), ('epanechnikov', 0.046), ('receiver', 0.044), ('tarr', 0.043), ('survival', 0.042), ('dissolution', 0.04), ('nhi', 0.04), ('nhj', 0.04), ('nij', 0.04), ('njh', 0.04), ('teij', 0.04), ('link', 0.039), ('shared', 0.038), ('evolution', 0.038), ('networks', 0.036), ('nodes', 0.036), ('directed', 0.036), ('inference', 0.035), ('nodal', 0.035), ('citation', 0.035), ('predictive', 0.035), ('multiplicative', 0.035), ('users', 0.034), ('asuncion', 0.034), ('bk', 0.033), ('online', 0.031), ('na', 0.031), ('ce', 0.031), ('snapshot', 0.03), ('interpreting', 0.03), ('events', 0.03), ('baseline', 0.029), ('dynamic', 0.028), ('caching', 0.027), ('statistics', 0.027), ('borgan', 0.026), ('contactees', 0.026), ('egocentric', 0.026), ('nji', 0.026), ('testset', 0.026), ('wijk', 0.026), ('wijl', 0.026), ('bandwidth', 0.026), ('process', 0.026), ('links', 0.025), ('ht', 0.025), ('nih', 0.024), ('estimated', 0.022), ('acm', 0.021), ('email', 0.021), ('blockmodel', 0.021), ('contact', 0.021), ('hunter', 0.021), ('je', 0.021), ('summations', 0.021), ('time', 0.02), ('accurately', 0.02), ('models', 0.02), ('cached', 0.02), ('cache', 0.02), ('ahmed', 0.02), ('hazard', 0.02), ('history', 0.02), ('smoothing', 0.02), ('phases', 0.02), ('song', 0.02), ('predicting', 0.019), ('wij', 0.019), ('relational', 0.019), ('arrival', 0.019), ('ie', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
Author: Duy Q. Vu, David Hunter, Padhraic Smyth, Arthur U. Asuncion
Abstract: The development of statistical models for continuous-time longitudinal network data is of increasing interest in machine learning and social science. Leveraging ideas from survival and event history analysis, we introduce a continuous-time regression modeling framework for network event data that can incorporate both time-dependent network statistics and time-varying regression coefficients. We also develop an efficient inference scheme that allows our approach to scale to large networks. On synthetic and real-world data, empirical results demonstrate that the proposed inference approach can accurately estimate the coefficients of the regression model, which is useful for interpreting the evolution of the network; furthermore, the learned model has systematically better predictive performance compared to standard baseline methods.
2 0.16236921 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors
Author: Hsiu-chin Lin, Vickie Baracos, Russell Greiner, Chun-nam J. Yu
Abstract: An accurate model of patient survival time can help in the treatment and care of cancer patients. The common practice of providing survival time estimates based only on population averages for the site and stage of cancer ignores many important individual differences among patients. In this paper, we propose a local regression method for learning patient-specific survival time distribution based on patient attributes such as blood tests and clinical assessments. When tested on a cohort of more than 2000 cancer patients, our method gives survival time predictions that are much more accurate than popular survival analysis models such as the Cox and Aalen regression models. Our results also show that using patient-specific attributes can reduce the prediction error on survival time by as much as 20% when compared to using cancer site and stage only. 1
3 0.085453957 60 nips-2011-Confidence Sets for Network Structure
Author: David S. Choi, Patrick J. Wolfe, Edoardo M. Airoldi
Abstract: Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of residual network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximumlikelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure. 1
4 0.075158127 275 nips-2011-Structured Learning for Cell Tracking
Author: Xinghua Lou, Fred A. Hamprecht
Abstract: We study the problem of learning to track a large quantity of homogeneous objects such as cell tracking in cell culture study and developmental biology. Reliable cell tracking in time-lapse microscopic image sequences is important for modern biomedical research. Existing cell tracking methods are usually kept simple and use only a small number of features to allow for manual parameter tweaking or grid search. We propose a structured learning approach that allows to learn optimum parameters automatically from a training set. This allows for the use of a richer set of features which in turn affords improved tracking compared to recently reported methods on two public benchmark sequences. 1
5 0.054790132 217 nips-2011-Practical Variational Inference for Neural Networks
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
6 0.050344273 8 nips-2011-A Model for Temporal Dependencies in Event Streams
7 0.045194857 101 nips-2011-Gaussian process modulated renewal processes
8 0.044444054 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
9 0.043316193 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
10 0.040906738 131 nips-2011-Inference in continuous-time change-point models
11 0.040114574 276 nips-2011-Structured sparse coding via lateral inhibition
12 0.039769672 250 nips-2011-Shallow vs. Deep Sum-Product Networks
13 0.039593771 26 nips-2011-Additive Gaussian Processes
14 0.038692538 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
15 0.03745316 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
16 0.037254412 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
17 0.035967957 150 nips-2011-Learning a Distance Metric from a Network
18 0.035671804 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
19 0.035659235 258 nips-2011-Sparse Bayesian Multi-Task Learning
20 0.034689523 302 nips-2011-Variational Learning for Recurrent Spiking Networks
topicId topicWeight
[(0, 0.111), (1, 0.024), (2, 0.013), (3, -0.011), (4, -0.019), (5, -0.055), (6, -0.014), (7, -0.014), (8, 0.02), (9, -0.002), (10, -0.043), (11, -0.029), (12, 0.011), (13, -0.008), (14, -0.02), (15, 0.005), (16, -0.008), (17, -0.014), (18, 0.01), (19, -0.012), (20, -0.005), (21, 0.031), (22, 0.018), (23, 0.042), (24, -0.018), (25, 0.044), (26, -0.072), (27, -0.075), (28, -0.018), (29, 0.009), (30, 0.051), (31, 0.006), (32, 0.006), (33, 0.111), (34, -0.015), (35, -0.099), (36, 0.001), (37, 0.106), (38, 0.049), (39, -0.014), (40, 0.035), (41, 0.001), (42, -0.028), (43, 0.104), (44, 0.048), (45, 0.049), (46, 0.025), (47, 0.056), (48, -0.027), (49, 0.219)]
simIndex simValue paperId paperTitle
same-paper 1 0.92017502 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
Author: Duy Q. Vu, David Hunter, Padhraic Smyth, Arthur U. Asuncion
Abstract: The development of statistical models for continuous-time longitudinal network data is of increasing interest in machine learning and social science. Leveraging ideas from survival and event history analysis, we introduce a continuous-time regression modeling framework for network event data that can incorporate both time-dependent network statistics and time-varying regression coefficients. We also develop an efficient inference scheme that allows our approach to scale to large networks. On synthetic and real-world data, empirical results demonstrate that the proposed inference approach can accurately estimate the coefficients of the regression model, which is useful for interpreting the evolution of the network; furthermore, the learned model has systematically better predictive performance compared to standard baseline methods.
2 0.86509192 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors
Author: Hsiu-chin Lin, Vickie Baracos, Russell Greiner, Chun-nam J. Yu
Abstract: An accurate model of patient survival time can help in the treatment and care of cancer patients. The common practice of providing survival time estimates based only on population averages for the site and stage of cancer ignores many important individual differences among patients. In this paper, we propose a local regression method for learning patient-specific survival time distribution based on patient attributes such as blood tests and clinical assessments. When tested on a cohort of more than 2000 cancer patients, our method gives survival time predictions that are much more accurate than popular survival analysis models such as the Cox and Aalen regression models. Our results also show that using patient-specific attributes can reduce the prediction error on survival time by as much as 20% when compared to using cancer site and stage only. 1
3 0.5537557 60 nips-2011-Confidence Sets for Network Structure
Author: David S. Choi, Patrick J. Wolfe, Edoardo M. Airoldi
Abstract: Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of residual network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximumlikelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure. 1
4 0.49305445 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
Author: Omar Z. Khan, Pascal Poupart, John-mark M. Agosta
Abstract: In this paper, we derive a method to refine a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test’s relative diagnostic value. We demonstrate that consistency with an expert’s test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to find a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario. 1 INTRODUCTION The problem of learning-by-example has the promise to create strong models from a restricted number of cases; certainly humans show the ability to generalize from limited experience. Machine Learning has seen numerous approaches to learning task performance by imitation, going back to some of the approaches to inductive learning from examples [14]. Of particular interest are problemsolving tasks that use a model to infer the source, or cause of a problem from a sequence of investigatory steps or tests. The specific example we adopt is a diagnostic task such as appears in medicine, electro-mechanical fault isolation, customer support and network diagnostics, among others. We define a diagnostic sequence as consisting of the assignment of values to a subset of tests. The diagnostic process embodies the choice of the best next test to execute at each step in the sequence, by measuring the diagnostic value among the set of available tests at each step, that is, the ability of a test to distinguish among the possible causes. One possible implementation with which to carry out this process, the one we apply, is a Bayes network [9]. As with all model-based approaches, provisioning an adequate model can be daunting, resulting in a “knowledge elicitation bottleneck.” A recent approach for easing the bottleneck grew out of the realization that the best time to gain an expert’s insight into the model structure is during the diagnostic process. Recent work in “QueryBased Diagnostics” [1] demonstrated a way to improve model quality by merging model use and model building into a single process. More precisely the expert can take steps to modify the network structure to add or remove nodes or links, interspersed within the diagnostic sequence. In this paper we show how to extend this variety of learning-by-example to include also refinement of model parameters based on the expert’s choice of test, from which we determine constraints. The nature of these constraints, as shown herein, is derived from the value of the tests to distinguish causes, a value referred to informally as value of information [10]. It is the effect of these novel constraints on network parameter learning that is elucidated in this paper. ∗ J. M. Agosta is no longer affiliated with Intel Corporation 1 Conventional statistical learning approaches are not suited to this problem, since the number of cases available from diagnostic sessions is small, and the data from any case is sparse. (Only a fraction of the tests are taken.) But more relevant is that one diagnostic sequence from an expert user represents the true behavior expected of the model, rather than a noisy realization of a case generated by the true model. We adopt a Bayesian approach, which offers a principled way to incorporate knowledge (constraints and data, when available) and also consider weakening the constraints, by applying a likelihood to them, so that possibly conflicting constraints can be incorporated consistently. Sec. 2 reviews related work and Sec. 3 provides some background on diagnostic networks and model consistency. Then, Sec. 4 describes an augmented Bayesian network that incorporates constraints implied by an expert’s choice of tests. Some sampling techniques are proposed to find the Maximum a posterior setting of the parameters given the constraints (and any data available). The approach is evaluated in Sec. 5 on synthetic data and a real world manufacturing diagnostic scenario. Finally, Sec. 6 discusses some future work. 2 RELATED WORK Parameter learning for Bayesian networks can be viewed as searching in a high-dimensional space. Adopting constraints on the parameters based on some domain knowledge is a way of pruning this search space and learning the parameters more efficiently, both in terms of data needed and time required. Qualitative probabilistic networks [17] allow qualitative constraints on the parameter space to be specified by experts. For instance, the influence of one variable on another, or the combined influence of multiple variables on another variable [5] leads to linear inequalities on the parameters. Wittig and Jameson [18] explain how to transform the likelihood of violating qualitative constraints into a penalty term to adjust maximum likelihood, which allows gradient ascent and Expectation Maximization (EM) to take into account linear qualitative constraints. Other examples of qualitative constraints include some parameters being larger than others, bounded in a range, within ϵ of each other, etc. Various proposals have been made that exploit such constraints. Altendorf et al. [2] provide an approximate technique based on constrained convex optimization for parameter learning. Niculescu et al. [15] also provide a technique based on constrained optimization with closed form solutions for different classes of constraints. Feelders [6] provides an alternate method based on isotonic regression while Liao and Ji [12] combine gradient descent with EM. de Campos and Ji [4] also use constrained convex optimization, however, they use Dirichlet priors on the parameters to incorporate any additional knowledge. Mao and Lebanon [13] also use Dirichlet priors, but they use probabilistic constraints to allow inaccuracies in the specification of the constraints. A major difference between our technique and previous work is on the type of constraints. Our constraints do not need to be explicitly specified by an expert. Instead, we passively observe the expert and learn from what choices are made and not made [16]. Furthermore, as we shall show later, our constraints are non-convex, preventing the direct application of existing techniques that assume linear or convex functions. We use Beta priors on the parameters, which can easily be extended to Dirichlet priors like previous work. We incorporate constraints in an augmented Bayesian network, similar to Liang et al. [11], though their constraints are on model predictions as opposed to ours which are on the parameters of the network. Finally, we also use the notion of probabilistic constraints to handle potential mistakes made by experts. 3 3.1 BACKGROUND DIAGNOSTIC BAYES NETWORKS We consider the class of bipartite Bayes networks that are widely used as diagnostic models, though our approach can be used for networks with any structure. The network forms a sparse, directed, causal graph, where arcs go from causes to observable node variables. We use upper case to denote random variables; C for causes, and T for observables (tests). Lower case letters denote values in the domain of a variable, e.g. c ∈ dom(C) = {c, c}, and bold letters denote sets of variables. A ¯ set of marginally independent binary-valued node variables C with distributions Pr(C) represent unobserved causes, and condition the remaining conditionally independent binary-valued test vari2 able nodes T. Each cause conditions one or more tests; likewise each test is conditioned by one or more causes, resulting in a graph with one or more possibly multiply-connected components. The test variable distributions Pr(T |C) incorporate the further modeling assumption of Independence of Causal Influence, the most familiar example being the Noisy-Or model [8]. To keep the exposition simple, we assume that all variables are binary and that conditional distributions are parametrized by the Noisy-Or; however, the algorithms described in the rest of the paper generalize to any discrete non-binary variable models. Conventionally, unobserved tests are ranked in a diagnostic Bayes network by their Value Of Information (VOI) conditioned on tests already observed. To be precise, VOI is the expected gain in utility if the test were to be observed. The complete computation requires a model equivalent to a partially observable Markov decision process. Instead, VOI is commonly approximated by a greedy computation of the Mutual Information between a test and the set of causes [3]. In this case, it is easy to show that Mutual Information is in turn well approximated to second order by the Gini impurity [7] as shown in Equation 1. ] [∑ ∑ GI(C|T ) = Pr(T = t) Pr(C = c|T = t)(1 − Pr(C = c|T = t)) (1) t c We will use the Gini measure as a surrogate for VOI, as a way to rank the best next test in the diagnostic sequence. 3.2 MODEL CONSISTENCY A model that is consistent with an expert would generate Gini impurity rankings consistent with the expert’s diagnostic sequence. We interpret the expert’s test choices as implying constraints on Gini impurity rankings between tests. To that effect, [1] defines the notion of Cause Consistency and Test Consistency, which indicate whether the cause and test orderings induced by the posterior distribution over causes and the VOI of each test agree with an expert’s observed choice. Assuming that the expert greedily chooses the most informative test T ∗ (i.e., test that yields the lowest Gini impurity) at each step, then the model is consistent with the expert’s choices when the following constraints are satisfied: GI(C|T ∗ ) ≤ GI(C|Ti ) ∀i (2) We demonstrate next how to exploit these constraints to refine the Bayes network. 4 MODEL REFINEMENT Consider a simple diagnosis example with two possible causes C1 and C2 and two tests T1 and T2 as shown in Figure 1. To keep the exposition simple, suppose that the priors for each cause are known (generally separate data is available to estimate these), but the conditional distribution of each test is unknown. Using the Noisy-OR parameterizations for the conditional distributions, the number of parameters are linear in the number of parents instead of exponential. ∏ i i Pr(Ti = true|C) = 1 − (1 − θ0 ) (1 − θj ) (3) j|Cj =true i Here, θ0 = Pr(Ti = true|Cj = f alse ∀j) is the leak probability that Ti will be true when none of i the causes are true and θj = Pr(Ti = true|Cj = true, Ck = f alse ∀k ̸= j) is the link reliability, which indicates the independent contribution of cause Cj to the probability that test Ti will be true. In the rest of this section, we describe how to learn the θ parameters while respecting the constraints implied by test consistency. 4.1 TEST CONSISTENCY CONSTRAINTS Suppose that an expert chooses test T1 instead of test T2 during the diagnostic process. This ordering by the expert implies that the current model (parametrized by the θ’s) must be consistent with the constraint GI(C|T2 ) − GI(C|T1 ) ≥ 0. Using the definition of Gini impurity in Eq. 1, we can rewrite 3 Figure 1: Network with 2 causes and 2 tests Figure 2: Augmented network with parameters and constraints Figure 3: Augmented network extended to handle inaccurate feedback the constraint for the network shown in Fig. 1 as follows: ∑ t1 ( ∑ (Pr(t1 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 Pr(t1 ) − Pr(t1 ) c ,c 1 2 ) ( ) ∑ ∑ (Pr(t2 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 − Pr(t2 ) − ≥0 Pr(t2 ) t c ,c 2 1 2 (4) Furthermore, using the Noisy-Or encoding from Eq. 3, we can rewrite the constraint as a polynomial in the θ’s. This polynomial is non-linear, and in general, not concave. The feasible space may consist of disconnected regions. Fig. 4 shows the surface corresponding to the polynomial for the 2 1 i i case where θ0 = 0 and θ1 = 0.5 for each test i, which leaves θ2 and θ2 as the only free variables. The parameters’ feasible space, satisfying the constraint consists of the two disconnected regions where the surface is positive. 4.2 AUGMENTED BAYES NETWORK Our objective is to learn the θ parameters of diagnostic Bayes networks given test constraints of the form described in Eq. 4. To deal with non-convex constraints and disconnected feasible regions, we pursue a Bayesian approach whereby we explicitly model the parameters and constraints as random variables in an augmented Bayes network (see Fig. 2). This allows us to frame the problem of learning the parameters as an inference problem in a hybrid Bayes network of discrete (T, C, V ) and continuous (Θ) variables. As we will see shortly, this augmented Bayes network provides a unifying framework to simultaneously learn from constraints and data, to deal with possibly inconsistent constraints, and to express preferences over the degree of satisfaction of the constraints. We encode the constraint derived from the expert feedback as a binary random variable V in the Bayes network. If V is true the constraint is satisfied, otherwise it is violated. Thus, if V is true then Θ lies in the positive region of Fig. 4, and if V is f alse then Θ lies in the negative region. We model the CPT for V as Pr(V |Θ) = max(0, π), where π = GI(C|T1 ) − GI(C|T2 ). Note that the value of GI(C|T ) lies in the interval [0,1], so the probability π will always be normalized. The intuition behind this definition of the CPT for V is that a constraint is more likely to be satisfied if the parameters lie in the interior of the constraint region. We place a Beta prior over each Θ parameter. Since the test variables are conditioned on the Θ parameters that are now part of the network, their conditional distributions become known. For instance, the conditional distribution for Ti (given in Eq. 3) is fully defined given the noisy-or parami eters θj . Hence the problem of learning the parameters becomes an inference problem to compute posteriors over the parameters given that the constraint is satisfied (and any data). In practice, it is more convenient to obtain a single value for the parameters instead of a posterior distribution since it is easier to make diagnostic predictions based on one Bayes network. We estimate the parameters by computing a maximum a posteriori (MAP) hypothesis given that the constraint is satisfied (and any data): Θ∗ = arg maxΘ Pr(Θ|V = true). 4 Algorithm 1 Pseudo Code for Gibbs Sampling, Stochastic Hill Climbing and Greedy Search 1 Fix observed variables, let V = true and randomly sample feasible starting state S 2 for i = 1 to #samples 3 for j = 1 to #hiddenV ariables 4 acceptSample = f alse; k = 0 5 repeat 6 Sample s′ from conditional of j th hidden variable Sj 7 S′ = S; Sj = s′ 8 if Sj is cause or test, then acceptSample = true 9 elseif S′ obeys constraints V∗ 10 if algo == Gibbs 11 Sample u from uniform distribution, U(0,1) p(S′ 12 if u < M q(S)′ ) where p and q are the true and proposal distributions and M > 1 13 acceptSample = true 14 elseif algo = = StochasticHillClimbing 15 if likelihood(S′ ) > likelihood(S), then acceptSample = true 16 elseif algo = = Greedy, then acceptSample = true 17 elseif algo = = Greedy 18 k = k+1 19 if k = = maxIterations, then s′ = Sj ; acceptSample = true 20 until acceptSample = = true 21 Sj = s′ 4.3 MAP ESTIMATION Previous approaches for parameter learning with domain knowledge include modified versions of EM or some other optimization techniques that account for linear/convex constraints on the parameters. Since our constraints are non-convex, we propose a new approach based on Gibbs sampling to approximate the posterior distribution, from which we compute the MAP estimate. Although the technique converges to the MAP in the limit, it may require excessive time. Hence, we modify Gibbs sampling to obtain more efficient stochastic hill climbing and greedy search algorithms with anytime properties. The pseudo code for our Gibbs sampler is provided in Algorithm 1. The two key steps are sampling the conditional distributions of each variable (line 6) and rejection sampling to ensure that the constraints are satisfied (lines 9 and 12). We sample each variable given the rest according to the following distributions: ti ∼ Pr(Ti |c, θi ) ∀i cj ∼ Pr(Cj |c − cj , t, θ) ∝ ∏ Pr(Cj ) j ∏ (5) Pr(ti |c, θi ) ∀j i i θj ∼ Pr(Θi |Θ − Θi , t, c, v) ∝ Pr(v|t, Θ) j j ∏ Pr(ti |cj , θi ) ∀i, j (6) (7) i The tests and causes are easily sampled from the multinomials as described in the equations above. However, sampling the θ’s is more difficult due to the factor Pr(v|Θ, t) = max(0, π), which is a truncated mixture of Betas. So, instead of sampling θ from its true conditional, we sample it from a proposal distribution that replaces max(0, π) by an un-truncated mixture of Betas equal to π + a where a is a constant that ensures that π + a is always positive. This is equivalent to ignoring the constraints. Then we ensure that the constraints are satisfied by rejecting the samples that violate the constraints. Once Gibbs sampling has been performed, we obtain a sample that approximates the posterior distribution over the parameters given the constraints (and any data). We return a single setting of the parameters by selecting the sampled instance with the highest posterior probability (i.e., MAP estimate). Since we will only return the MAP estimate, it is possible to speed up the search by modifying Gibbs sampling. In particular, we obtain a stochastic hill climbing algorithm by accepting a new sample only if its posterior probability improves upon that of the previous sample 5 Posterior Probability 0.1 0.08 Difference in Gini Impurity 0.1 0.05 0 −0.05 0.06 0.04 0.02 −0.1 1 0 1 1 0.8 0.5 0.6 0.8 0.4 Link Reliability of Test 2 and Cause 1 0 0.6 0.2 0 0.4 Link Reliability of Test 2 and Cause 2 Figure 4: Difference in Gini impurity for the network in 1 2 Fig. 1 when θ2 and θ2 are the only parameters allowed to vary. 0.2 Link Reliability of Test 2 and Cause 1 0 0 0.2 0.4 0.6 0.8 1 Link Reliability of Test 2 and Cause 1 Figure 5: Posterior over parameters computed through calculation after discretization. Figure 6: Posterior over parameters calculated through Sampling. (line 15). Thus, each iteration of the stochastic hill climber requires more time, but always improves the solution. As the number of constraints grows and the feasibility region shrinks, the Gibbs sampler and stochastic hill climber will reject most samples. We can mitigate this by using a Greedy sampler that caps the number of rejected samples, after which it abandons the sampling for the current variable to move on to the next variable (line 19). Even though the feasibility region is small overall, it may still be large in some dimensions, so it makes sense to try sampling another variable (that may have a larger range of feasible values) when it is taking too long to find a new feasible value for the current variable. 4.4 MODEL REFINEMENT WITH INCONSISTENT CONSTRAINTS So far, we have assumed that the expert’s actions generate a feasible region as a consequence of consistent constraints. We handle inconsistencies by further extending our augmented diagnostic Bayes network. We treat the observed constraint variable, V , as a probabilistic indicator of the true constraint V ∗ as shown in Figure 3. We can easily extend our techniques for computing the MAP to cater for this new constraint node by sampling an extra variable. 5 EVALUATION AND EXPERIMENTS 5.1 EVALUATION CRITERIA Formally, for M ∗ , the true model that we aim to learn, the diagnostic process determines the choice of best next test as the one with the smallest Gini impurity. If the correct choice for the next test is known (such as demonstrated by an expert), we can use this information to include a constraint on the model. We denote by V+ the set of observed constraints and by V∗ the set of all possible constraints that hold for M ∗ . Having only observed V+ , our technique will consider any M + ∈ M+ as a possible true model, where M+ is the set of all models that obey V + . We denote by M∗ the set of all models that are diagnostically equivalent to M ∗ (i.e., obey V ∗ and would recommend the MAP same steps as M ∗ ) and by MV+ the particular model obtained by MAP estimation based on the MAP constraints V+ . Similarly, when a dataset D is available, we denote by MD the model obtained MAP by MAP estimation based on D and by MDV+ , the model based on D and V+ . Ideally we would like to find the true underlying model M ∗ , hence we will report the KL divergence between the models found and M ∗ . However, other diagnostically equivalent M ∗ may recommend the same tests as M ∗ and thus have similar constraints, so we also report test consistency with M ∗ (i.e., # of recommended tests that are the same). 5.2 CORRECTNESS OF MODEL REFINEMENT Given V∗ , our technique for model adjustment is guaranteed to choose a model M MAP ∈ M∗ by construction. If any constraint V ∗ ∈ V∗ is violated, the rejection sampling step of our technique 6 100 Comparing convergence of Different Techniques 80 70 60 50 40 30 Data Only Constraints Only Data+Constraints 20 10 0 1 2 3 4 5 Number of constraints used 6 −10 −12 −14 −16 −18 7 −20 Figure 7: Mean KLdivergence and one standard deviation for a 3 cause 3 test network on learning with data, constraints and data+constraints. Gibbs Sampling Stochastic Hill Climbing Greedy Sampling −8 Negative Log Likelihood of MAP Estimate Percentage of tests correctly predicted 90 0 1 2 3 10 10 10 10 Elapsed Time (plotted on log scale from 0 to 1500 seconds) Figure 8: Test Consistency for a 3 cause 3 test network on learning with data, constraints and data+constraints. Figure 9: Convergence rate comparison. would reject that set of parameters. To illustrate this, consider the network in Fig. 2. There are six parameters (four link reliabilities and two leak parameters). Let us fix the leak parameters and the link reliability from the first cause to each test. Now we can compute the posterior surface over the two variable parameters after discretizing each parameter in small steps and then calculating the posterior probability at each step as shown in Fig. 5. We can compare this surface with that obtained after Gibbs sampling using our technique as shown in Fig. 6. We can see that our technique recovers the posterior surface from which we can compute the MAP. We obtain the same MAP estimate with the stochastic hill climbing and greedy search algorithms. 5.3 EXPERIMENTAL RESULTS ON SYNTHETIC PROBLEMS We start by presenting our results on a 3-cause by 3-test fully-connected bipartite Bayes network. We assume that there exists some M ∗ ∈ M∗ that we want to learn given V+ . We use our technique to find M MAP . To evaluate M MAP , we first compute the constraints, V∗ for M ∗ to get the feasible region associated with the true model. Next, we sample 100 other models from this feasible region that are diagnostically equivalent. We compare these models with M MAP (after collecting 200 samples with non-informative priors for the parameters). We compute the KL-divergence of M MAP with respect to each sampled model. We expect KLdivergence to decrease as the number of constraints in V+ increases since the feasible region beMAP comes smaller. Figure 7 confirms this trend and shows that MDV+ has lower mean KL-divergence MAP MAP than MV+ , which has lower mean KL-divergence than MD . The data points in D are limited to the results of the diagnostic sessions needed to obtain V+ . As constraints increase, more data is available and so the results for the data-only approach also improve with increasing constraints. We also compare the test consistency when learning from data only, constraints only or both. Given a fixed number of constraints, we enumerate the unobserved trajectories, and then compute the highest ranked test using the learnt model and the sampled true models, for each trajectory. The test consistency is reported as a percentage, with 100% consistency indicating that the learned and true models had the same highest ranked tests on every trajectory. Figure 8 presents these percentatges for the greedy sampling technique (the results are similar for the other techniques). It again appears that learning parameters with both constraints and data is better than learning with only constraints, which is most of the times better than learning with only data. Figure 9 compares the convergence rate of each technique to find the MAP estimate. As expected, Stochastic Hill Climbing and Greedy Sampling take less time than Gibbs sampling to find parameter settings with high posterior probability. 5.4 EXPERIMENTAL RESULTS ON REAL-WORLD PROBLEMS We evaluate our technique on a real-world diagnostic network collected and reported by Agosta et al. [1], where the authors collected detailed session logs over a period of seven weeks in which the 7 KL−divergence of when computing joint over all tests 8 Figure 10: Diagnostic Bayesian network collected from user trials and pruned to retain sub-networks with at least one constraint Data Only Constraints Only Data+Constraints 7 6 5 4 3 2 1 6 8 10 12 14 16 Number of constraints used 18 20 22 Figure 11: KL divergence comparison as the number of constraints increases for the real world problem. entire diagnostic sequence was recorded. The sequences intermingle model building and querying phases. The model network structure was inferred from an expert’s sequence of positing causes and tests. Test-ranking constraints were deduced from the expert’s test query sequences once the network structure is established. The 157 sessions captured over the seven weeks resulted in a Bayes network with 115 tests, 82 root causes and 188 arcs. The network consists of several disconnected sub-networks, each identified with a symptom represented by the first test in the sequence, and all subsequent tests applied within the same subnet. There were 20 sessions from which we were able to observe trajectories with at least two tests, resulting in a total of 32 test constraints. We pruned our diagnostic network to remove the sub-networks with no constraints to get a Bayes network with 54 tests, 30 root causes, and 67 parameters divided in 7 sub-networks, as shown in Figure 10, on which we apply our model refinement technique to learn the parameters for each sub-network separately. Since we don’t have the true underlying network and the full set of constraints (more constraints could be observed in future diagnostic sessions), we treated the 32 constraints as if they were V∗ and the corresponding feasible region M∗ as if it contained models diagnostically equivalent to the unknown true model. Figure 11 reports the KL divergence between the models found by our algorithms and sampled models from M∗ as we increase the number of constraints. With such limited constraints and consequently large feasible regions, it is not surprising that the variation in KL divergence is large. Again, the MAP estimate based on both the constraints and the data has lower KL divergence than constraints only and data only. 6 CONCLUSION AND FUTURE WORK In summary, we presented an approach that can learn the parameters of a Bayes network based on constraints implied by test consistency and any data available. While several approaches exist to incorporate qualitative constraints in learning procedures, our work makes two important contributions: First, this is the first approach that exploits implicit constraints based on value of information assessments. Secondly it is the first approach that can handle non-convex constraints. We demonstrated the approach on synthetic data and on a real-world manufacturing diagnostic problem. Since data is generally sparse in diagnostics, this work makes an important advance to mitigate the model acquisition bottleneck, which has prevented the widespread application of diagnostic networks so far. In the future, it would be interesting to generalize this work to reinforcement learning in applications where data is sparse, but constraints may be inferred from expert interactions. Acknowledgments This work was supported by a grant from Intel Corporation. 8 References [1] John Mark Agosta, Omar Zia Khan, and Pascal Poupart. Evaluation results for a query-based diagnostics application. In The Fifth European Workshop on Probabilistic Graphical Models (PGM 10), Helsinki, Finland, September 13–15 2010. [2] Eric E. Altendorf, Angelo C. Restificar, and Thomas G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of Twenty First Conference on Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 2005. [3] Brigham S. Anderson and Andrew W. Moore. Fast information value for graphical models. In Proceedings of Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 51–58, Vancouver, BC, Canada, December 2005. [4] Cassio P. de Campos and Qiang Ji. Improving Bayesian network parameter learning using constraints. In International Conference in Pattern Recognition (ICPR), Tampa, FL, USA, 2008. [5] Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks: combining qualitative and quantitative information. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 141–148, Montreal, QC, Canada, 1995. [6] Ad J. Feelders. A new parameter learning method for Bayesian networks with qualitative influences. In Proceedings of Twenty Third International Conference on Uncertainty in Artificial Intelligence (UAI), Vancouver, BC, July 2007. [7] Mara Angeles Gil and Pedro Gil. A procedure to test the suitability of a factor for stratification in estimating diversity. Applied Mathematics and Computation, 43(3):221 – 229, 1991. [8] David Heckerman and John S. Breese. Causal independence for probability assessment and inference using bayesian networks. IEEE Systems, Man, and Cybernetics, 26(6):826–831, November 1996. [9] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–56, 1995. [10] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, August 1966. [11] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential families. In Proceedings of Twenty Sixth Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009. [12] Wenhui Liao and Qiang Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42:3046–3056, 2009. [13] Yi Mao and Guy Lebanon. Domain knowledge uncertainty and probabilistic parameter constraints. In Proceedings of Twenty Fifth Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, QC, Canada, 2009. [14] Ryszard S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111–116, 1984. [15] Radu Stefan Niculescu, Tom M. Mitchell, and R. Bharat Rao. Bayesian network learning with parameter constraints. Journal of Machine Learning Research, 7:1357–1383, 2006. [16] Mark A. Peot and Ross D. Shachter. Learning from what you dont observe. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 439–446, Madison, WI, July 1998. [17] Michael P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44(3):257–303, August 1990. [18] Frank Wittig and Anthony Jameson. Exploiting qualitative knowledge in the learning of conditional probabilities of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, July 2000. 9
5 0.4313511 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama
Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach. 1
6 0.42881536 232 nips-2011-Ranking annotators for crowdsourced labeling tasks
7 0.42558971 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
8 0.42420796 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
9 0.41029006 150 nips-2011-Learning a Distance Metric from a Network
10 0.40282184 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
11 0.38194209 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
12 0.37593663 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
13 0.37560672 143 nips-2011-Learning Anchor Planes for Classification
14 0.3682614 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression
15 0.35935378 175 nips-2011-Multi-Bandit Best Arm Identification
16 0.35139319 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
17 0.35081857 101 nips-2011-Gaussian process modulated renewal processes
18 0.34984043 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
19 0.34569132 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
20 0.34501284 69 nips-2011-Differentially Private M-Estimators
topicId topicWeight
[(0, 0.021), (4, 0.048), (20, 0.018), (23, 0.332), (26, 0.024), (31, 0.086), (33, 0.012), (43, 0.06), (45, 0.075), (56, 0.011), (57, 0.036), (74, 0.065), (83, 0.037), (99, 0.059)]
simIndex simValue paperId paperTitle
1 0.78441525 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games
Author: Richard G. Gibson, Duane Szafron
Abstract: Computing a good strategy in a large extensive form game often demands an extraordinary amount of computer memory, necessitating the use of abstraction to reduce the game size. Typically, strategies from abstract games perform better in the real game as the granularity of abstraction is increased. This paper investigates two techniques for stitching a base strategy in a coarse abstraction of the full game tree, to expert strategies in fine abstractions of smaller subtrees. We provide a general framework for creating static experts, an approach that generalizes some previous strategy stitching efforts. In addition, we show that static experts can create strong agents for both 2-player and 3-player Leduc and Limit Texas Hold’em poker, and that a specific class of static experts can be preferred among a number of alternatives. Furthermore, we describe a poker agent that used static experts and won the 3-player events of the 2010 Annual Computer Poker Competition.
same-paper 2 0.72709012 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
Author: Duy Q. Vu, David Hunter, Padhraic Smyth, Arthur U. Asuncion
Abstract: The development of statistical models for continuous-time longitudinal network data is of increasing interest in machine learning and social science. Leveraging ideas from survival and event history analysis, we introduce a continuous-time regression modeling framework for network event data that can incorporate both time-dependent network statistics and time-varying regression coefficients. We also develop an efficient inference scheme that allows our approach to scale to large networks. On synthetic and real-world data, empirical results demonstrate that the proposed inference approach can accurately estimate the coefficients of the regression model, which is useful for interpreting the evolution of the network; furthermore, the learned model has systematically better predictive performance compared to standard baseline methods.
3 0.5584448 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
4 0.4456082 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
5 0.44346943 102 nips-2011-Generalised Coupled Tensor Factorisation
Author: Kenan Y. Yılmaz, Ali T. Cemgil, Umut Simsekli
Abstract: We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie’s distributions corresponding to βdivergences. By bounding the step size of the Fisher Scoring iteration of the GLM, we obtain general updates for real data and multiplicative updates for non-negative data. The GTF framework is, then extended easily to address the problems when multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem. 1
6 0.44145894 186 nips-2011-Noise Thresholds for Spectral Clustering
7 0.44114763 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
8 0.43861651 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
9 0.43852463 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
10 0.43783671 66 nips-2011-Crowdclustering
11 0.43770289 75 nips-2011-Dynamical segmentation of single trials from population neural data
12 0.43733832 276 nips-2011-Structured sparse coding via lateral inhibition
13 0.43671104 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
14 0.43633744 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
15 0.43594062 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
16 0.43438715 180 nips-2011-Multiple Instance Filtering
17 0.43413097 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
18 0.43405727 301 nips-2011-Variational Gaussian Process Dynamical Systems
19 0.4329944 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
20 0.43155602 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines