nips nips2006 nips2006-96 knowledge-graph by maker-knowledge-mining

96 nips-2006-In-Network PCA and Anomaly Detection


Source: pdf

Author: Ling Huang, Xuanlong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, Nina Taft

Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract We consider the problem of network anomaly detection in large distributed systems. [sent-14, score-0.609]

2 In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. [sent-15, score-0.481]

3 To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. [sent-18, score-0.879]

4 Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network. [sent-19, score-0.673]

5 Consider, for example, the problem of detecting anomalies in a wide-area network. [sent-25, score-0.315]

6 While it is straightforward to embed learning algorithms at local nodes to attempt to detect node-level anomalies, these anomalies may not be indicative of network-level problems. [sent-26, score-0.338]

7 They showed that the minor components of PCA (the subspace obtained after removing the components with largest eigenvalues) revealed anomalies that were not detectable in any single node-level trace. [sent-28, score-0.364]

8 Such a solution cannot scale either for networks with a large number of monitors nor for networks seeking to track and detect anomalies at very small time scales. [sent-30, score-0.551]

9 The key underlying problem is that of developing a mathematical understanding of how to trade off quantization arising from local data filtering against fidelity of the detection analysis. [sent-33, score-0.225]

10 We also need to understand how this tradeoff impacts overall detection accuracy. [sent-34, score-0.235]

11 In this paper, we present a simple algorithmic framework for network-wide anomaly detection that relies on distributed tracking combined with approximate PCA analysis, together with supporting theoretical analysis. [sent-36, score-0.612]

12 In brief, the architecture involves a set of local monitors that maintain parameterized sliding filters. [sent-37, score-0.274]

13 The coordinator makes global decisions based on these quantized data streams. [sent-39, score-0.493]

14 We use stochastic matrix perturbation theory to both assess the impact of quantization on the accuracy of anomaly detection, and to design a method that selects filter parameters in a way that bounds the detection error. [sent-40, score-0.629]

15-16 The combination of our theoretical tools and local filtering strategies results in an in-network tracking algorithm that can achieve high detection accuracy with low communication overhead; for instance, our experiments show that, by choosing a relative eigen-error of 1.5% (yielding, approximately, a 4% missed detection rate and a 6% false alarm rate), we can filter out more than 90% of the traffic from the original signal. [sent-41, score-0.424] [sent-42, score-0.66]

17 [8] has been extended by [17], who show how to infer network anomalies in both spatial and temporal domains. [sent-45, score-0.33]

18 Other initiatives in distributed monitoring, profiling and anomaly detection aim to share information and foster collaboration between widely distributed monitoring boxes to offer improvements over isolated systems [12, 16]. [sent-49, score-0.711]

19 Work in [2, 10] posits the need for scalable detection of network attacks and intrusions. [sent-50, score-0.275]

20 2 Problem description and background We consider a monitoring system comprising a set of local monitor nodes M1, . . . , Mn. [sent-54, score-0.308]

21 For instance, the monitors may collect information on the number of TCP connection requests per second, the number of DNS transactions per minute, or the volume of traffic at port 80 per second. [sent-59, score-0.245]

22 A central coordinator node aims to continuously monitor the global collection of time series, and make global decisions such as those concerning matters of network-wide health. [sent-60, score-0.779]

23 A volume anomaly refers to unusual traffic load levels in a network that are caused by anomalies such as worms, distributed denial of service attacks, device failures, misconfigurations, and so on. [sent-62, score-0.735]

24 Each monitor collects a new data point at every time step and, assuming a naive, “continuous push” protocol, sends the new point to the coordinator. [sent-63, score-0.255]

25 Based on these updates, the coordinator keeps track of a sliding time window of size m (i. [sent-64, score-0.481]

26 (i.e., the m most recent data points) for each monitor time series, organized into a matrix Y of size m × n, where the ith column Yi captures the data from monitor i. [sent-66, score-0.386]

27 The coordinator then makes its decisions based solely on this (global) Y matrix. [sent-68, score-0.439]
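
To make the data layout concrete, here is a minimal sketch (not from the paper) of how a coordinator could maintain the m × n sliding-window matrix Y from per-monitor updates; the class and method names are our own.

```python
import numpy as np
from collections import deque

class SlidingWindowCoordinator:
    """Keep the m most recent values from each of n monitors as an m x n matrix Y."""

    def __init__(self, n_monitors, window_m):
        # one bounded buffer of recent values per monitor
        self.buffers = [deque(maxlen=window_m) for _ in range(n_monitors)]

    def push(self, monitor_id, value):
        """Record a new data point reported by monitor `monitor_id`."""
        self.buffers[monitor_id].append(float(value))

    def data_matrix(self):
        """Return Y, with column i holding the sliding window of monitor i."""
        m = min(len(buf) for buf in self.buffers)   # align to the shortest window
        cols = [np.array(list(buf)[-m:]) for buf in self.buffers]
        return np.column_stack(cols)
```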

28 In the network-wide volume anomaly detection algorithm of [8], the local monitors measure the total volume of traffic (in bytes) on each network link and periodically send these measurements to the coordinator. [sent-69, score-0.854]

29 The coordinator then performs PCA on the assembled Y matrix to detect volume anomalies. [sent-72, score-0.471]

30 However, such a "periodic push" approach suffers from inherent limitations: to ensure fast detection, the update periods should be relatively small; unfortunately, small periods also imply increased monitoring communication overheads, which may very well be unnecessary. [sent-74, score-0.277]

31 Instead, in our work, we study how the monitors can effectively filter their time-series updates, sending as little data as possible, yet enough so as to allow the coordinator to make global decisions accurately. [sent-77, score-0.696]

32 Figure 1: (a) The distributed monitoring system setup; (b) a data sample (y2) from the Abilene network traffic data, collected over one week (top), and its projection onto the residual subspace (bottom). [sent-81, score-0.379]

33 As noted in [8], due to the high level of traffic aggregation on ISP backbone links, volume anomalies can often go unnoticed by being "buried" within normal traffic patterns. [sent-85, score-0.402]

34 This subspace is referred to as the normal traffic subspace Sno . [sent-95, score-0.234]

35 The remaining (n − k) principal components constitute the abnormal traffic subspace Sab. [sent-96, score-0.254]

36 Mathematically, yno(t) and yab(t) can be computed as yno(t) = PP^T y(t) = Cno y(t) and yab(t) = (I − PP^T)y(t) = Cab y(t), where P = [v1, v2, . . . , vk] collects the first k principal components. [sent-98, score-0.3]

37 The matrix Cno = PP^T represents the linear operator that performs projection onto the normal subspace Sno, and Cab = I − PP^T projects onto the abnormal subspace Sab. [sent-102, score-0.345]

38 As observed in [8], a volume anomaly typically results in a large change to yab; thus, a useful metric for detecting abnormal traffic patterns is the squared prediction error (SPE): SPE ≡ ||yab||^2 = ||Cab y||^2 (essentially, a quadratic residual function). [sent-103, score-0.634]

39 More formally, their proposed algorithm signals a volume anomaly if SPE > Qα , where Qα denotes the threshold statistic for the SPE residual function at the 1 − α confidence level. [sent-104, score-0.46]
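
A rough sketch of the detection rule described in sentences 34-39: build the residual-subspace projector from the top-k principal components of Y, compute the squared prediction error of a measurement vector, and flag an anomaly when it exceeds the threshold. The mean-centering step and the externally supplied Q_alpha are our assumptions, not the authors' exact procedure.

```python
import numpy as np

def residual_projector(Y, k):
    """Return C_ab = I - P P^T, where P holds the top-k principal directions of Y."""
    Yc = Y - Y.mean(axis=0)                      # center each monitor's column
    # rows of Vt are the principal directions (eigenvectors of Yc^T Yc)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    P = Vt[:k].T                                 # n x k matrix of top-k directions
    n = Y.shape[1]
    return np.eye(n) - P @ P.T

def spe(C_ab, y):
    """Squared prediction error ||C_ab y||^2 of one measurement vector y (length n)."""
    r = C_ab @ y
    return float(r @ r)

def detect(C_ab, y, Q_alpha):
    """Signal a volume anomaly if the SPE exceeds the threshold Q_alpha."""
    return spe(C_ab, y) > Q_alpha
```

In the scheme of [8], Q_alpha would be the SPE threshold statistic at the 1 − α confidence level; here it is simply a parameter passed in.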

40 Figure 2: Our in-network tracking and detection framework (local filters with slacks δ1, . . . , δn feeding the coordinator). [sent-118, score-0.249]

41 3 In-network PCA for anomaly detection We now describe our version of an anomaly detector that uses distributed tracking and approximate PCA analysis. [sent-119, score-0.94]

42 A key idea is to curtail the amount of data each monitor sends to the coordinator. [sent-120, score-0.288]

43 Because our job is to catch anomalies rather than to track ongoing state, the coordinator only needs a good approximation of the state when an anomaly is near. [sent-121, score-0.731]

44 This observation makes it intuitive that a reduction in data sharing between monitors and the coordinator should be possible. [sent-123, score-0.603]

45 We curtail the amount of data flow from monitors to the coordinator by installing local filters at each monitor. [sent-124, score-0.67]

46 These filters maintain a local constraint, and a monitor only sends the coordinator an update of its data when the constraint is violated. [sent-125, score-0.721]

47 The coordinator thus receives an approximate, or “perturbed,” view of the data stream at each monitor and hence of the global state. [sent-126, score-0.647]

48 We use stochastic matrix perturbation theory to analyze the effect on our PCA-based anomaly detector of using a perturbed global matrix. [sent-127, score-0.596]

49 (i.e., the local constraints) so as to limit the effect of the perturbation on the PCA analysis and on any deterioration in the anomaly detector's performance. [sent-130, score-0.472]

50 The goal of a monitor is to track its local raw time-series data, and to decide when the coordinator needs an update. [sent-136, score-0.671]

51 Intuitively, if the time series does not change much, or doesn’t change in a way that affects the global condition being tracked, then the monitor does not send anything to the coordinator. [sent-137, score-0.31]

52 The coordinator assumes that the most recently received update is still approximately valid. [sent-138, score-0.432]

53 The update serves as a prediction of future data: if the monitor sends nothing in subsequent time intervals, the coordinator uses the most recently received update to predict the missing values. [sent-140, score-0.72]

54 For our anomaly detection application, we filter as follows. [sent-141, score-0.478]

55 At each time t, the monitor sends both Yi(t) and Ri(t) to the coordinator only if Yi(t) ∉ Fi; otherwise it sends nothing. [sent-145, score-0.717]

56 The window parameter δi is called the slack; it captures the amount the time series can drift before an update to the coordinator needs to be sent. [sent-146, score-0.432]

57 The monitor needs to send both Yi(t∗) and Ri(t∗) to the coordinator when it does an update, because the coordinator will use Yi(t∗) at time t∗ and Ri(t∗) for all t > t∗ until the next update arrives. [sent-150, score-1.088]

58 For any subsequent t > t∗ when the coordinator receives no update from that monitor, it will use Ri (t∗ ) as the prediction for Yi (t). [sent-151, score-0.432]
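
The filtering protocol of sentences 55-58 can be sketched as follows; the choice of window centre and of the prediction Ri(t∗) (here, simply the last reported value) are illustrative assumptions rather than the paper's exact filter.

```python
class MonitorFilter:
    """Local filter: report only when the new value leaves the slack window."""

    def __init__(self, delta):
        self.delta = delta        # slack: allowed drift before an update is sent
        self.center = None        # value around which the filter window is placed

    def observe(self, y_t):
        """Return (y_t, prediction) to send, or None if the filter suppresses it."""
        if self.center is not None and abs(y_t - self.center) <= self.delta:
            return None           # still inside the window F_i: send nothing
        self.center = y_t         # constraint violated: re-center on the new value
        prediction = y_t          # simple choice of R_i(t*): the value just sent
        return (y_t, prediction)

# usage: feed a stream of readings through the filter
f = MonitorFilter(delta=2.0)
for y in [10.0, 10.5, 11.9, 13.1, 12.8]:
    update = f.observe(y)
    if update is not None:
        print("send to coordinator:", update)
```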

59 (i.e., the slacks δi) for all the monitors based on its view of the global state and the condition for triggering an anomaly. [sent-156, score-0.29]

60 It gives the monitors their slacks initially and updates the value of their slack parameters when needed. [sent-157, score-0.404]

61 The global detection task is the same as in the centralized scheme. [sent-160, score-0.307]

62 In contrast to the centralized setting, however, the coordinator does not have an exact version of the raw data matrix Y; it has the approximation Ŷ instead. [sent-161, score-0.462]

63 The magnitude of the perturbation matrix ∆ is determined by the slack variables δi (i = 1, . . . , n). [sent-163, score-0.29]

64 This choice is critical because these parameters balance the tradeoff between the savings in data communication and the loss of detection accuracy. [sent-169, score-0.335]

65 Clearly, the larger the slack, the less the monitor needs to send, thus leading to both more reduction in communication overhead and potentially more information loss at the coordinator. [sent-170, score-0.374]

66 We employ stochastic matrix perturbation theory to quantify the effects of the perturbation of a matrix on key quantities such as eigenvalues and the eigen-subspaces, which in turn affect the detection accuracy. [sent-171, score-0.555]

67 We derive an upper bound on the changes to the eigenvalues λi and the residual subspace Cab as a function of ||∆||. [sent-174, score-0.297]
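
As an informal check of the sensitivity that sentences 66-67 bound analytically, the snippet below perturbs a random data matrix by a filtering-error matrix ∆ with entries in [−δ, δ] and compares the eigenvalues of Y^T Y before and after. It is an empirical illustration only, not the paper's derivation; the sizes and slack value are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, delta = 200, 10, 0.05

Y = rng.standard_normal((m, n))
Delta = rng.uniform(-delta, delta, size=(m, n))   # filtering error, |entries| <= delta

eig = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]
eig_hat = np.sort(np.linalg.eigvalsh((Y + Delta).T @ (Y + Delta)))[::-1]

rel_err = np.abs(eig_hat - eig) / eig
print("spectral norm of Delta:", np.linalg.norm(Delta, 2))
print("relative eigen-errors: ", np.round(rel_err, 4))
```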

68 Controlling these latter terms, we are able to bound the false alarm probability. [sent-177, score-0.376]

69 Given the tolerable perturbation TolF, we can use Eqn. [sent-209, score-0.223]

70 For example, we can divide the overall tolerance across monitors either uniformly or in proportion to their observed local variance. [sent-211, score-0.269]
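
Sentence 70 mentions two simple policies for splitting the overall tolerance across monitors. The sketch below assumes the per-monitor budgets simply sum to the total; the actual mapping from tolerance to the slacks δi goes through the paper's equations, which are not reproduced here.

```python
import numpy as np

def uniform_slacks(total_tolerance, n_monitors):
    """Give every monitor the same share of the overall tolerance."""
    return np.full(n_monitors, total_tolerance / n_monitors)

def variance_proportional_slacks(total_tolerance, Y):
    """Give each monitor a share proportional to the variance of its column of Y."""
    var = Y.var(axis=0)
    return total_tolerance * var / var.sum()
```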

71 3 Guarantee on false alarm probability Because our approximation perturbs the eigenvalues, it also impacts the accuracy with which the trigger is fired. [sent-213, score-0.39]

72 We can compute an upper bound on the perturbation of the SPE statistic, SPE = ||Cab y||^2, as follows. [sent-215, score-0.219]

73 To assess the perturbation in false alarm probability, we start by considering the following random variable c derived from Eqn. [sent-222, score-0.498]

74 The false alarm probability in the centralized system is expressed as Pr[||Cab y||^2 > Qα] = Pr[c > cα] = α, where the left-hand term of this equation is conditioned upon the SPE statistic being inside the normal range. [sent-225, score-0.465]

75 In our distributed setting, the anomaly detector fires a trigger if the perturbed statistic ||Ĉab ŷ||^2 exceeds the perturbed threshold Q̂α. [sent-226, score-0.447]

76 The deviation of the false alarm probability in our approximate detection scheme can then be approximated as P(cα − ηc < U < cα + ηc), where U is a standard normal random variable. [sent-229, score-0.594]
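
The approximation in sentence 76 is straightforward to evaluate once cα and ηc are known; a small sketch follows, with an arbitrary ηc chosen purely for illustration.

```python
from scipy.stats import norm

def false_alarm_deviation(c_alpha, eta_c):
    """P(c_alpha - eta_c < U < c_alpha + eta_c) for U ~ N(0, 1)."""
    return norm.cdf(c_alpha + eta_c) - norm.cdf(c_alpha - eta_c)

# e.g., for alpha = 0.005, c_alpha is the 1 - alpha quantile of the standard normal
print(false_alarm_deviation(norm.ppf(1 - 0.005), eta_c=0.1))
```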

77 There are 7 anomalies in the data that were detected by the centralized algorithm (and verified by hand to be true anomalies). [sent-235, score-0.337]

78 We also injected 70 synthetic anomalies into this dataset using the method described in [8], so that we would have sufficient data to compute error rates. [sent-236, score-0.275]

79 Given this parameter and the input data, we can compute the filtering slack δ for the monitors using Eqn. [sent-243, score-0.342]

80 The simulator outputs a set of results including: 1) the actual relative eigen errors and the relative errors on the detection threshold Qα ; 2) the missed detection rate, false alarm rate and communication cost achieved by our method. [sent-307, score-1.156]

81 The missed-detection rate is defined as the fraction of missed detections over the total number of real anomalies, and the false-alarm rate as the fraction of false alarms over the total number of anomalies detected by our protocol, which is α (defined in Sec. [sent-308, score-0.527]
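
The two evaluation metrics of sentence 81 can be computed from boolean detection outcomes as follows; the function and argument names are ours.

```python
import numpy as np

def detection_rates(detected, is_real):
    """detected, is_real: boolean arrays over candidate anomaly events.

    Missed-detection rate: fraction of real anomalies that were not detected.
    False-alarm rate: fraction of raised alarms that were not real anomalies.
    """
    detected, is_real = np.asarray(detected), np.asarray(is_real)
    missed = np.sum(is_real & ~detected) / max(np.sum(is_real), 1)
    false_alarm = np.sum(detected & ~is_real) / max(np.sum(detected), 1)
    return missed, false_alarm
```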

82 In Fig. 3(a) we plot the relationship between the relative eigen-error and the filtering slack δ, assuming the filtering errors are uniformly distributed on the interval [−δ, δ]. [sent-316, score-0.256]

83 As we increase our error tolerance, we can filter more at the monitor and send less to the coordinator. [sent-320, score-0.256]

84 In other words, the achieved eigen-error always remains below the requested tolerable error specified as input, and the slack chosen given the tolerable error is close to being optimal. [sent-328, score-0.283]

85 Fig. 3(c) shows the relationship between the relative eigen-error and the relative error of the detection threshold Qα. [sent-330, score-0.326]

86 We see that the threshold for detecting anomalies decreases as we tolerate more and more eigen-errors. [sent-331, score-0.397]

87 We now examine the false alarm rates achieved. [sent-333, score-0.387]

88 In Fig. 3(d), the curve with triangles represents the upper bound on the false alarm rate as estimated by the coordinator. [sent-335, score-0.444]

89 The curve with circles is the actual accrued false alarm rate achieved by our scheme. [sent-336, score-0.448]

90 Note that the upper bound on the false alarm rate is fairly close to the true values, especially when the slack is small. [sent-337, score-0.583]

91 The false alarm rate increases with increasing eigen-error because as the eigen-error increases, the corresponding detection threshold Qα will decrease, which in turn causes the protocol to raise an alarm more ˆ often. [sent-338, score-0.915]

92 We see in Fig. 3(e) that the missed detection rates remain below 4% for various levels of communication overhead. [sent-345, score-0.384]

93 This gain is achieved at the cost of approximately a 4% missed detection rate and a 6% false alarm rate. [sent-352, score-0.66]

94 This is a large reduction in communication for a small increase in detection error. [sent-353, score-0.291]

95 These initial results illustrate that our in-network solution can dramatically lower the communication overhead while still achieving high detection accuracy. [sent-354, score-0.372]

96 5 Conclusion We have presented a new algorithmic framework for network anomaly detection that combines distributed tracking with PCA analysis to detect anomalies with far less data than previous methods. [sent-355, score-0.971]

97 The distributed tracking consists of local filters, installed at each monitoring site, whose parameters are selected based upon global criteria. [sent-356, score-0.303]

98 The local filtering reduces the amount of data transmitted through the network but also means that anomaly detection must be done with limited or partial views of the global state. [sent-358, score-0.621]

99 Using methods from stochastic matrix perturbation theory, we provided an analysis for the tradeoff between the detection accuracy and the data communication overhead. [sent-359, score-0.486]

100 To the best of our knowledge, this is the first result in the literature that provides upper bounds on the false alarm rate of network anomaly detection. [sent-361, score-0.757]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('coordinator', 0.4), ('cab', 0.333), ('anomaly', 0.287), ('anomalies', 0.275), ('alarm', 0.246), ('monitors', 0.203), ('monitor', 0.193), ('traf', 0.193), ('detection', 0.191), ('perturbation', 0.151), ('slack', 0.139), ('spe', 0.133), ('false', 0.101), ('communication', 0.1), ('pca', 0.095), ('missed', 0.093), ('subspace', 0.089), ('abilene', 0.083), ('yab', 0.083), ('overhead', 0.081), ('monitoring', 0.081), ('residual', 0.078), ('ltering', 0.077), ('distributed', 0.076), ('accrued', 0.072), ('tolerable', 0.072), ('yno', 0.067), ('send', 0.063), ('perturbed', 0.063), ('eigenvalues', 0.062), ('centralized', 0.062), ('sends', 0.062), ('principal', 0.061), ('tracking', 0.058), ('normal', 0.056), ('network', 0.055), ('ab', 0.055), ('global', 0.054), ('threshold', 0.053), ('ri', 0.051), ('ppt', 0.05), ('sab', 0.05), ('sno', 0.05), ('taft', 0.05), ('abnormal', 0.049), ('protocol', 0.049), ('links', 0.046), ('track', 0.044), ('tradeoff', 0.044), ('olf', 0.043), ('trigger', 0.043), ('volume', 0.042), ('relative', 0.041), ('berkeley', 0.041), ('detector', 0.041), ('detecting', 0.04), ('security', 0.04), ('eig', 0.04), ('eigen', 0.04), ('ling', 0.04), ('decisions', 0.039), ('upper', 0.039), ('continuously', 0.039), ('sliding', 0.037), ('mn', 0.034), ('local', 0.034), ('akhina', 0.033), ('ccs', 0.033), ('cno', 0.033), ('curtail', 0.033), ('fri', 0.033), ('guyen', 0.033), ('iot', 0.033), ('isp', 0.033), ('lakhina', 0.033), ('mon', 0.033), ('ordan', 0.033), ('rovella', 0.033), ('slacks', 0.033), ('thu', 0.033), ('tue', 0.033), ('wed', 0.033), ('xuanlong', 0.033), ('lters', 0.033), ('update', 0.032), ('periods', 0.032), ('infrastructure', 0.032), ('tolerance', 0.032), ('onto', 0.031), ('simulator', 0.03), ('rate', 0.029), ('tolerate', 0.029), ('attacks', 0.029), ('backbone', 0.029), ('feed', 0.029), ('od', 0.029), ('sat', 0.029), ('updates', 0.029), ('detect', 0.029), ('bound', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 96 nips-2006-In-Network PCA and Anomaly Detection

Author: Ling Huang, Xuanlong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, Nina Taft

Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.

2 0.22406155 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization

Author: Alfred O. Hero

Abstract: We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high dimensional feature spaces. 1

3 0.11405683 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

Author: Silvia Richter, Douglas Aberdeen, Jin Yu

Abstract: Current road-traffic optimisation practice around the world is a combination of hand tuned policies with a small degree of automatic adaption. Even state-ofthe-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural-actor critic approaches to work for distributed and online infinite-horizon problems. 1

4 0.08032313 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1

5 0.079138756 69 nips-2006-Distributed Inference in Dynamical Systems

Author: Stanislav Funiak, Carlos Guestrin, Rahul Sukthankar, Mark A. Paskin

Abstract: We present a robust distributed algorithm for approximate probabilistic inference in dynamical systems, such as sensor networks and teams of mobile robots. Using assumed density filtering, the network nodes maintain a tractable representation of the belief state in a distributed fashion. At each time step, the nodes coordinate to condition this distribution on the observations made throughout the network, and to advance this estimate to the next time step. In addition, we identify a significant challenge for probabilistic inference in dynamical systems: message losses or network partitions can cause nodes to have inconsistent beliefs about the current state of the system. We address this problem by developing distributed algorithms that guarantee that nodes will reach an informative consistent distribution when communication is re-established. We present a suite of experimental results on real-world sensor data for two real sensor network deployments: one with 25 cameras and another with 54 temperature sensors. 1

6 0.076023139 154 nips-2006-Optimal Change-Detection and Spiking Neurons

7 0.074049704 149 nips-2006-Nonnegative Sparse PCA

8 0.068840198 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

9 0.067534238 66 nips-2006-Detecting Humans via Their Pose

10 0.060001384 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

11 0.055686437 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

12 0.055358719 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

13 0.055066906 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

14 0.053466685 50 nips-2006-Chained Boosting

15 0.045652397 79 nips-2006-Fast Iterative Kernel PCA

16 0.044218566 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

17 0.039895669 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification

18 0.039461263 129 nips-2006-Map-Reduce for Machine Learning on Multicore

19 0.039399102 186 nips-2006-Support Vector Machines on a Budget

20 0.037571888 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.154), (1, -0.01), (2, -0.025), (3, 0.008), (4, -0.013), (5, -0.02), (6, -0.002), (7, -0.025), (8, 0.058), (9, 0.037), (10, -0.026), (11, 0.026), (12, 0.077), (13, -0.027), (14, 0.023), (15, -0.113), (16, -0.026), (17, 0.006), (18, 0.087), (19, 0.079), (20, 0.036), (21, -0.021), (22, 0.039), (23, -0.008), (24, -0.003), (25, -0.278), (26, -0.191), (27, 0.283), (28, -0.157), (29, -0.022), (30, -0.346), (31, 0.063), (32, 0.057), (33, 0.047), (34, 0.102), (35, 0.072), (36, 0.045), (37, -0.116), (38, 0.013), (39, 0.075), (40, 0.075), (41, 0.032), (42, 0.032), (43, 0.02), (44, -0.032), (45, 0.15), (46, -0.03), (47, -0.129), (48, -0.086), (49, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94751126 96 nips-2006-In-Network PCA and Anomaly Detection

Author: Ling Huang, Xuanlong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, Nina Taft

Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.

2 0.79647845 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization

Author: Alfred O. Hero

Abstract: We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high dimensional feature spaces. 1

3 0.4327777 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Rachel Puckrin, Sean Cutler

Abstract: We present a hierarchical Bayesian model for sets of related, but different, classes of time series data. Our model performs alignment simultaneously across all classes, while detecting and characterizing class-specific differences. During inference the model produces, for each class, a distribution over a canonical representation of the class. These class-specific canonical representations are automatically aligned to one another — preserving common sub-structures, and highlighting differences. We apply our model to compare and contrast solenoid valve current data, and also, liquid-chromatography-ultraviolet-diode array data from a study of the plant Arabidopsis thaliana. 1 Aligning Time Series From Different Classes Many practical problems over a wide range of domains require synthesizing information from several noisy examples of one or more categories in order to build a model which captures common structure and also learns the patterns of variability between categories. In time series analysis, these modeling goals manifest themselves in the tasks of alignment and difference detection. These tasks have diverse applicability, spanning speech & music processing, equipment & industrial plant diagnosis/monitoring, and analysis of biological time series such as microarray & liquid/gas chromatography-based laboratory data (including mass spectrometry and ultraviolet diode arrays). Although alignment and difference detection have been extensively studied as separate problems in the signal processing and statistical pattern recognition communities, to our knowledge, no existing model performs both tasks in a unified way. Single class alignment algorithms attempt to align a set of time series all together, assuming that variability across different time series is attributable purely to noise. In many real-world situations, however, we have time series from multiple classes (categories) and our prior belief is that there is both substantial shared structure between the class distributions and, simultaneously, systematic (although often rare) differences between them. While in some circumstances (if differences are small and infrequent), single class alignment can be applied to multi-class data, it is much more desirable to have a model which performs true multi-class alignment in a principled way, allowing for more refined and accurate modeling of the data. In this paper, we introduce a novel hierarchical Bayesian model which simultaneously solves the multi-class alignment and difference detection tasks in a unified manner, as illustrated in Figure 1. The single-class alignment shown in this figure coerces the feature in region A for class 1 to be inappropriately collapsed in time, and the overall width of the main broad peak in class 2 to be inappropriately narrowed. In contrast, our multi-class model handles these features correctly. Furthermore, because our algorithm does inference for a fully probabilistic model, we are able to obtain quantitative measures of the posterior uncertainty in our results, which, unlike the point estimates produced by most current approaches, allow us to assess our relative confidence in differences learned by the model. Our basic setup for multi-class alignment assumes the class labels are known for each time series, as is the case for most difference detection problems. However, as we discuss at the end of the paper, our model can be extended to the completely unsupervised case. 
normal abnormal 3 common structure 3 1 1 20 0 3 −20 class−specific differences 20 1 0 −20 3 class−specific models 3 A 1 1 0 50 100 150 200 250 0 50 100 150 200 Figure 1: Nine time series from the NASA valve solenoid current data set [4]. Four belong to a ‘normal’ class, and five to an ‘abnormal’ class. On all figures, the horizontal axis is time, or latent time for figures of latent traces and observed time series aligned to latent traces. The vertical axis is current amplitude. Top left: The raw, unaligned data. Middle left: Average of the unaligned data within each class in thick line, with the thin lines showing one standard deviation on either side. Bottom left: Average of the aligned data (over MCMC samples) within each class, using the single-class alignment version of the model (no child traces), again with one standard deviation lines shown in the thinner style line. Right: Mean and one standard deviation over MCMC samples using the HB-CPM. Top right: Parent trace. Middle right: Class-specific energy impulses with the topmost showing the class impulses for the less smooth class. Bottom right: Child traces superimposed. Note that if one generates more HB-CPM MCMC samples, the parent cycles between the two classes since the model has no preference for which class is seen as a modification of the other; the child classes remain stable however. 2 A Hierarchical Bayesian Continuous Profile Model Building on our previous Continuous Profile Model (CPM) [7], we propose a Hierarchical Bayesian Continuous Profile Model (HB-CPM) to address the problems of multi-class alignment and difference detection, together, for sets of sibling time series data — that is, replicate time series from several distinct, but related classes. The HB-CPM is a generative model that allows simultaneous alignment of time series and also provides aligned canonical representations of each class along with measures of uncertainty on these representations. Inference in the model can be used, for example, to detect and quantify similarities and differences in class composition. The HB-CPM extends the basic CPM in two significant ways: i) it addresses the multi-class rather than the single-class alignment problem, and ii) it uses a fully Bayesian framework rather than a maximum likelihood approach, allowing us to estimate uncertainty in both the alignments and the canonical representations. Our model, depicted in Figure 2, assumes that each observed time series is generated as a noisy transformation of a single, class-specific latent trace. Each latent trace is an underlying, noiseless representation of the set of replicated, observable time series belonging to a single class. An observed time series is generated from this latent trace exactly as in the original CPM, by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as with an HMM. Each hidden state corresponds to a ‘latent time’ in the latent trace. Thus different choices of hidden state sequences result in different nonlinear transformations of the underlying trace. The HB-CPM uses a separate latent trace for each class, which we call child traces. Crucially, each of these child traces is generated from a single parent trace (also unobserved), which 250 captures the common structure among all of the classes. 
The joint prior distribution for the child traces in the HB-CPM model can be realized by first sampling a parent trace, and then, for each class, sampling a sparse ‘difference vector’ which dictates how and where each child trace should differ from the common parent. z Figure 2: Core elements of the HB-CPM, illustrated with two-class data (hidden and observed) drawn from the model’s prior. parent r1 r2 z1 impulse z2 child trace child trace x1 x2 x3 class 1 observed time series 2.1 impulse x4 x5 x6 class 2 observed time series The Prior on Latent Traces Let the vector xk = (xk , xk , ..., xk ) represent the k th observed scalar time series, and w k ∈ 1..C 1 2 N be the class label of this time series. Also, let z = (z1 , z2 , ..., zM ) be the parent trace, and c c c z c = (z1 , z2 , ..., zM ) be the child trace for the cth class. During inference, posterior samples of c z form a canonical representation of the observed times series in class c, and z contains their common sub-structure. Ideally, the length of the latent traces, M , would be very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments, and found this to be sufficient with < 0.2. Because the resolution of the latent traces is higher than that of the observed time series, experimental time can be made to effectively speed up or slow down by advancing along the latent trace in larger or smaller jumps. As mentioned previously, the child traces in the HB-CPM inherit most of their structure from a common parent trace. The differences between child and parent are encoded in a difference vector for each class, dc = (dc , dc , ..., dc ); normally, most elements of dc are close to zero. Child traces 1 2 M are obtained by adding this difference vector to the parent trace: z c = z + dc . We model both the parent trace and class-specific difference vectors with what we call an energy impulse chain, which is an undirected Markov chain in which neighbouring nodes are encouraged to be similar (i.e., smooth), and where this smoothness is perturbed by a set of marginally independent energy impulse nodes, with one energy impulse node attached to each node in the chain. For the difc c c ference vector of the cth class, the corresponding energy impulses are denoted r c = (r1 , r2 , ..., rM ), and for the parent trace the energy impulses are denoted r = (r1 , r2 , ..., rM ). Conditioned on the energy impulses, the probability of a difference vector is p(dc |r c , αc , ρc ) = 1 1 exp − Zr c 2 M −1 i=1 M c (dc − dc )2 (dc − ri )2 i i+1 i + c c α ρ i=1 . (1) Here, Zrc is the normalizing constant for this probability density, αc controls the smoothness of the chain, and ρc controls the influence of the energy impulses. Together, αc and ρc also control the overall tightness of the distribution for dc . Presently, we set all αc = α , and similarly ρc = ρ — that is, these do not differ between classes. Similarly, the conditional probability of the parent trace is p(z|r, α, ρ) = 1 1 exp − Zr 2 M −1 i=1 M (zi − zi+1 )2 (zi − ri )2 + α ρ i=1 . (2) These probability densities are each multivariate Gaussian with tridiagonal precision matrixes (corresponding to the Markov nature of the interactions). 
Each component of each energy impulse for the parent, rj , is drawn independently from a single univariate Gaussian, N (ri |µpar , spar ), whose mean and variance are in turn drawn from a Gaussian and inverse-gamma, respectively. The class-specific difference vector impulses, however, are drawn from a mixture of two zero-mean Gaussians — one ‘no difference’ (inlier) Gaussian, and one ‘classdifference’ (outlier) Gaussian. The means are zero so as to encourage difference vectors to be near c zero (and thus child traces to be similar to the parent trace). Letting δi denote the binary latent c mixture component indicator variables for each rj , c c c c p(δj ) = Multinomial(δj |mc , mc ) = (mc )δj (mc )1−δj in out in out c c p(rj |δj ) = c N (rj |0, s2 ), in c N (rj |0, s2 ), out if if c δj c δj =1 . =0 (3) (4) Each Gaussian mixture variance has an Inverse-Gamma prior, which for the ‘no difference’ variance, s2 , is set to have very low mean (and not overly dispersed) so that ‘no difference’ regions truly have in little difference from the parent class, while for the ‘class-difference’ variance, s 2 , the prior is out set to have a larger mean, so as to model our belief that substantial class-specific differences do occasionally exist. The priors for αc , ρc , α, ρ are each log-normal (inverse-gamma priors would not be conjugate in this model, so we use log-normals which are easier to specify). Additionally, the mixing proportions, mc , mc , have a Dirichlet prior, which typically encodes our belief that the out in proportion that are ‘class differences’ is likely to be small. 2.2 The HMM Portion of the Model Each observed xk is modeled as being generated by an HMM conditioned on the appropriate child k trace, z w . The probability of an observed time series conditioned on a path of hidden time states, k N wk τ k , and the child trace, is given by p(xk |z w , τ k ) = i=1 N (xk |zτ k uk , ξ k ), where ξ k is the i i emission variance for time series k, and the scale factor, uk , allows for constant, global, multiplicak tive rescaling. The HMM transition probabilities T k (τi−1 → τik ) are multinomial within a limited k k range, with p (τi = a|τi−1 = b) = κ(a−b) for (a − b) ∈ [1, Jτ ] and pk (τi = a|τi−1 = b) = 0 for (a − b) < 1 or (a − b) > Jτ where Jτ is the maximum allowable number of consecutive time Jτ states that can be advanced in a single transition. (Of course, i=1 κk = 1.) This multinomial disi tribution, in turn, has a Dirichlet prior. The HMM emission variances, ξ k , have an inverse-gamma prior. Additionally, the prior over the first hidden time state is a uniform distribution over a constant number of states, 1..Q, where Q defines how large a shift can exist between any two observed time series. The prior over each global scaling parameter, uk , is a log-normal with fixed variance and mean of zero, which encourages the scaling factors to remain near unity. 3 Posterior Inference of Alignments and Parameters by MCMC Given a set of observed time series (and their associated class labels), the main computational operation to be performed in the HB-CPM is inference of the latent traces, alignment state paths and other model parameters. Exact inference is analytically intractable, but we are able to use Markov Chain Monte Carlo (MCMC) methods to create an iterative algorithm which, if run for sufficiently long, produces samples from the correct posterior distribution. 
This posterior provides simultaneous alignments of all observed time series in all classes, and also, crucially, aligned canonical representations of each class, along with error bars on these representations, allowing for a principled approach to difference detection in time series data from different classes. We may also wish to obtain a posterior estimate of some of our parameters conditioned on the data, and marginalized over the other parameters. In particular, we might be interested in obtaining the posterior over hidden time state vectors for each time series, τ k , which together provide a simultaneous, multi-class alignment of our data. We may, in addition, or, alternatively, be interested in the posterior of the child traces, z c , which together characterize how the classes agree and disagree. The former may be more of interest for visualizing aligned observed time series, or in expanding out aligned scalar time series to a related vector time series, while the latter would be more of interest when looking to characterize differences in multi-class, scalar time series data. We group our parameters into blocks, and sample these blocks conditioned on the values of the other parameters (as in Gibbs sampling) — however, when certain conditional distributions are not amenable to direct sampling, we use slice sampling [8]. The scalar conditional distributions for each c of µpar , spar , mc , mc , δj , κk are known distributions, amenable to direct sampling. The conditional out i in distributions for the scalars αc , ρc , α, ρ and uk are not tractable, and for each of these we use slice sampling (doubling out and shrinking). The conditional distribution for each of r and r c is multivariate Gaussian, and we sample directly from each using a Cholesky decomposition of the covariance matrix. 1 p(r|z, α, ρ) = p(z|r, α, ρ)p(r) = N (r|c, C) (5) Z 1 p(r c |dc , αc , ρc ) = p(dc |r, αc , ρc )p(r) = N (r c |b, B), (6) Z where, using I to denote the identity matrix, −1 µpar z S + Ispar −1 +I (7) , c=C C= ρ2 ρ spar B= S† 2 (ρc ) −1 +v c −1 , b=B dc . ρc (8) −1 The diagonal matrix v c consists of mixture component variances (s2 or s2 ). S −1 [or S † ] is the out in tridiagonal precision matrix of the multivariate normal distribution p(z|r, α, ρ) [or p(d c |r c , αc , ρc )], −1 −1 2 1 1 1 and has entries Sj,j = α + ρ for j = 2..(M − 1), Sj,j = α + ρ for j = 1, M , and −1 −1 −1 1 Sj,j+1 = Sj+1,j = − α [or analogously for S † ]. The computation of C and B can be made more efficient by using the Sherman-Morrison-Woodbury matrix inversion lemma. For example, −1 −1 −1 −1 −1 B = (ρ1)2 (S † − S † (v c + S † )−1 S † ), and we have S −1 [or S † ] almost for free, and c no longer need to invert S [or S † ] to obtain it. The conditional distributions of each of z, z c are also multivariate Gaussians. However, because of the underlying Markov dependencies, their precision matrixes are tridiagonal, and hence we can use belief propagation, in the style of Kalman filtering, followed by a stochastic traceback to sample from them efficiently. Thus each can be sampled in time proportional to M rather than M 3 , as required for a general multivariate Gaussian. Lastly, to sample from the conditional distribution of the hidden time vectors for each sample, τ k , we run belief propagation (analogous to the HMM forward-backward algorithm) followed by a stochastic traceback. In our experiments, the parent trace was initialized by averaging one smoothed example from each class. 
The child traces were initialized to the initial parent trace. The HMM states were initialized by a Viterbi decoding with respect to the initial values of the other parameters. The scaling factors were initialized to unity, and the child energy impulses to zero. MCMC was run for 5000 iterations, with convergence generally realized in less than 1000 iterations. 4 Experiments and Results We demonstrate use of the HB-CPM on two data sets. The first data set is the part of the NASA shuttle valve data [4], which measures valve solenoid current against time for some ‘normal’ runs and some ‘abnormal’ runs. Measurements were taken at a rate of 1ms per sample, with 1000 samples per time series. We subsampled the data by a factor of 7 in time since it was extremely dense. The results of performing posterior inference in our model on this two-class data set are shown in Figure 1. They nicely match our intuition of what makes a good solution. In our experiments, we also compared our model to a simple “single-class” version of the HB-CPM in which we simply remove the child trace level of the model, letting all observed data in both classes depend directly on one single parent trace. The single-class alignment, while doing a reasonable job, does so by coercing the two classes to look more similar than they should. This is evident in one particular region labeled on the graph and discussed in the legend. Essentially a single class alignment causes us to lose class-specific fine detail — the precise information we seek to retain for difference detection. The second data set is from a botany study which uses reverse-phase HPLC (high performance liquid chromatography) as a high-throughput screening method to identify genes involved in xenobiotic uptake and metabolism in the model plant Arabidopsis thaliana. Liquid-chromatography (LC) techniques are currently being developed and refined with the aim of providing a robust platform with which to detect differences in biological organisms — be they plants, animals or humans. Detected differences can reveal new fundamental biological insight, or can be applied in more clinical settings. LC-mass spectrometry technology has recently undergone explosive growth in tackling the problem of biomarker discovery — for example, detecting biological markers that can predict treatment outcome or severity of disease, thereby providing the potential for improved health care and better understanding of the mechanisms of drug and disease. In botany, LC-UV data is used to help understand the uptake and metabolism of compounds in plants by looking for differences across experimental conditions, and it is this type of data that we use here. LC separates mixtures of analytes on the basis of some chemical property — hydrophobicity, for reverse-phase LC, used to generate our data. Components of the analyte in our data set were detected as they came off the LC column with a Diode Array Detector (DAD), yielding UV-visible spectra collected at 540 time points (we used the 280 nm band, which is informative for these experiments). We performed background subtraction [2] and then subsampled this data by a factor of four. This is a three-class data set, where the first class is untreated plant extract, followed by two classes consisting of this same plant treated with compounds that were identified as possessing robust uptake in vivo, and, hence, when metabolized, provide a differential LC-UV signal of interest. 
Figure 3 gives an overview of the LC-UV results, while Figure 4 zooms in on a particular area of interest to highlight how subtle differences can be detected by the HB-CPM, but not by a singleclass alignment scheme. As with the NASA data set, a single-class alignment coerces features across classes that are in fact different to look the same, thereby preventing us from detecting them. Recall that this data set consists of a ‘no treatment’ plant extract, and two ‘treatments’ of this same plant. Though our model was not informed of these special relationships, it nevertheless elegantly captures this structure by giving almost no energy impulses to the ‘no treatment’ class, meaning that this class is essentially the parent trace, and allowing the ‘treatment’ classes to diverge from it, thereby nicely matching the reality of the situation. All averaging over MCMC runs shown is over 4000 samples, after a 1000 burn in period, which took around 3 hours for the NASA data, and 5 hours for the LC data set, on machines with dual 3 GHz Pentium 4 processors. 5 Related Work While much work has been done on time series alignment, and on comparison/clustering of time series, none of this work, to our knowledge, directly addresses the problem presented in this paper — simultaneously aligning and comparing sets of related time series in order to characterize how they differ from one another. The classical algorithm for aligning time series is Dynamic Time Warping (DTW) [10]. DTW works on pairs of time series, aligning one time series to a specified reference time, in a non-probabilistic way, without explicit allowance for differences in related time series. More recently, Gaffney et al [5] jointly clustered and aligned time series data from different classes. However, their model does not attempt to put time series from different classes into correspondence with one another — only time series within a class are aligned to one another. Ziv Bar-Joseph et al [1] use a similar approach to cluster and align microarray time series data. Ramsay et al [9] have introduced a curve clustering model, in which a time warping function, h(t), for each time series is learned by way of learning its relative curvature, parameterized with order one B-spline coefficients. This model accounts for 5 5 3 3 9 0 5 −9 9 0 −9 9 3 0 −9 5 5 3 3 0 50 100 150 200 250 300 0 50 100 150 200 250 Figure 3: Seven time series from each of three classes of LC-UV data. On all figures, the horizontal axis is time, or latent time for figures of latent traces and observed time series aligned to latent traces. The vertical axis is log of UV absorbance. Top left: The raw, unaligned data. Middle left: Average of the unaligned data within each class in thick line, with the thin lines showing one standard deviation on either side. Bottom left: Average of the aligned data within each class, using the single-class alignment version of the model (no child traces), again with one standard deviation lines shown in the thinner style line. Right: Mean and one standard deviation over MCMC samples using the HB-CPM model. Top right: Parent trace. Middle right: Class-specific energy impulses, with the top-most showing the class impulses for the ‘no treatment’ class. Bottom right: Child traces superimposed. See Figure 4 for a zoom-in in around the arrow. systematic changes in the range and domain of time series in a way that aligns curves with the same fundamental shape. 
However, their method does not allow for class-specific differences between shapes to be taken into account. The anomaly detection (AD) literature deals with related, yet distinct, problems. For example, Chan et al. [3] build a model of one class of time series data (they use the same NASA valve data as in this paper) and then match test data, possibly belonging to another class (e.g., ‘abnormal’ shuttle valve data), to this model to obtain an anomaly score. The emphasis in the AD community is on detecting abnormal events relative to a normal baseline, in an on-line manner, rather than on comparing and contrasting two or more classes from a dataset containing examples of all classes. The problem of ‘elastic curve matching’ is addressed in [6], where a target time series that best matches a query series is found by mapping the problem of finding the best matching subsequence to the problem of finding the cheapest path in a DAG (directed acyclic graph).

6 Discussion and Conclusion

We have introduced a hierarchical Bayesian model to perform detection of rare differences between sets of related time series, a problem which arises across a wide range of domains. By training our model, we obtain the posterior distribution over a set of class-specific canonical representations of each class, which are aligned in a way that preserves their common sub-structures yet retains and highlights important differences. This model can be extended in several interesting and useful ways. One small modification could be useful for the LC-UV data set presented in this paper, in which one of the classes was ‘no treatment’ while the other two were each a different ‘treatment’. We might model the ‘no treatment’ class as the parent trace, and each of the treatments as a child trace, so that the direct comparison of interest would be made more explicit. Another direction would be to apply the HB-CPM in a completely unsupervised setting where we learn not only the canonical class representations, but also obtain the posterior over the class labels by introducing a latent class indicator variable. Lastly, one could use a model with cyclical latent traces to model cyclic data such as electrocardiogram (ECG) and climate data. In such a model, an observed trace being generated by the model would be allowed to cycle back to the start of the latent trace, and the smoothness constraints on the trace would be extended to apply to the beginning and end of the traces, coercing these to be similar. Such a model would allow one to do anomaly detection in cyclic data, as well as segmentation.

Figure 4: Left: a zoom-in of the data displayed in Figure 3, from the region of time 100-150 (labeled in that figure in latent time, not observed time). Top left: mean and standard deviation of the unaligned data. Middle left: mean and standard deviation of the single-class alignment. Bottom left: mean and standard deviation of the child traces from the HB-CPM. A case in point of a difference that could be detected with the HB-CPM and not in the raw or single-class aligned data is the difference occurring at time point 127. Right: the mean and standard deviation of the child energy impulses, with dashed lines showing correspondences with the child traces in the bottom left panel.

Acknowledgments: Thanks to David Ross and Roland Memisevic for useful discussions, and Ben Marlin for his Matlab slice sampling code.

References

[1] Z. Bar-Joseph, G. Gerber, D. K. Gifford, T. Jaakkola, and I. Simon. A new approach to analyzing gene expression time series data. In RECOMB, pages 39–48, 2002.
[2] H. Boelens, R. Dijkstra, P. Eilers, F. Fitzpatrick, and J. Westerhuis. New background correction method for liquid chromatography with diode array detection, infrared spectroscopic detection and Raman spectroscopic detection. Journal of Chromatography A, 1057:21–30, 2004.
[3] P. K. Chan and M. V. Mahoney. Modeling multiple time series for anomaly detection. In ICDM, 2005.
[4] B. Ferrell and S. Santuro. NASA shuttle valve data. http://www.cs.fit.edu/~pkc/nasa/data/, 2005.
[5] S. J. Gaffney and P. Smyth. Joint probabilistic curve clustering and alignment. In Advances in Neural Information Processing Systems 17, 2005.
[6] L. Latecki, V. Megalooikonomou, Q. Wang, R. Lakaemper, C. Ratanamahatana, and E. Keogh. Elastic partial matching of time series, 2005.
[7] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili. Multiple alignment of continuous time series. In Advances in Neural Information Processing Systems 17, 2005.
[8] R. M. Neal. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[9] J. Ramsay and X. Li. Curve registration. Journal of the Royal Statistical Society (B), 60, 1998.
[10] H. Sakoe and S. Chiba. Dynamic programming algorithm for spoken word recognition. Readings in Speech Recognition, pages 159–165, 1990.
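The elastic-matching formulation cited as reference [6] above (best-matching subsequence as a cheapest path in a DAG) is closely related to dynamic time warping. The following is a minimal sketch of that dynamic-programming recurrence on hypothetical toy series; it is not the subsequence-matching algorithm of [6].

# Minimal dynamic-time-warping distance between two 1-D series (hypothetical
# toy data); illustrates the cheapest-path / dynamic-programming view of
# elastic matching, not the algorithm of reference [6].
import numpy as np

def dtw_distance(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # cheapest path into cell (i, j): diagonal, vertical, or horizontal step
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

query = np.sin(np.linspace(0, 6, 80))
target = np.sin(np.linspace(0, 6, 100) + 0.3)
print(dtw_distance(query, target))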

4 0.33373895 149 nips-2006-Nonnegative Sparse PCA

Author: Ron Zass, Amnon Shashua

Abstract: We describe a nonnegative variant of the "Sparse PCA" problem. The goal is to create a low-dimensional representation from a collection of points which, on the one hand, maximizes the variance of the projected points and, on the other, uses only parts of the original coordinates, thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness, since it enforces a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criterion, giving good results on large real-world datasets. 1
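For intuition only, the sketch below finds a single nonnegative, sparsity-encouraging direction by projected gradient ascent on the penalized variance w'Cw - beta*sum(w), with w >= 0 and ||w|| <= 1. It is not the coordinate-descent scheme of the paper; the data, the penalty beta and the step size are hypothetical.

# Hedged sketch: one nonnegative, sparse principal direction via projected
# gradient ascent; constraints only illustrate the flavor of the problem.
import numpy as np

def nonneg_sparse_pc(X, beta=0.1, step=0.05, iters=500):
    Xc = X - X.mean(axis=0)                  # center the data
    C = Xc.T @ Xc / len(Xc)                  # sample covariance
    w = np.abs(np.random.randn(C.shape[0]))
    w /= np.linalg.norm(w)
    for _ in range(iters):
        grad = 2 * C @ w - beta              # gradient of variance minus L1 penalty
        w = np.maximum(w + step * grad, 0.0) # project onto the nonnegative orthant
        norm = np.linalg.norm(w)
        if norm > 1.0:                       # project back into the unit ball
            w /= norm
    return w

X = np.random.randn(200, 10)                 # hypothetical data
print(nonneg_sparse_pc(X))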

5 0.31224045 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

Author: Silvia Richter, Douglas Aberdeen, Jin Yu

Abstract: Current road-traffic optimisation practice around the world is a combination of hand-tuned policies with a small degree of automatic adaptation. Even state-of-the-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural actor-critic approaches to work for distributed and online infinite-horizon problems. 1
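As a toy illustration of the vanilla policy-gradient baseline mentioned in the abstract, the sketch below runs a REINFORCE-style update with a softmax policy on a hypothetical two-action problem; the environment, learning rate and running-average baseline are assumptions, not the Sydney traffic system or the natural actor-critic algorithm.

# Hedged sketch: vanilla policy gradient on a made-up two-action problem.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                          # one preference per action

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward(action):
    # hypothetical payoffs: action 0 pays 0.3 on average, action 1 pays 0.7
    return rng.normal(0.3 if action == 0 else 0.7, 0.1)

baseline, lr = 0.0, 0.1
for t in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = reward(a)
    baseline += 0.01 * (r - baseline)        # running-average baseline reduces variance
    grad_logp = -probs
    grad_logp[a] += 1.0                      # grad of log pi(a|theta) for a softmax policy
    theta += lr * (r - baseline) * grad_logp

print(softmax(theta))                        # should put most mass on action 1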

6 0.29814526 69 nips-2006-Distributed Inference in Dynamical Systems

7 0.26778695 155 nips-2006-Optimal Single-Class Classification Strategies

8 0.26449475 129 nips-2006-Map-Reduce for Machine Learning on Multicore

9 0.26337391 174 nips-2006-Similarity by Composition

10 0.23594901 154 nips-2006-Optimal Change-Detection and Spiking Neurons

11 0.23355056 50 nips-2006-Chained Boosting

12 0.22442162 66 nips-2006-Detecting Humans via Their Pose

13 0.22223966 105 nips-2006-Large Margin Component Analysis

14 0.21899197 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

15 0.21137032 79 nips-2006-Fast Iterative Kernel PCA

16 0.2034533 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

17 0.19935879 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

18 0.19889607 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

19 0.19805205 139 nips-2006-Multi-dynamic Bayesian Networks

20 0.19743484 18 nips-2006-A selective attention multi--chip system with dynamic synapses and spiking neurons


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.067), (3, 0.012), (7, 0.053), (9, 0.034), (20, 0.012), (22, 0.058), (44, 0.55), (57, 0.062), (65, 0.036), (69, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99091715 69 nips-2006-Distributed Inference in Dynamical Systems

Author: Stanislav Funiak, Carlos Guestrin, Rahul Sukthankar, Mark A. Paskin

Abstract: We present a robust distributed algorithm for approximate probabilistic inference in dynamical systems, such as sensor networks and teams of mobile robots. Using assumed density filtering, the network nodes maintain a tractable representation of the belief state in a distributed fashion. At each time step, the nodes coordinate to condition this distribution on the observations made throughout the network, and to advance this estimate to the next time step. In addition, we identify a significant challenge for probabilistic inference in dynamical systems: message losses or network partitions can cause nodes to have inconsistent beliefs about the current state of the system. We address this problem by developing distributed algorithms that guarantee that nodes will reach an informative consistent distribution when communication is re-established. We present a suite of experimental results on real-world sensor data for two real sensor network deployments: one with 25 cameras and another with 54 temperature sensors. 1
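The local building block behind assumed density filtering can be illustrated on a scalar linear-Gaussian state, where projecting onto a Gaussian family is exact and the filter reduces to the Kalman filter. The sketch below shows only this single-node step; the distributed conditioning across the network described in the abstract is not modeled, and all constants are hypothetical.

# Hedged sketch: Gaussian assumed density filtering on a scalar
# linear-Gaussian state (i.e. a single-node Kalman filter).
import numpy as np

a, q = 0.9, 0.1       # state transition x_t = a*x_{t-1} + noise(q)
h, r = 1.0, 0.5       # observation     y_t = h*x_t     + noise(r)

mu, var = 0.0, 1.0    # current Gaussian belief over the state
rng = np.random.default_rng(1)
x = 0.0
for t in range(50):
    x = a * x + rng.normal(0, np.sqrt(q))    # true (hidden) state
    y = h * x + rng.normal(0, np.sqrt(r))    # local observation
    # predict: push the belief through the dynamics
    mu, var = a * mu, a * a * var + q
    # update: condition on y and project back to a Gaussian (exact here)
    k = var * h / (h * h * var + r)          # Kalman gain
    mu, var = mu + k * (y - h * mu), (1 - k * h) * var

print(mu, var)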

same-paper 2 0.95808882 96 nips-2006-In-Network PCA and Anomaly Detection

Author: Ling Huang, Xuanlong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, Nina Taft

Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.
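To make the residual-subspace idea concrete, the sketch below fits the top principal components on synthetic "normal" traffic, computes the squared prediction error (SPE) of new measurements in the residual subspace, and flags time steps whose SPE exceeds a simple empirical threshold. The data, the subspace size k = 4 and the threshold rule are assumptions for illustration; the adaptive local filters and the coordinator protocol of the paper are not shown.

# Hedged sketch: PCA residual-subspace anomaly detection on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 20))                    # hypothetical mixing of 20 link loads
train = rng.normal(size=(500, 20)) @ A * 0.1     # "normal" traffic history
test = rng.normal(size=(200, 20)) @ A * 0.1
test[50] += rng.normal(size=20) * 3.0            # inject one anomalous time step

mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
P = Vt[:4].T                                     # top-4 principal directions

def spe(Y):
    R = (Y - mean) - (Y - mean) @ P @ P.T        # projection onto the residual subspace
    return (R ** 2).sum(axis=1)                  # squared prediction error per time step

threshold = np.percentile(spe(train), 99.9)      # crude empirical threshold
print(np.where(spe(test) > threshold)[0])        # should include time step 50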

3 0.95357281 116 nips-2006-Learning from Multiple Sources

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1

4 0.9225508 57 nips-2006-Conditional mean field

Author: Peter Carbonetto, Nando D. Freitas

Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean field approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-specified variational approximations based on mean field theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. Moreover, experiments on a well-studied problem— inferring the stable configurations of the Ising spin glass—show that the solutions can be significantly better than those obtained using sum-product-based methods. 1
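As a point of reference for the conventional baseline discussed in the abstract, the sketch below runs naive mean-field fixed-point updates m_i = tanh(sum_j J_ij m_j + h_i) on a small Ising grid with hypothetical couplings; it does not implement the conditional mean field or the sequential Monte Carlo combination proposed in the paper.

# Hedged sketch: naive mean field for a 4x4 Ising model with made-up parameters.
import numpy as np

rng = np.random.default_rng(3)
n = 16                                   # 4x4 grid, spins in {-1, +1}
J = np.zeros((n, n))
for i in range(4):
    for j in range(4):
        s = 4 * i + j
        if j + 1 < 4: J[s, s + 1] = J[s + 1, s] = 0.5   # horizontal coupling
        if i + 1 < 4: J[s, s + 4] = J[s + 4, s] = 0.5   # vertical coupling
h = rng.normal(0, 0.1, size=n)           # small random external field

m = np.zeros(n)                          # mean-field magnetisations E[x_i]
for _ in range(200):
    m = np.tanh(J @ m + h)               # fixed-point iteration m_i = tanh(sum_j J_ij m_j + h_i)

print(m.reshape(4, 4))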

5 0.86328477 139 nips-2006-Multi-dynamic Bayesian Networks

Author: Karim Filali, Jeff A. Bilmes

Abstract: We present a generalization of dynamic Bayesian networks to concisely describe complex probability distributions such as in problems with multiple interacting variable-length streams of random variables. Our framework incorporates recent graphical model constructs to account for existence uncertainty, value-specific independence, aggregation relationships, and local and global constraints, while still retaining a Bayesian network interpretation and efficient inference and learning techniques. We introduce one such general technique, which is an extension of Value Elimination, a backtracking search inference algorithm. Multi-dynamic Bayesian networks are motivated by our work on Statistical Machine Translation (MT). We present results on MT word alignment in support of our claim that MDBNs are a promising framework for the rapid prototyping of new MT systems. 1 INTRODUCTION The description of factorization properties of families of probabilities using graphs (i.e., graphical models, or GMs), has proven very useful in modeling a wide variety of statistical and machine learning domains such as expert systems, medical diagnosis, decision making, speech recognition, and natural language processing. There are many different types of graphical model, each with its own properties and benefits, including Bayesian networks, undirected Markov random fields, and factor graphs. Moreover, for different types of scientific modeling, different types of graphs are more or less appropriate. For example, static Bayesian networks are quite useful when the size of set of random variables in the domain does not grow or shrink for all data instances and queries of interest. Hidden Markov models (HMMs), on the other hand, are such that the number of underlying random variables changes depending on the desired length (which can be a random variable), and HMMs are applicable even without knowing this length as they can be extended indefinitely using online inference. HMMs have been generalized to dynamic Bayesian networks (DBNs) and temporal conditional random fields (CRFs), where an underlying set of variables gets repeated as needed to fill any finite but unbounded length. Probabilistic relational models (PRMs) [5] allow for a more complex template that can be expanded in multiple dimensions simultaneously. An attribute common to all of the above cases is that the specification of rules for expanding any particular instance of a model is finite. In other words, these forms of GM allow the specification of models with an unlimited number of random variables (RVs) using a finite description. This is achieved using parameter tying, so while the number of RVs increases without bound, the number of parameters does not. In this paper, we introduce a new class of model we call multi-dynamic Bayesian networks. MDBNs are motivated by our research into the application of graphical models to the domain of statistical machine translation (MT) and they have two key attributes from the graphical modeling perspective. First, an MDBN generalizes a DBN in that there are multiple “streams” of variables that can get unrolled, but where each stream may be unrolled by a differing amount. In the most general case, connecting these different streams together would require the specification of conditional probabil- ity tables with a varying and potentially unlimited number of parents. 
To avoid this problem and retain the template’s finite description length, we utilize a switching parent functionality (also called value-specific independence). Second, in order to capture the notion of fertility in MT-systems (defined later in the text), we employ a form of existence uncertainty [7] (that we call switching existence), whereby the existence of a given random variable might depend on the value of other random variables in the network. Being fully propositional, MDBNs lie between DBNs and PRMs in terms of expressiveness. While PRMs are capable of describing any MDBN, there are, in general, advantages to restricting ourselves to a more specific class of model. For example, in the DBN case, it is possible to provide a bound on inference costs just by looking at attributes of the DBN template only (e.g., the left or right interfaces [12, 2]). Restricting the model can also make it simpler to use in practice. MDBNs are still relatively simple, while at the same time making possible the easy expression of MT systems, and opening doors to novel forms of probabilistic inference as we show below. In section 2, we introduce MDBNs, and describe their application to machine translation showing how it is possible to represent even complex MT systems. In section 3, we describe MDBN learning and decoding algorithms. In section 4, we present experimental results in the area of statistical machine translation, and future work is discussed in section 5. 2 MDBNs A standard DBN [4] template consists of a directed acyclic graph G = (V, E) = (V1 ∪ V2 , E1 ∪ → E2 ∪ E2 ) with node set V and edge set E. For t ∈ {1, 2}, the sets Vt are the nodes at slice t, Et → are the intra-slice edges between nodes in Vt , and Et are the inter-slice edges between nodes in V1 and V2 . To unroll a DBN to length T , the nodes V2 along with the edges adjacent to any node in V2 are cloned T − 1 times (where parameters of cloned variables are constrained to be the same as the template) and re-connected at the corresponding places. An MDBN with K streams consists of the union of K DBN templates along with a template structure specifying rules to connect the various streams together. An MDBN template is a directed graph (k) G = (V, E) = ( V (k) , E (k) ∪ E ) k (k) (k) th k (k) where (V , E ) is the k DBN, and the edges E are rules specifying how to connect stream k to the other streams. These rules are general in that they specify the set of edges for all values of Tk . There can be arbitrary nesting of the streams such as, for example, it is possible to specify a model that can grow along several dimensions simultaneously. An MDBN also utilizes “switching existence”, meaning some subset of the variables in V bestow existence onto other variables in the network. We call these variables existence bestowing (or ebnodes). The idea of bestowing existence is well defined over a discrete space, and is not dissimilar to a variable length DBN. For example, we may have a joint distribution over lengths as follows: p(X1 , . . . , XN , N ) = p(X1 , . . . , Xn |N = n)p(N = n) where here N is an eb-node that determines the number of other random variables in the DGM. Our notion of eb-nodes allows us to model certain characteristics found within machine translation systems, such as “fertility” [3], where a given English word is cloned a random number of times in the generative process that explains a translation from French into English. This random cloning might happen simultaneously at all points along a given MDBN stream. 
This means that even for a given fixed stream length Ti = ti , each stream could have a randomly varying number of random variables. Our graphical notation for eb-nodes consists of the eb-node as a square box containing variables whose existence is determined by the eb-node. We start by providing a simple example of an expanded MDBN for three well known MT systems, namely the IBM models 1 and 2 [3], and the “HMM” model [15].1 We adopt the convention in [3] that our goal is to translate from a string of French words F = f of length M = m into a string of English words E = e of length L = l — of course these can be any two languages. The basic generative (noisy channel) approach when translating from French to English is to represent the joint 1 We will refer to it as M-HMM to avoid confusion with regular HMMs. distribution P (f , e) = P (f |e)P (e). P (e) is a language model specifying the prior over the word string e. The key goal is to produce a finite-description length representation for P (f |e) where f and e are of arbitrary length. A hidden alignment string, a, specifies how the English words align to the French word, leading to P (f |e) = a P (f , a|e). Figure 1(a) is a 2-stream MDBN expanded representation of the three models, in this case ℓ = 4 and m = 3. As shown, it appears that the fan-in to node fi will be ℓ and thus will grow without bound. However, a switching mechanism whereby P (fi |e, ai ) = P (fi |eai ) limits the number of parameters regardless of L. This means that the alignment variable ai indicates the English word eai that should be aligned to French word fi . The variable e0 is a null word that connects to French words not explained by any of e1 , . . . , eℓ . The graph expresses all three models — the difference is that, in Models 1 and 2, there are no edges between aj and aj+1 . In Model 1, p(aj = ℓ) is uniform on the set {1, . . . , L}; in Model 2, the distribution over aj is a function only of its position j, and on the English and French lengths ℓ and m respectively. In the M-HMM model, the ai variables form a first order Markov chain. l e0 ℓ e1 e3 e2 e1 e4 e2 e3 φ1 φ2 φ3 m’ φ0 τ01 a1 f2 a2 f3 a3 m (a) Models 1,2 and M-HMM τ12 τ13 τ21 π02 π11 π12 π13 π21 f2 f3 f4 f5 f6 a1 u v τ11 f1 f1 τ02 a2 a3 a4 a5 a6 π01 w y x m (b) Expanded M3 graph Figure 1: Expanded 2-stream MDBN description of IBM Models 1 and 2, and the M-HMM model for MT; and the expanded MDBN description of IBM Model 3 with fertility assignment φ0 = 2, φ1 = 3, φ2 = 1, φ3 = 0. From the above, we see that it would be difficult to express this model graphically using a standard DBN since L and M are unequal random variables. Indeed, there are two DBNs in operation, one consisting of the English string, and the other consisting of the French string and its alignment. Moreover, the fully connected structure of the graph in the figure can represent the appropriate family of model, but it also represents models whose parameter space grows without bound — the switching function allows the model template to stay finite regardless of L and M . With our MDBN descriptive abilities complete, it is now possible to describe the more complex IBM models 3, and 4[3] (an MDBN for Model3 is depicted in fig. 1(b)). The top most random variable, ℓ, is a hidden switching existence variable corresponding to the length of the English string. The box abutting ℓ includes all the nodes whose existence depends on the value of ℓ. 
In the figure, ℓ = 3, thus resulting in three English words e1 , e2 , and e3 connected using a second-order Markov chain. To each English word ei corresponds a conditionally dependent fertility eb-node φi , which indicates how many times ei is used by words in the French string. Each φi in turn controls the existence of a set of variables under it. Given the fertilities (the figure depicts the case φ1 = 3, φ2 = 1, φ3 = 0), for each word ei , φi French word variables are granted existence and are denoted by τi1 , τi2 , . . . , τiφi , what is called the tablet [3] of ei . The values taken by the τ variables need to match the actual observed French sequence f1 , . . . , fm . This is represented as a shared constraint between all the f , π, and τ variables which have incoming edges into the observed variable v. v’s conditional probability table is such that it is one only when the associated constraint is satisfied2 . The variable 2 This type of encoding of constraints corresponds to the standard mechanism used by Pearl [14]. A naive implementation, however, would enumerate a number of configurations exponential in the number of constrained variables, while typically only a small fraction of the configurations would have positive probability. πi,k ∈ {1, . . . , m} is a switching dependency parent with respect to the constraint variable v and determines which fj participates in an equality constraint with τi,k . The bottom variable m is a switching existence node (observed to be 6 in the figure) with corresponding French word sequence and alignment variables. The French sequence participates in the v constraint described above, while the alignment variables aj ∈ {1, . . . , ℓ}, j ∈ 1, . . . , m constrain the fertilities to take their unique allowable values (for the given alignment). Alignments also restrict the domain of permutation variables, π, using the constraint variable x. Finally, the domain size of each aj has to lie in the interval [0, ℓ] and that is enforced by the variable u. The dashed edges connecting the alignment a variables represent an extension to implement an M3/M-HMM hybrid. ℓ The null submodel involving the deterministic node m′ (= i=1 φi ) and eb-node φ0 accounts for French words that are not explained by any of the English words e1 , . . . , eℓ . In this submodel, successive permutation variables are ordered and this constraint is implemented using the observed child w of π0i and π0(i+1) . Model 4 [3] is similar to Model 3 except that the former is based on a more elaborate distortion model that uses relative instead of absolute positions both within and between tablets. 3 Inference, Parameter Estimation and MPE Multi-dynamic Bayesian Networks are amenable to any type of inference that is applicable to regular Bayesian networks as long as switching existence relationships are respected and all the constraints (aggregation for example) are satisfied. Unfortunately DBN inference procedures that take advantage of the repeatable template and can preprocess it offline, are not easy to apply to MDBNs. A case in point is the Junction Tree algorithm [11]. Triangulation algorithms exist that create an offline triangulated version of the input graph and do not re-triangulate it for each different instance of the input data [12, 2]. 
In MDBNs, due to the flexibility to unroll templates in several dimensions and to specify dependencies and constraints spanning the entire unrolled graph, it is not obvious how we can exploit any repetitive patterns in a Junction Tree-style offline triangulation of the graph template. In section 4, we discuss sampling inference methods we have used. Here we discuss our extension to a backtracking search algorithm with the same performance guarantees as the JT algorithm, but with the advantage of easily handling determinism, existence uncertainty, and constraints, both learned and explicitly stated. Value Elimination (VE) ([1]), is a backtracking Bayesian network inference technique that caches factors associated with portions of the search tree and uses them to avoid iterating again over the same subtrees. We follow the notation introduced in [1] and refer the reader to that paper for details about VE inference. We have extended the VE inference approach to handle explicitly encoded constraints, existence uncertainty, and to perform approximate local domain pruning (see section 4). We omit these details as well as others in the original paper and briefly describe the main data structure required by VE and sketch the algorithm we refer to as FirstPass (fig. 1) since it constitutes the first step of the learning procedure, our main contribution in this section. A VE factor, F , is such that we can write the following marginal of the joint distribution P (X = x, Y = y, Z) = F.val × f (Z) X=x such that (X∪Y)∩Z = ∅, F.val is a constant, and f (Z) a function of Z only. Y is a set of variables previously instantiated in the current branch of search tree to the value vector y. The pair (Y, y) is referred to as a dependency set (F.Dset). X is referred to as a subsumed set (F.Sset). By caching the tuple (F.Dset, F.Sset, F.val), we avoid recomputing the marginal again whenever (1) F.Dset is active, meaning all nodes stored in F.Dset are assigned their cached values in the current branch of the search tree; and (2) none of the variables in F.Sset are assigned yet. FirstPass (alg. 1) visits nodes in the graph in Depth First fashion. In line 7, we get the values of all Newly Single-valued (NSV) CPTs i.e., CPTs that involve the current node, V , and in which all We use a general directed domain pruning constraint. Deterministic relationships then become a special case of our constraint whereby the domain of the child variable is constrained to a single value with probability one. Variable traversal order: A, B, C, and D. Factors are numbered by order of creation. *Fi denotes the activation of factor i. 
Tau values propagated recursively F7: Dset={} Sset={A,B,C,D} val=P(E=e) F7.tau = 1.0 = P(Evidence)/F7.val A F5: Dset={A=0} Sset={B,C,D} F2 D *F1 *F2 Factor values needed for c(A=0) and c(C=0,B=0) computation: F5.val=P(B=0|A=0)*F3.val+P(B=1|A=0)*F4.val F3.val=P(C=0|B=0)*F1.val+P(C=1|B=0)*F2.val F4.val=P(C=0|B=1)*F1.val+P(C=1|B=1)*F2.val F1.val=P(D=0|C=0)P(E=e|D=0)+P(D=1|C=0)P(E=e|D=1) F2.val=P(D=0|C=1)P(E=e|D=0)+P(D=1|C=1)P(E=e|D=1) First pass C *F3 *F4 Second pass D B F4 C F6.tau = F7.tau * P(A=1) 1 B F3: Dset={B=0} Sset={C,D} F1 F5.tau = F7.tau * P(A=0) F6 0 F3.tau = F5.tau * P(B=0|A=0) + F6.tau * P(B=0|A=1) = P(B=0) F4.tau = F5.tau * P(B=1|A=0) + F6.tau * P(B=1|A=1) = P(B=1) F1.tau = F3.tau * P(C=0|B=0) + F4.tau * P(C=0|B=1) = P(C=0) F2.tau = F3.tau * P(C=1|B=0) + F4.tau * P(C=1|B=1) = P(C=1) c(A=0)=(1/P(e))*(F7.tau*P(A=0)*F5.val)=(1/P(e))(P(A=0)*P(E=e|A=0))=P(A=0|E=e) c(C=0,B=0)=(1/P(e))*F3.tau*P(C=0|B=0)*F1.val =(1/P(e) * (P(A=0,B=0)+P(A=1,B=0)) * P(C=0|B=0) * F1.val =(1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val =(1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val =(1/P(e)) * P(C=0,B=0) * F1.val =P(C=0,B=0,E=e)/P(e)=P(C=0,B=0|E=e) Figure 2: Learning example using the Markov chain A → B → C → D → E, where E is observed. In the first pass, factors (Dset, Sset and val) are learned in a bottom up fashion. Also, the normalization constant P (E = e) (probability of evidence) is obtained. In the second pass, tau values are updated in a top-down fashion and used to calculate expected counts c(F.head, pa(F.head)) corresponding to each F.head (the figure shows the derivations for (A=0) and (C=0,B=0), but all counts are updated in the same pass). other variables are already assigned (these variables and their values are accumulated into Dset). We also check for factors that are active, multiply their values in, and accumulate subsumed vars in Sset (to avoid branching on them). In line 10, we add V to the Sset. In line 11, we cache a new factor F with value F.val = sum. We store V into F.head, a pointer to the last variable to be inserted into F.Sset, and needed for parameter estimation described below. F.Dset consists of all the variables, except V , that appeared in any NSV CPT or the Dset of an activated factor at line 6. Regular Value Elimination is query-based, similar to variable elimination and recursive conditioning—what this means is that to answer a query of the type P (Q|E = e), where Q is query variable and E a set of evidence nodes, we force Q to be at the top of the search tree, run the backtracking algorithm and then read the answers to the queries P (Q = q|E = e), q ∈ Dom[Q], along each of the outgoing edges of Q. Parameter estimation would require running a number of queries on the order of the number of parameters to estimate. We extend VE into an algorithm that allows us to obtain Expectation Maximization sufficient statistics in a single run of Value Elimination plus a second pass, which can never take longer than the first one (and in practice is much faster). This two-pass procedure is analogous to the collect-distribute evidence procedure in the Junction Tree algorithm, but here we do this via a search tree. Let θX=x|pa(X)=y be a parameter associated with variable X with value x and parents Y = pa(X) when they have value y. 
Assuming a maximum likelihood learning scenario3 , to estimate θX=x|pa(X)=y , we need to compute f (X = x, pa(X) = y, E = e) = P (W, X = x, pa(X) = y, E = e) W\{X,pa(X)} which is a sum of joint probabilities of all configurations that are consistent with the assignment {X = x, pa(X) = y}. If we were to turn off factor caching, we would enumerate all such variable configurations and could compute the sum. When standard VE factors are used, however, this is no longer possible whenever X or any of its parents becomes subsumed. Fig. 2 illustrates an example of a VE tree and the factors that are learned in the case of a Markov chain with an evidence node at the end. We can readily estimate the parameters associated with variables A and B as they are not subsumed along any branch. C and D become subsumed, however, and we cannot obtain the correct counts along all the branches that would lead to C and D in the full enumeration case. To address this issue, we store a special value, F.tau, in each factor. F.tau holds the sum over all path probabilities from the first level of the search tree to the level at which the factor F was 3 For Bayesian networks the likelihood function decomposes such that maximizing the expectation of the complete likelihood is equivalent to maximizing the “local likelihood” of each variable in the network. either created or activated. For example, F 6.tau in fig. 2 is simply P (A = 1). Although we can compute F 3.tau directly, we can also compute it recursively using F 5.tau and F 6.tau as shown in the figure. This is because both F 5 and F 6 subsume F 3: in the context {F 5.Dset}, there exists a (unique) value dsub of F 5.head4 s.t. F 3 becomes activable. Likewise for F 6. We cannot compute F 1.tau directly, but we can, recursively, from F 3.tau and F 4.tau by taking advantage of a similar subsumption relationship. In general, we can show that the following recursive relationship holds: F pa .tau × N SVF pa .head=dsub × F.tau ← F pa ∈F pa Fact .val F.val Fact ∈Fact (1) where F pa is the set of factors that subsume F , Fact is the set of all factors (including F ) that become active in the context of {F pa .Dset, F pa .head = dsub } and N SVF pa .head=dsub is the product of all newly single valued CPTs under the same context. For top-level factors (not subsumed by any factor), F.tau = Pevidence /F.val, which is 1.0 when there is a unique top-level factor. Alg. 2 is a simple recursive computation of eq. 1 for each factor. We visit learned factors in the reverse order in which they were learned to ensure that, for any factor F ′ , F ′ .tau is incremented (line 13) by any F that might have activated F ′ (line 12). For example, in fig. 2, F 4 uses F 1 and F 2, so F 4.tau needs to be updated before F 1.tau and F 2.tau. In line 11, we can increment the counts for any NSV CPT entries since F.tau will account for the possible ways of reaching the configuration {F.Dset, F.head = d} in an equivalent full enumeration tree. Algorithm 1: FirstPass(level) 1 2 3 4 5 6 7 8 9 10 Input: Graph G Output: A list of learned factors and Pevidence Select var V to branch on if V ==NONE then return Sset={}, Dset={} for d ∈ Dom[V ] do V ←d prod = productOfAllNSVsAndActiveFactors(Dset, Sset) if prod != 0 then FirstPass(level+1) sum += prod Sset = Sset ∪ {V } cacheNewFactor(F.head ← V ,F.val ← sum, F.Sset ← Sset, F.Dset ← Dset); Algorithm 2: SecondPass() 1 2 3 4 5 6 7 8 9 10 11 12 13 Input: F : List of factors in the reverse order learned in the first pass and Pevidence . 
Result: Updated counts foreach F ∈ F do if F.Dset = {} then F.tau ← Pevidence /F.val else F.tau ← 0.0 Assign vars in F.Dset to their values V ← F.head (last node to have been subsumed in this factor) foreach d ∈ Dom[V ] do prod = productOfAllNSVsAndActiveFactors() prod∗ = F.tau foreach newly single-valued CPT C do count(C.child,C.parents)+=prod/Pevidence F ′ =getListOfActiveFactors() for F ′ ∈ F ′ do F ′ .tau+ = prod/F ′ .val Most Probable Explanation We compute MPE using a very similar two-pass algorithm. In the first pass, factors are used to store a maximum instead of a summation over variables in the Sset. We also keep track of the value of F.head at which the maximum is achieved. In the second pass, we recursively find the optimal variable configuration by following the trail of factors that are activated when we assign each F.head variable to its maximum value starting from the last learned factor. 4 Recall, F.head is the last variable to be added to a newly created factor in line 10 of alg. 1 4 MACHINE TRANSLATION WORD ALIGNMENT EXPERIMENTS A major motivation for pursuing the type of representation and inference described above is to make it possible to solve computationally-intensive real-world problems using large amounts of data, while retaining the full generality and expressiveness afforded by the MDBN modeling language. In the experiments below we compare running times of MDBNs to GIZA++ on IBM Models 1 through 4 and the M-HMM model. GIZA++ is a special-purpose optimized MT word alignment C++ tool that is widely used in current state-of-the-art phrase-based MT systems [10] and at the time of this writing is the only publicly available software that implements all of the IBM Models. We test on French-English 107 hand-aligned sentences5 from a corpus of the European parliament proceedings (Europarl [9]) and train on 10000 sentence pairs from the same corpus and of maximum number of words 40. The Alignment Error Rate (AER) [13] evaluation metric quantifies how well the MPE assignment to the hidden alignment variables matches human-generated alignments. Several pruning and smoothing techniques are used by GIZA and MDBNs. GIZA prunes low lexical (P (f |e)) probability values and uses a default small value for unseen (or pruned) probability table entries. For models 3 and 4, for which there is no known polynomial time algorithm to perform the full E-step or compute MPE, GIZA generates a set of high probability alignments using an MHMM and hill-climbing and collects EM counts over these alignments using M3 or M4. For MDBN models we use the following pruning strategy: at each level of the search tree we prune values which, together, account for the lowest specified percentage of the total probability mass of the product of all newly active CPTs in line 6 of alg. 1. This is a more effective pruning than simply removing low-probability values of each CPD because it factors in the joint contribution of multiple active variables. Table 1 shows a comparison of timing numbers obtained GIZA++ and MDBNs. The runtime numbers shown are for the combined tasks of training and decoding; however, training time dominates given the difference in size between train and test sets. For models 1 and 2 neither GIZA nor MDBNs perform any pruning. For the M-HMM, we prune 60% of probability mass at each level and use a Dirichlet prior over the alignment variables such that long-range transitions are exponentially less likely than shorter ones.6 This model achieves similar times and AER to GIZA’s. 
Interestingly, without any pruning, the MDBN M-HMM takes 160 minutes to complete while only marginally improving upon the pruned model. Experimenting with several pruning thresholds, we found that AER would worsen much more slowly than runtime decreases. Models 3 and 4 have treewidth equal to the number of alignment variables (because of the global constraints tying them) and therefore require approximate inference. Using Model 3, and a drastic pruning threshold that only keeps the value with the top probability at each level, we were able to achieve an AER not much higher than GIZA’s. For M4, it achieves a best AER of 31.7% while we do not improve upon Model3, most likely because a too restrictive pruning. Nevertheless, a simple variation on Model3 in the MDBN framework achieves a lower AER than our regular M3 (with pruning still the same). The M3-HMM hybrid model combines the Markov alignment dependencies from the M-HMM model with the fertility model of M3. MCMC Inference Sampling is widely used for inference in high-treewidth models. Although MDBNs support Likelihood Weighing, it is very inefficient when the probability of evidence is very small, as is the case in our MT models. Besides being slow, Markov chain Monte Carlo can be problematic when the joint distribution is not positive everywhere, in particular in the presence of determinism and hard constraints. Techniques such as blocking Gibbs sampling [8] try to address the problem. Often, however, one has to carefully choose a problem-dependent proposal distribution. We used MCMC to improve training of the M3-HMM model. We were able to achieve an AER of 32.8% (down from 39.1%) but using 400 minutes of uniprocessor time. 5 CONCLUSION The existing classes of graphical models are not ideally suited for representing SMT models because “natural” semantics for specifying the latter combine flavors of different GM types on top of standard directed Bayesian network semantics: switching parents found in Bayesian Multinets [6], aggregation relationships such as in Probabilistic Relational Models [5], and existence uncertainty [7]. We 5 Available at http://www.cs.washington.edu/homes/karim French and English have similar word orders. On a different language pair, a different prior might be more appropriate. With a uniform prior, the MDBN M-HMM has 36.0% AER. 6 Model Init M1 M2 M-HMM M3 M4 M3-HMM GIZA++ M1 M-HMM 1m45s (47.7%) N/A 2m02s (41.3%) N/A 4m05s (35.0%) N/A 2m50 (45%) 5m20s (38.5%) 5m20s (34.8%) 7m45s (31.7%) N/A MDBN M1 3m20s (48.0%) 5m30s (41.0%) 4m15s (33.0%) 12m (43.6%) 25m (43.6%) 9m30 (41.0%) M-HMM N/A N/A N/A 9m (42.5%) 23m (42.6%) 9m15s (39.1%) MCMC 400m (32.8%) Table 1: MDBN VE-based learning versus GIZA++ timings and %AER using 5 EM iterations. The columns M1 and M-HMM correspond to the model that is used to initialize the model in the corresponding row. The last row is a hybrid Model3-HMM model that we implemented using MDBNs and is not expressible using GIZA. have introduced a generalization of dynamic Bayesian networks to easily and concisely build models consisting of varying-length parallel asynchronous and interacting data streams. We have shown that our framework is useful for expressing various statistical machine translation models. We have also introduced new parameter estimation and decoding algorithms using exact and approximate searchbased probability computation. 
While our timing results are not yet as fast as a hand-optimized C++ program on the equivalent model, we have shown that even in this general-purpose framework of MDBNs, our timing numbers are competitive and usable. Our framework can of course do much more than the IBM and HMM models. One of our goals is to use this framework to rapidly prototype novel MT systems and develop methods to statistically induce an interlingua. We also intend to use MDBNs in other domains such as multi-party social interaction analysis. References [1] F. Bacchus, S. Dalmao, and T. Pitassi. Value elimination: Bayesian inference via backtracking search. In UAI-03, pages 20–28, San Francisco, CA, 2003. Morgan Kaufmann. [2] J. Bilmes and C. Bartels. On triangulating dynamic graphical models. In Uncertainty in Artificial Intelligence: Proceedings of the 19th Conference, pages 47–56. Morgan Kaufmann, 2003. [3] P. F. Brown, J. Cocke, S. A. Della Piettra, V. J. Della Piettra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June 1990. [4] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. AAAI, pages 524–528, 1988. [5] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300–1309, 1999. [6] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell., 82(1-2):45–74, 1996. [7] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3(4-5):697–707, May 2003. [8] C. Jensen, A. Kong, and U. Kjaerulff. Blocking Gibbs sampling in very large probabilistic expert systems. In International Journal of Human Computer Studies. Special Issue on Real-World Applications of Uncertain Reasoning., 1995. [9] P. Koehn. Europarl: A multilingual corpus for evaluation of machine http://www.isi.edu/koehn/publications/europarl, 2002. translation. [10] P. Koehn, F. Och, and D. Marcu. Statistical phrase-based translation. In NAACL/HLT 2003, 2003. [11] S. Lauritzen. Graphical Models. Oxford Science Publications, 1996. [12] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 2002. [13] F. J. Och and H. Ney. Improved statistical alignment models. In ACL, pages 440–447, Oct 2000. [14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988. [15] S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th conference on Computational linguistics, pages 836–841, Morristown, NJ, USA, 1996.
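For readers unfamiliar with the IBM alignment models referenced in the entry above, the sketch below runs the textbook EM procedure for IBM Model 1 on a tiny hypothetical parallel corpus; it is not the MDBN/Value Elimination implementation described there, and the corpus and iteration count are made up.

# Hedged sketch: EM for IBM Model 1 word alignment on a toy corpus.
from collections import defaultdict

corpus = [("the house", "la maison"),
          ("the book", "le livre"),
          ("a house", "une maison")]
pairs = [(e.split() + ["NULL"], f.split()) for e, f in corpus]

t = defaultdict(lambda: 1.0)                     # t(f|e), uniform (unnormalised) start
for _ in range(10):
    count = defaultdict(float)
    total = defaultdict(float)
    for es, fs in pairs:                         # E-step: expected alignment counts
        for f in fs:
            z = sum(t[(f, e)] for e in es)       # normaliser over candidate English words
            for e in es:
                p = t[(f, e)] / z                # posterior that e generated f
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]                 # M-step: renormalise translation table

print(round(t[("maison", "house")], 3))          # should be close to 1.0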

6 0.70902312 159 nips-2006-Parameter Expanded Variational Bayesian Methods

7 0.70620459 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier

8 0.6649552 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

9 0.66164792 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions

10 0.65424645 98 nips-2006-Inferring Network Structure from Co-Occurrences

11 0.6505717 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

12 0.64211464 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization

13 0.63942528 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

14 0.62883329 121 nips-2006-Learning to be Bayesian without Supervision

15 0.62386352 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

16 0.62253886 109 nips-2006-Learnability and the doubling dimension

17 0.61773902 192 nips-2006-Theory and Dynamics of Perceptual Bistability

18 0.61515659 193 nips-2006-Tighter PAC-Bayes Bounds

19 0.60756201 175 nips-2006-Simplifying Mixture Models through Function Approximation

20 0.60572153 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions