nips nips2007 nips2007-87 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Emre Kiciman, David Maltz, John C. Platt
Abstract: Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use approximate Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 10^4 possible faults from 10^5 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a mean-field variational approximation and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service. The inference is fast enough to analyze network logs with billions of entries in a matter of hours.
Reference: text
sentIndex sentText sentNum sentScore
1 Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. [sent-4, score-0.28]
2 We use approximate Bayesian inference to diagnose problems with web services. [sent-5, score-0.36]
3 This diagnosis problem is far larger than any previously attempted: it requires inference of 10^4 possible faults from 10^5 observations. [sent-6, score-0.889]
4 Further, such inference must be performed in less than a second. [sent-7, score-0.125]
5 Inference can be done at this speed by combining a mean-field variational approximation and the use of stochastic gradient descent to optimize a variational cost function. [sent-8, score-0.807]
6 We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service. [sent-9, score-0.429]
7 The inference is fast enough to analyze network logs with billions of entries in a matter of hours. [sent-10, score-0.386]
8 1 Introduction Internet content providers, such as MSN, Google and Yahoo, all depend on the correct functioning of the wide-area Internet to communicate with their users and provide their services. [sent-11, score-0.129]
9 When these content providers lose network connectivity with some of their users, it is critical that they quickly resolve the problem, even if the failure lies outside their own systems (see footnote 1 below). [sent-12, score-0.481]
10 One challenge is that content providers have little direct visibility into the wide-area Internet infrastructure and the causes of user request failures. [sent-13, score-0.468]
11 Requests may fail because of problems in the content provider’s systems or faults in the network infrastructure anywhere between the user and the content provider, including routers, proxies, firewalls, and DNS servers. [sent-14, score-0.821]
12 Other failing requests may be due to denial of service attacks or bugs in the user’s software. [sent-15, score-0.188]
13 To compound the diagnosis problem, these faults may be intermittent: we must use probabilistic inference to perform diagnosis, rather than using logic. [sent-16, score-0.889]
14 Not only do popular Internet content providers receive billions of HTTP requests a week, but the number of potential causes of failure is also large. [sent-18, score-0.561]
15 In this paper, we show that approximate Bayesian inference scales to handle this high rate of observations and accurately estimates the underlying failure rates of such a large number of potential causes of failure. [sent-20, score-0.52]
16 To scale Bayesian inference to Internet-sized problems, we must make several simplifying approximations. [sent-21, score-0.149]
17 First, we introduce a bipartite graphical model using overlapping noisy-ORs to model the interactions between faults and observations. [sent-22, score-0.573]
18 Second, we use mean-field variational inference to map the diagnosis problem to a reasonably-sized optimization problem. [sent-24, score-0.653]
19 (Footnote 1: A loss of connectivity to users translates directly into lost revenue and a sullied reputation for content providers, even if the cause of the problem is a third-party network component.) [sent-23, score-0.276]
20 Third, we further approximate the integral in the variational method. [sent-25, score-0.243]
21 Fourth, we speed up the optimization problem using stochastic gradient descent. [sent-26, score-0.296]
22 We describe the graphical model in Section 2, and the approximate inference in that model in Section 2.1. [sent-29, score-0.228]
23 We present inference results on synthetic and real data in Section 4 and then draw conclusions. [sent-31, score-0.165]
24 1.1 Previous Work. The original application of Bayesian diagnosis was in medicine. [sent-33, score-0.302]
25 One of the original diagnosis networks was QMR-DT [14], a bipartite graphical model that used noisy-OR to model symptoms given diseases. [sent-34, score-0.535]
26 Exact inference in such networks is intractable (exponential in the number of positive symptoms [2]), so different approximation and sampling algorithms were proposed. [sent-35, score-0.154]
27 Shwe and Cooper proposed likelihood-weighted sampling [13], while Jaakkola and Jordan proposed using a variational approximation to unlink each input to the network [3]. [sent-36, score-0.295]
28 More recently, researchers have applied Bayesian techniques for the diagnosis of computers and networks [1][12][16]. [sent-38, score-0.363]
29 This work has tended to avoid inference in large networks, due to speed constraints. [sent-39, score-0.181]
30 In contrast, we attack the enormous inference problem directly. [sent-40, score-0.174]
31 2 Graphical model of diagnosis. [Figure 1: The full graphical model for the diagnosis of Internet faults, built from Beta, Bernoulli, and Noisy-OR layers.] The initial graphical model for diagnosis is shown in Figure 1. [sent-41, score-1.482]
32 The matrix rij is typically very sparse, because there are only a small number of possible causes for the failure of any request. [sent-44, score-0.326]
33 The ri0 parameter models the probability of a spontaneous failure without any known cause. [sent-45, score-0.116]
34 The rij are set by elicitation of probabilities from an expert. [sent-46, score-0.153]
35 All of these underlying causes are modeled independently for each request, because possible faults in the system can be intermittent. [sent-49, score-0.545]
36 Each of the Bernoulli variables $D_{ij}$ depends on an underlying continuous fault rate variable $F_j \in [0, 1]$: $P(D_{ij}|F_j = \mu_j) = \mu_j^{d_{ij}} (1 - \mu_j)^{1 - d_{ij}}$, (2) where $\mu_j$ is the probability of a fault manifesting at any time. [sent-50, score-0.592]
37 We model the $F_j$ as independent Beta distributions, one for each fault: $p(F_j = \mu_j) = \frac{1}{B(\alpha_j^0, \beta_j^0)} \mu_j^{\alpha_j^0 - 1} (1 - \mu_j)^{\beta_j^0 - 1}$, (3) where $B$ is the beta function. [sent-51, score-0.131]
38 The fan-out for each of these fault rates can be different: some of these fault rates are connected to many observations, while less common ones are connected to fewer. [sent-52, score-0.61]
39 Our goal is to model the posterior distribution P (F |V ) in order to identify hidden faults and track them through time. [sent-53, score-0.536]
40 The model is now completely analogous to the QMR-DT model [14], but instead of the noisy-ORs combining binary random variables, they combine rate variables: $P(V_i = \mathrm{fail}\,|\,F = \mu) = 1 - (1 - r_{i0}) \prod_j (1 - r_{ij} \mu_j)$. (4) [sent-58, score-0.183]
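As a concrete illustration, here is a minimal Python/NumPy sketch of the noisy-OR failure probability (4); the array names, toy sizes, and values are illustrative assumptions, not code from the paper:

```python
import numpy as np

def noisy_or_failure_prob(R, r0, mu):
    """P(V_i = fail | F = mu) = 1 - (1 - r_i0) * prod_j (1 - r_ij * mu_j), eq. (4).

    R  : (n_obs, n_faults) array of connection strengths r_ij (mostly zero)
    r0 : (n_obs,) spontaneous-failure probabilities r_i0
    mu : (n_faults,) fault rates in [0, 1]
    """
    # Accumulate the product in log space for numerical stability.
    log_no_fault = np.log1p(-R * mu[None, :]).sum(axis=1)
    return 1.0 - (1.0 - r0) * np.exp(log_no_fault)

# Toy example: 3 observations, 2 possible faults.
R = np.array([[0.9, 0.0],
              [0.0, 0.5],
              [0.3, 0.3]])
r0 = np.full(3, 1e-3)
mu = np.array([0.2, 0.05])
print(noisy_or_failure_prob(R, r0, mu))  # per-observation failure probabilities
```

In practice the $r_{ij}$ matrix would be stored sparsely, since each request has only a handful of possible causes.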
41 2.1 Approximations to make inference tractable. In order to scale inference up to 10^4 hidden variables and 10^5 observations, we choose a simple, robust approximate inference algorithm: mean-field variational inference [4]. [sent-61, score-0.796]
42 Mean-field variational inference approximates the posterior $P(F|V)$ with a factorized distribution. [sent-62, score-0.367]
43 For inferring fault rates, we choose to approximate $P$ with a product of beta distributions: $Q(F|V) = \prod_j q(F_j|V) = \prod_j \frac{1}{B(\alpha_j, \beta_j)} \mu_j^{\alpha_j - 1} (1 - \mu_j)^{\beta_j - 1}$. (5) [sent-63, score-0.445]
44 Mean-field variational inference maximizes a lower bound on the evidence of the model: $\max_{\alpha, \beta} L = \int Q(\mu|V) \log \frac{P(V|\mu)\, p(\mu)}{Q(\mu|V)}\, d\mu$. (6) [sent-64, score-0.357]
45 This integral can be broken into two terms: a cross-entropy between the approximate posterior and the prior, and an expected log-likelihood of the observations: $\max_{\alpha, \beta} L = -\int Q(\mu|V) \log \frac{Q(\mu|V)}{p(\mu)}\, d\mu + \left\langle \log P(V|F) \right\rangle_Q$. (7) [sent-65, score-0.197]
46 The second term then becomes the sum of a set of log likelihoods, one per observation: $L(V_i) = \begin{cases} \log\left[ 1 - (1 - r_{i0}) \prod_j \left( 1 - r_{ij} \alpha_j / (\alpha_j + \beta_j) \right) \right] & \text{if } V_i = 1 \text{ (failure)}; \\ \log(1 - r_{i0}) + \sum_j \log\left[ 1 - r_{ij} \alpha_j / (\alpha_j + \beta_j) \right] & \text{if } V_i = 0 \text{ (success)}. \end{cases}$ (9) [sent-69, score-0.376]
47 For the Internet diagnosis case, the MF(0) approximation is reasonable: we expect the posterior distribution to be concentrated around its mean, due to the large amount of data that is available. [sent-70, score-0.347]
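A hedged sketch of the MF(0) likelihood (9), which evaluates the expected log-likelihood at the Beta posterior mean $\alpha_j/(\alpha_j + \beta_j)$; names and shapes follow the noisy-OR sketch above and are assumptions:

```python
import numpy as np

def mf0_log_likelihood(V, R, r0, alpha, beta):
    """Per-observation MF(0) log-likelihood L(V_i) of eq. (9).

    V     : (n_obs,) array, 1 for failure, 0 for success
    alpha : (n_faults,) variational Beta parameters alpha_j
    beta  : (n_faults,) variational Beta parameters beta_j
    """
    mean_rate = alpha / (alpha + beta)               # posterior mean of each F_j
    log_terms = np.log1p(-R * mean_rate[None, :])    # log(1 - r_ij * mean_j)
    log_success = np.log1p(-r0) + log_terms.sum(axis=1)
    log_failure = np.log1p(-np.exp(log_success))     # log(1 - P(success))
    return np.where(V == 1, log_failure, log_success)
```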
48 3 Variational inference by stochastic gradient descent. In order to apply unconstrained optimization algorithms to minimize the variational cost function $C = \sum_j D_{\mathrm{KL}}(q_j \| p_j) - \sum_i L(V_i)$ (10), we need to transform the variables: only positive $\alpha_j$ and $\beta_j$ are valid, so we parameterize them by $\alpha_j = e^{a_j}$, $\beta_j = e^{b_j}$, [sent-73, score-0.486]
49 and the gradient computation becomes $\frac{\partial C}{\partial a_j} = \alpha_j \left( \frac{\partial D_{\mathrm{KL}}(q_j \| p_j)}{\partial \alpha_j} - \sum_i \frac{\partial L(V_i)}{\partial \alpha_j} \right)$. (11) [sent-74, score-0.127]
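The $D_{\mathrm{KL}}(q_j \| p_j)$ term in (10) and (11) is a KL divergence between two Beta distributions, which has a closed form in log-beta and digamma functions. A minimal sketch using SciPy follows; the full cost $C$ then sums these KL terms and subtracts the summed $L(V_i)$ terms from the sketch above:

```python
import numpy as np
from scipy.special import betaln, digamma

def beta_kl(a_q, b_q, a_p, b_p):
    """Elementwise KL( Beta(a_q, b_q) || Beta(a_p, b_p) ), one value per fault j."""
    return (betaln(a_p, b_p) - betaln(a_q, b_q)
            + (a_q - a_p) * digamma(a_q)
            + (b_q - b_p) * digamma(b_q)
            + (a_p - a_q + b_p - b_q) * digamma(a_q + b_q))

# Example: KL between a variational posterior and a Beta(1, 100) prior.
print(beta_kl(np.array([2.0]), np.array([50.0]), 1.0, 100.0))
```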
50 Note that this gradient computation can be quite computationally expensive, given that the sum over $i$ runs over all of the observations. [sent-76, score-0.127]
51 For Internet diagnosis, we can decompose the observation stream into blocks, where the size of the block is determined by how quickly the underlying rates of faults change, and how finely we want to sample those rates. [sent-77, score-0.63]
52 We typically use blocks of 100,000 observations, which can make the computation of the gradient expensive. [sent-78, score-0.168]
53 Further, we repeat the inference over and over again, on thousands of blocks of data: we prefer a fast optimization procedure over a highly accurate one. [sent-79, score-0.22]
54 Therefore, we investigated the use of stochastic gradient descent for optimizing the variational cost function. [sent-80, score-0.554]
55 The details of stochastic gradient descent are shown in Algorithm 1. [sent-83, score-0.332]
56 For example, the sign of a single $L(V_i)$ gradient term depends only on the value of $V_i$. [sent-85, score-0.127]
57 In order to reduce the noise in the estimate, we use momentum [15]: we exponentially smooth the gradient with a first-order filter before applying it to the state variables. [sent-86, score-0.224]
58 We use a heavy momentum (0.99), in order to both react quickly to changes in the fault rate and to smooth out noise. [sent-90, score-0.127]
59 Stochastic gradient descent can be used as a purely on-line method (where each data point is seen only once), setting the “number of epochs” in Algorithm 1 to 1. [sent-91, score-0.248]
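Below is a minimal sketch of this stochastic gradient loop: log-parameterized variables $(a_j, b_j)$, noisy per-block gradients, and first-order momentum smoothing. The learning rate, momentum constant, and toy quadratic objective are assumptions for illustration, not the paper's Algorithm 1 verbatim:

```python
import numpy as np

def sgd_momentum(grad_fn, a, b, batches, n_epochs=1, lr=0.05, momentum=0.9):
    """Minimize a cost over (a_j, b_j) = (log alpha_j, log beta_j).

    grad_fn(batch, a, b) returns noisy gradients (g_a, g_b) for one block of
    observations. Gradients are exponentially smoothed before each update;
    n_epochs=1 gives the purely on-line setting described above.
    """
    v_a, v_b = np.zeros_like(a), np.zeros_like(b)
    for _ in range(n_epochs):
        for batch in batches:
            g_a, g_b = grad_fn(batch, a, b)
            v_a = momentum * v_a + (1.0 - momentum) * g_a  # first-order filter
            v_b = momentum * v_b + (1.0 - momentum) * g_b
            a = a - lr * v_a
            b = b - lr * v_b
    return a, b

# Toy usage: noisy gradients of a quadratic stand in for dC/da_j, dC/db_j.
rng = np.random.default_rng(0)
noisy_grad = lambda batch, a, b: (2 * a + 0.1 * rng.standard_normal(a.shape),
                                  2 * b + 0.1 * rng.standard_normal(b.shape))
a, b = sgd_momentum(noisy_grad, np.ones(4), np.ones(4), batches=range(20), n_epochs=5)
print(a, b)  # both should approach zero
```

When streaming thousands of blocks, the smoothed-gradient state (v_a, v_b) would be carried forward from block to block rather than reset, matching the carried-forward momentum variables described for the real data in Section 4.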
60 3.1 Other possible approaches. We considered and tested several other approaches to solving the approximate inference problem. [sent-94, score-0.171]
61 Jaakkola and Jordan propose a variational inference method for bipartite noisy-OR networks [3], where one variational parameter is introduced to unlink one observation from the network. [sent-95, score-0.689]
62 We typically have far more observations than possible faults: this previous approach would have forced us to solve very large optimization problems (with 100,000 parameters). [sent-96, score-0.112]
63 We originally optimized the variational cost function (10) with both BFGS and the trust-region algorithm in the Matlab optimization toolbox. [sent-98, score-0.251]
64 This turned out to be far worse than stochastic gradient descent. [sent-99, score-0.211]
65 We report on the L-BFGS performance, below: it is within 4x the speed of the stochastic gradient descent. [sent-101, score-0.267]
66 We experimented with Metropolis-Hastings to sample from the posterior, using a Gaussian random walk in $(a_j, b_j)$. [sent-102, score-0.133]
67 In the Internet diagnosis network, the fan-out is quite high (because a single fault affects many observations). [sent-105, score-0.57]
68 Finally, we did not try the idea of learning to predict the posterior from the observations by sampling from the generative model and learning the reverse mapping [7]. [sent-109, score-0.128]
69 For Internet diagnosis, we do not know the structure of the graphical model for a block of data ahead of time: the structure depends on the metadata for the requests in the log. [sent-110, score-0.227]
70 4 Results. We test the approximations and optimization methods used for Internet diagnosis on both synthetic and real data. [sent-112, score-0.371]
71 4.1 Synthetic data with known hidden state. Testing the accuracy of approximate inference is very difficult, because, for large graphical models, the true posterior distribution is intractable. [sent-114, score-0.329]
72 We start by generating fault rates from a prior (here, 2000 faults drawn from Beta(5e3,1)). [sent-116, score-0.767]
73 We randomly generate connections from faults to observations, with probability 5 × 10^-3. [sent-117, score-0.462]
74 Each connection has a strength rij drawn randomly from [0, 1]. [sent-118, score-0.153]
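A sketch of this synthetic generator under the stated settings, scaled down for illustration; the observation count and spontaneous-failure rate are assumptions, and the prior is written as in the text (Beta(5e3, 1)), though a prior concentrated near zero such as Beta(5e-3, 1) may be what is intended:

```python
import numpy as np

rng = np.random.default_rng(0)
n_faults, n_obs = 200, 5000                  # scaled-down toy sizes (paper: 2000 faults)

mu = rng.beta(5e3, 1.0, size=n_faults)       # fault rates from the prior as written
conn = rng.random((n_obs, n_faults)) < 5e-3  # sparse fault -> observation connections
R = np.where(conn, rng.random((n_obs, n_faults)), 0.0)  # strengths r_ij ~ U[0, 1]
r0 = np.full(n_obs, 1e-4)                    # spontaneous-failure probability (assumed)

# Sample per-request manifestations D_ij, then observed outcomes V_i via noisy-OR.
D = rng.random((n_obs, n_faults)) < (conn * mu[None, :])
p_fail = 1.0 - (1.0 - r0) * np.prod(1.0 - R * D, axis=1)
V = (rng.random(n_obs) < p_fail).astype(int)
print(V.mean())                              # empirical failure rate
```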
75 Given that the number of observations is much larger than the number of faults, we expect that the posterior distribution should tightly cluster around the rate that generated the observations. [sent-121, score-0.158]
76 Differences between the true rate and the mean of the approximate posterior should reflect inaccuracies in the estimation. [sent-122, score-0.121]
77 There is a slight systematic bias in the stochastic gradient descent, as compared to L-BFGS. [sent-146, score-0.211]
78 However, the improvement in speed shown in Table 1 is worth the loss of accuracy: we need inference to be as fast as possible to scale to billions of samples. [sent-147, score-0.32]
79 4.2 Real data from web server logs. We then tested the algorithm on real data from a major web service. [sent-155, score-0.445]
80 Each observation consists of a success or failure of a single HTTP request. [sent-156, score-0.154]
81 We selected 18848 possible faults that occur frequently in the dataset, including the web server that received the request, the autonomous system that originated the request, and the "user agent" (browser or robot) that generated the request. [sent-157, score-0.775]
82 We have been analyzing HTTP logs collected over several months with the stochastic gradient descent algorithm. [sent-158, score-0.429]
83 We selected a 3.5 hour window containing an anomalously high rate of failures (the span shown in Figure 4), in order to demonstrate that our algorithm can help us understand the cause of failures based on observations in a real-world environment. [sent-160, score-0.228]
84 We broke the time series of observations into blocks of 100,000 observations, and inferred hidden rates for each block. [sent-161, score-0.19]
85 Thus, for stochastic gradient descent, momentum variables were carried forward from block to block. [sent-163, score-0.346]
86 [Figure 4: The inferred fault rate for two Autonomous Systems, as a function of time, from 8:00 PM to 11:21 PM.] [sent-173, score-0.298]
87 In this figure, we used stochastic gradient descent and a Beta(0. [sent-176, score-0.332]
88 The figure shows the only two faults whose probability went higher than 0. [sent-178, score-0.462]
89 This could be due to a router that is in common between them, or perhaps a denial of service attack that originated in that city. [sent-180, score-0.232]
90 This allows us to go through logs containing billions of entries in a matter of hours. [sent-183, score-0.187]
91 5 Conclusions. This paper presents high-speed variational inference to diagnose problems on the scale of the Internet. [sent-184, score-0.419]
92 Given observations at a web server, the diagnosis can determine whether a web server needs rebooting, whether part of the Internet is broken, or whether the web server is compatible with a browser or user agent. [sent-185, score-1.044]
93 In order to scale inference up to Internet-sized diagnosis problems, we make several approximations. [sent-186, score-0.451]
94 First, we use mean-field variational inference to approximate the posterior distribution. [sent-187, score-0.413]
95 The expected log likelihood inside of the variational cost function is approximated with the MF(0) approximation. [sent-188, score-0.257]
96 Finally, we use stochastic gradient descent to perform the variational optimization. [sent-189, score-0.529]
97 We are currently using variational stochastic gradient descent to analyze logs that contain billions of requests. [sent-190, score-0.716]
98 We are not aware of any other applications of variational inference at this scale. [sent-191, score-0.322]
99 Future publications will include conclusions of such analysis, and implications for web services and the Internet at large. [sent-192, score-0.116]
100 Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. [sent-298, score-0.302]
wordName wordTfidf (topN-words)
[('faults', 0.462), ('diagnosis', 0.302), ('fault', 0.268), ('internet', 0.233), ('variational', 0.197), ('rij', 0.153), ('vi', 0.151), ('dij', 0.146), ('aj', 0.136), ('bj', 0.133), ('beta', 0.131), ('gradient', 0.127), ('inference', 0.125), ('providers', 0.122), ('descent', 0.121), ('pm', 0.119), ('request', 0.119), ('server', 0.116), ('web', 0.116), ('failure', 0.116), ('logs', 0.097), ('momentum', 0.097), ('requests', 0.09), ('billions', 0.09), ('content', 0.086), ('qj', 0.085), ('stochastic', 0.084), ('observations', 0.083), ('dkl', 0.081), ('zj', 0.08), ('pj', 0.078), ('diagnose', 0.073), ('provider', 0.073), ('shwe', 0.073), ('symptoms', 0.073), ('fj', 0.072), ('failures', 0.065), ('mf', 0.065), ('sgd', 0.064), ('yj', 0.06), ('causes', 0.057), ('graphical', 0.057), ('speed', 0.056), ('bipartite', 0.054), ('fail', 0.054), ('jaakkola', 0.052), ('user', 0.052), ('jordan', 0.051), ('cause', 0.05), ('network', 0.049), ('ases', 0.049), ('attack', 0.049), ('denial', 0.049), ('intermittent', 0.049), ('router', 0.049), ('unlink', 0.049), ('service', 0.049), ('connectivity', 0.048), ('approximate', 0.046), ('posterior', 0.045), ('autonomous', 0.045), ('epochs', 0.043), ('users', 0.043), ('metadata', 0.042), ('blocks', 0.041), ('synthetic', 0.04), ('eld', 0.04), ('uai', 0.039), ('observation', 0.038), ('block', 0.038), ('di', 0.038), ('rates', 0.037), ('originated', 0.036), ('broken', 0.036), ('log', 0.035), ('computers', 0.032), ('infrastructure', 0.032), ('reliability', 0.032), ('optimizer', 0.031), ('lose', 0.031), ('propagation', 0.03), ('rate', 0.03), ('nocedal', 0.03), ('bayesian', 0.029), ('networks', 0.029), ('erent', 0.029), ('quickly', 0.029), ('optimization', 0.029), ('hidden', 0.029), ('cpu', 0.027), ('compatible', 0.027), ('accuracy', 0.027), ('http', 0.027), ('belief', 0.027), ('ng', 0.027), ('underlying', 0.026), ('cost', 0.025), ('fast', 0.025), ('loopy', 0.025), ('scale', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
Author: Emre Kiciman, David Maltz, John C. Platt
Abstract: Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use approximate Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 104 possible faults from 105 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a mean-field variational approximation and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service. The inference is fast enough to analyze network logs with billions of entries in a matter of hours. 1
2 0.12767012 61 nips-2007-Convex Clustering with Exemplar-Based Models
Author: Danial Lashkari, Polina Golland
Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1
3 0.096013263 47 nips-2007-Collapsed Variational Inference for HDP
Author: Yee W. Teh, Kenichi Kurihara, Max Welling
Abstract: A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy. 1
4 0.094413236 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
Author: Nicolas L. Roux, Pierre-antoine Manzagol, Yoshua Bengio
Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.
5 0.089454323 213 nips-2007-Variational Inference for Diffusion Processes
Author: Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, John S. Shawe-taylor
Abstract: Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multimodal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. We also show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy. 1
6 0.085167661 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
7 0.083823226 22 nips-2007-Agreement-Based Learning
8 0.077031821 187 nips-2007-Structured Learning with Approximate Inference
9 0.068980351 189 nips-2007-Supervised Topic Models
10 0.06706398 214 nips-2007-Variational inference for Markov jump processes
11 0.058337048 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
12 0.057666626 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
13 0.057613578 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
14 0.056523561 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
15 0.054977484 145 nips-2007-On Sparsity and Overcompleteness in Image Models
16 0.054191578 84 nips-2007-Expectation Maximization and Posterior Constraints
17 0.052148126 97 nips-2007-Hidden Common Cause Relations in Relational Learning
18 0.051623546 182 nips-2007-Sparse deep belief net model for visual area V2
19 0.04923543 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
20 0.047070954 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
topicId topicWeight
[(0, -0.168), (1, 0.026), (2, -0.022), (3, -0.069), (4, -0.007), (5, -0.084), (6, -0.076), (7, -0.036), (8, 0.005), (9, -0.079), (10, 0.058), (11, 0.001), (12, 0.028), (13, -0.021), (14, 0.012), (15, 0.05), (16, -0.052), (17, 0.05), (18, -0.055), (19, 0.042), (20, 0.063), (21, -0.064), (22, -0.097), (23, 0.032), (24, -0.03), (25, 0.053), (26, -0.036), (27, -0.002), (28, 0.132), (29, -0.012), (30, -0.055), (31, -0.105), (32, 0.118), (33, -0.154), (34, -0.033), (35, 0.085), (36, 0.051), (37, -0.087), (38, 0.182), (39, -0.106), (40, -0.016), (41, 0.001), (42, 0.169), (43, -0.084), (44, 0.009), (45, 0.007), (46, 0.001), (47, -0.181), (48, 0.045), (49, 0.104)]
simIndex simValue paperId paperTitle
same-paper 1 0.94943249 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
Author: Emre Kiciman, David Maltz, John C. Platt
Abstract: Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use approximate Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 104 possible faults from 105 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a mean-field variational approximation and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service. The inference is fast enough to analyze network logs with billions of entries in a matter of hours. 1
2 0.52132767 47 nips-2007-Collapsed Variational Inference for HDP
Author: Yee W. Teh, Kenichi Kurihara, Max Welling
Abstract: A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy. 1
3 0.48243645 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
4 0.47107047 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
Author: Nicolas L. Roux, Pierre-antoine Manzagol, Yoshua Bengio
Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.
5 0.46821359 214 nips-2007-Variational inference for Markov jump processes
Author: Manfred Opper, Guido Sanguinetti
Abstract: Markov jump processes play an important role in a large number of application domains. However, realistic systems are analytically intractable and they have traditionally been analysed using simulation based techniques, which do not provide a framework for statistical inference. We propose a mean field approximation to perform posterior inference and parameter estimation. The approximation allows a practical solution to the inference problem, while still retaining a good degree of accuracy. We illustrate our approach on two biologically motivated systems.
6 0.46617967 213 nips-2007-Variational Inference for Diffusion Processes
7 0.4175818 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
8 0.37896606 40 nips-2007-Bundle Methods for Machine Learning
9 0.37455776 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta
10 0.36702794 61 nips-2007-Convex Clustering with Exemplar-Based Models
11 0.36104435 84 nips-2007-Expectation Maximization and Posterior Constraints
12 0.35882288 187 nips-2007-Structured Learning with Approximate Inference
13 0.34625694 63 nips-2007-Convex Relaxations of Latent Variable Training
14 0.33098054 19 nips-2007-Active Preference Learning with Discrete Choice Data
15 0.33053192 203 nips-2007-The rat as particle filter
16 0.32095596 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
17 0.32083809 59 nips-2007-Continuous Time Particle Filtering for fMRI
18 0.31153613 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
19 0.31138712 86 nips-2007-Exponential Family Predictive Representations of State
20 0.30666268 141 nips-2007-New Outer Bounds on the Marginal Polytope
topicId topicWeight
[(5, 0.042), (6, 0.294), (13, 0.044), (16, 0.056), (18, 0.034), (21, 0.072), (31, 0.036), (34, 0.022), (35, 0.031), (47, 0.082), (83, 0.117), (85, 0.021), (87, 0.021), (90, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.7630595 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
Author: Emre Kiciman, David Maltz, John C. Platt
Abstract: Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use approximate Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 104 possible faults from 105 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a mean-field variational approximation and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service. The inference is fast enough to analyze network logs with billions of entries in a matter of hours. 1
2 0.5265944 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
Author: Maneesh Sahani, Byron M. Yu, John P. Cunningham, Krishna V. Shenoy
Abstract: Neural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train’s underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators. 1
3 0.52243876 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra
Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.
4 0.52204341 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1
5 0.51988721 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high dimensional measurements of more than thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1
6 0.51972163 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
7 0.51963252 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
8 0.51816583 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
9 0.5177182 63 nips-2007-Convex Relaxations of Latent Variable Training
10 0.51729822 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
11 0.51709765 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
12 0.51707315 195 nips-2007-The Generalized FITC Approximation
13 0.51661587 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
14 0.51625192 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
15 0.51541817 86 nips-2007-Exponential Family Predictive Representations of State
16 0.51283151 174 nips-2007-Selecting Observations against Adversarial Objectives
17 0.511168 158 nips-2007-Probabilistic Matrix Factorization
18 0.5110386 115 nips-2007-Learning the 2-D Topology of Images
19 0.51053631 156 nips-2007-Predictive Matrix-Variate t Models
20 0.50979507 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods