nips nips2011 nips2011-262 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The ℓ1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The ℓ1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. [sent-5, score-0.285]
2 In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. [sent-7, score-0.13]
3 We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods. [sent-8, score-0.083]
4 Given i.i.d. samples {y1 , . . . , yn } from a p-variate Gaussian distribution, so that yi ∼ N (µ, Σ), the task is to estimate its inverse covariance matrix Σ−1 , also referred to as the precision or concentration matrix. [sent-15, score-0.215]
5 The non-zero pattern of this inverse covariance matrix Σ−1 can be shown to correspond to the underlying graph structure of the GMRF. [sent-16, score-0.215]
6 Accordingly, a line of recent papers [2, 8, 20] has proposed an estimator that minimizes the Gaussian negative log-likelihood regularized by the ℓ1 norm of the (off-diagonal) entries of the inverse covariance matrix. [sent-18, score-0.245]
7 In [8, 2] a block coordinate descent method has been proposed which is called the graphical lasso or GLASSO for short. [sent-22, score-0.475]
8 Other recent algorithms proposed for this problem include PSM, which uses projected subgradients [5]; ALM, which uses alternating linearization [14]; IPM, an inexact interior point method [11]; and SINCO, a greedy coordinate descent method [15]. [sent-23, score-0.435]
9 For typical high-dimensional statistical problems, optimization methods generally suffer sub-linear rates of convergence [1]. [sent-24, score-0.108]
10 This would be too expensive for the Gaussian MLE problem, since the number of matrix entries scales quadratically with the number of nodes. [sent-25, score-0.084]
11 One can show at most linear (i.e. geometric) rates of convergence for the state-of-the-art methods listed above. [sent-28, score-0.108]
12 However, even these linear rates become infeasible when the problem size is very large, with the number of nodes in the thousands and the number of matrix entries to be estimated in the millions. [sent-29, score-0.105]
13 Here we ask the question: can we obtain superlinear rates of convergence for the optimization problem underlying the ℓ1 regularized Gaussian MLE? [sent-30, score-0.252]
14 The caveat is that they have at most linear rates of convergence [3]. [sent-33, score-0.108]
15 For superlinear rates, one has to consider second-order methods which at least in part use the Hessian of the objective function. [sent-34, score-0.106]
16 This barrier property would be lost under quadratic approximations so there is a danger that Newton-like updates will not yield positive-definite matrices, unless one explicitly enforces such a constraint in some manner. [sent-38, score-0.158]
17 We perform Newton steps that use iterative quadratic approximations of the Gaussian negative log-likelihood, but with three innovations that enable finessing the caveats detailed above. [sent-41, score-0.215]
18 As in recent methods [12, 9], we build on the observation that the Newton direction computation is a Lasso problem, and perform iterative coordinate descent to solve this Lasso problem. [sent-43, score-0.421]
19 However, the naive approach has an update cost of O(p^2) for performing each coordinate descent update in the inner loop, which makes it infeasible for this problem. [sent-44, score-0.611]
20 Secondly, we use an Armijo-rule based step-size selection to obtain a step size that ensures sufficient descent and positive-definiteness of the next iterate. [sent-46, score-0.197]
21 Thirdly, we use the form of the stationary condition characterizing the optimal solution to then focus the Newton direction computation on a small subset of free variables, in a manner that preserves the strong convergence guarantees of second-order descent. [sent-47, score-0.293]
22 In Section 3, we present our algorithm that combines quadratic approximation, Newton’s method and coordinate descent. [sent-49, score-0.268]
23 Given n samples {y1 , . . . , yn } of this random vector, the sample covariance matrix can be written as S = (1/n) Σ_{i=1}^{n} (yi − µ̂)(yi − µ̂)^T , where µ̂ = (1/n) Σ_{i=1}^{n} yi . [sent-57, score-0.131]
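As a concrete anchor for the notation, here is a minimal NumPy sketch of this sample covariance, assuming the n samples are stacked as the rows of a matrix Y:

```python
import numpy as np

def sample_covariance(Y):
    """S = (1/n) * sum_i (y_i - mu)(y_i - mu)^T for samples y_i in the rows of Y."""
    n = Y.shape[0]
    mu = Y.mean(axis=0)   # mu-hat = (1/n) * sum_i y_i
    Z = Y - mu
    return Z.T @ Z / n
```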
24 The addition of such ℓ1 regularization promotes sparsity in the inverse covariance matrix, and thus encourages sparse graphical model structure. [sent-63, score-0.258]
25 3 Quadratic Approximation Method. Our approach is based on computing iterative quadratic approximations to the regularized Gaussian MLE objective f (X) in (2). [sent-65, score-0.23]
26 This objective function f can be seen to comprise two parts, f (X) ≡ g(X) + h(X), where g(X) = − log det X + tr(SX) and h(X) = λ‖X‖1 . [sent-66, score-0.131]
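A sketch of evaluating f(X) = g(X) + h(X) directly (the helper name is ours; slogdet guards against leaving the positive-definite cone, and the ℓ1 penalty here is applied to all entries for simplicity):

```python
import numpy as np

def objective(X, S, lam):
    """f(X) = -log det X + tr(SX) + lam * ||X||_1."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        return np.inf  # X is not positive definite
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()
```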
27 Following the standard approach [17, 21] to building a quadratic approximation around any iterate Xt for such composite functions, we build the second-order Taylor expansion of the smooth component g(X). [sent-68, score-0.155]
28 The second-order expansion of the log-determinant (see e.g. [3]) is given by log det(Xt + ∆) ≈ log det Xt + tr(Xt^{-1} ∆) − (1/2) tr(Xt^{-1} ∆ Xt^{-1} ∆). [sent-71, score-0.099]
29 We introduce Wt = Xt^{-1} and write the second-order approximation ḡXt (∆) to g(Xt + ∆) as ḡXt (∆) = tr((S − Wt )∆) + (1/2) tr(Wt ∆Wt ∆) − log det Xt + tr(SXt ). (4) [sent-72, score-0.099]
30 The Newton direction Dt for the entire objective f (X) can then be written as the solution of the regularized quadratic program Dt = arg min∆ ḡXt (∆) + h(Xt + ∆). (5) [sent-73, score-0.303]
31 This Newton direction can be used to compute iterative estimates {Xt } for solving the optimization problem in (2). [sent-74, score-0.111]
32 In the sequel, we will detail three innovations which make this approach feasible. [sent-75, score-0.116]
33 As in recent methods [12], we build on the observation that the Newton direction computation is a Lasso problem, and perform iterative coordinate descent to find its solution. [sent-77, score-0.421]
34 However, the naive approach has an update cost of O(p^2) for performing each coordinate descent update in the inner loop, which makes it infeasible for this problem. [sent-78, score-0.611]
35 Secondly, we use an Armijo-rule based step-size selection to obtain a step size that ensures sufficient descent and positive-definiteness of the next iterate. [sent-80, score-0.197]
36 Thirdly, we use the form of the stationary condition characterizing the optimal solution to then focus the Newton direction computation on a small subset of free variables, in a manner that preserves the strong convergence guarantees of second-order descent. [sent-81, score-0.293]
37 It is straightforward to verify that for a symmetric matrix ∆ we have tr(Wt ∆Wt ∆) = vec(∆)^T (Wt ⊗ Wt ) vec(∆), where ⊗ denotes the Kronecker product and vec(X) is the vectorized listing of the elements of matrix X. [sent-87, score-0.08]
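The identity is easy to check numerically; in the sketch below, note that for a symmetric ∆ the row-major ravel coincides with the column-major vec, so NumPy's ravel can stand in for vec:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p)); W = A @ A.T      # symmetric positive definite
B = rng.standard_normal((p, p)); Delta = B + B.T  # symmetric

lhs = np.trace(W @ Delta @ W @ Delta)
rhs = Delta.ravel() @ np.kron(W, W) @ Delta.ravel()
assert np.isclose(lhs, rhs)
```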
38 In [7, 18] the authors show that coordinate descent methods are very efficient for solving lasso type problems. [sent-88, score-0.398]
39 However, an obvious way to update each element of ∆ to solve for the Newton direction in (5) needs O(p^2) floating point operations, since Q := Wt ⊗ Wt is a p^2 × p^2 matrix, thus yielding an O(p^4) procedure for approximating the Newton direction. [sent-89, score-0.152]
40 As we show below, our implementation reduces the cost of one variable update to O(p) by exploiting the structure of Q or in other words the specific form of the second order term tr(Wt ∆Wt ∆). [sent-90, score-0.079]
41 Furthermore, we omit the use of a separate index for the coordinate descent updates. [sent-94, score-0.31]
42 Thus, we simply use D to denote the current iterate approximating the Newton direction and D′ for the updated direction. [sent-95, score-0.106]
43 Consider the coordinate descent update for the variable Xij , with i < j, that preserves symmetry: D′ = D + µ(ei ej^T + ej ei^T ). [sent-96, score-0.422]
44 The quadratic term can be rewritten using tr(AB) = tr(BA) and the symmetry of D and W to yield: tr(W D′ W D′ ) = tr(W DW D) + 4µ wi^T Dwj + 2µ^2 (Wij^2 + Wii Wjj ). (7) [sent-101, score-0.09]
45 In order to compute the single variable update we seek the minimum of the following function of µ: (1/2)(Wij^2 + Wii Wjj )µ^2 + (Sij − Wij + wi^T Dwj )µ + λ|Xij + Dij + µ|. (8) [sent-102, score-0.127]
46 Letting a = Wij^2 + Wii Wjj , b = Sij − Wij + wi^T Dwj , and c = Xij + Dij , the minimum is achieved for µ = −c + S(c − b/a, λ/a), (9) where S(z, r) = sign(z) max{|z| − r, 0} is the soft-thresholding function. [sent-103, score-0.112]
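Putting (7)-(9) together with the U = DW caching described in the next sentences gives a single O(p) coordinate update. The sketch below uses our own variable names, not the paper's released code, and shows the off-diagonal case i < j; the diagonal case is analogous:

```python
import numpy as np

def soft_threshold(z, r):
    return np.sign(z) * max(abs(z) - r, 0.0)

def coordinate_update(i, j, W, S, X, D, U, lam):
    """One update of the Newton-direction subproblem at coordinate (i, j), i < j.
    U caches D @ W, so w_i^T D w_j = W[i, :] @ U[:, j] costs O(p)."""
    a = W[i, j] ** 2 + W[i, i] * W[j, j]
    b = S[i, j] - W[i, j] + W[i, :] @ U[:, j]
    c = X[i, j] + D[i, j]
    mu = -c + soft_threshold(c - b / a, lam / a)   # closed form (9)
    if mu != 0.0:
        D[i, j] += mu
        D[j, i] = D[i, j]
        U[i, :] += mu * W[j, :]  # only rows i and j of U = D @ W change: O(p) each
        U[j, :] += mu * W[i, :]
    return mu
```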
47 Instead, we maintain U = DW by updating two rows of the matrix U for every variable update in D, costing O(p) flops, and then compute wi^T uj also using O(p) flops. [sent-107, score-0.24]
48 Another way to view this arrangement is that we maintain the decomposition W DW = Σ_{k=1}^{p} wk uk^T throughout the process by storing the uk vectors, allowing O(p) computation of update (9). [sent-108, score-0.118]
49 In order to maintain the matrix U we also need to update two coordinates of each uk when Dij is modified. [sent-109, score-0.119]
50 For instance, if we are starting from a diagonal matrix X0 , the terms wi^T Dwj equal Dij /((X0 )ii (X0 )jj ), which are independent of each other, implying that we need to update each variable according to (9) only once, and the resulting D will be the optimum of (5). [sent-112, score-0.199]
51 3.2 Computing the Step Size. Following the computation of the Newton direction Dt , we need to find a step size α ∈ (0, 1] that ensures positive definiteness of the next iterate Xt + αDt and sufficient decrease in the objective function. [sent-115, score-0.173]
52 Following the line search and the Newton step update Xt+1 = Xt + αDt , we efficiently compute Wt+1 = Xt+1^{-1} by reusing the Cholesky decomposition of Xt+1 . [sent-124, score-0.079]
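A sketch of this step-size search, assuming an Armijo-style sufficient-decrease test on the composite objective (β and σ are backtracking constants of our choosing):

```python
import numpy as np

def line_search(X, D, S, W, lam, f_old, beta=0.5, sigma=1e-3):
    """Backtrack alpha in (0, 1] until X + alpha*D is positive definite and
    f decreases by at least sigma * alpha * delta."""
    delta = np.trace((S - W) @ D) + lam * (np.abs(X + D).sum() - np.abs(X).sum())
    alpha = 1.0
    while alpha > 1e-12:
        X_new = X + alpha * D
        try:
            L = np.linalg.cholesky(X_new)   # fails iff X_new is not positive definite
        except np.linalg.LinAlgError:
            alpha *= beta
            continue
        logdet = 2.0 * np.log(np.diag(L)).sum()
        f_new = -logdet + np.trace(S @ X_new) + lam * np.abs(X_new).sum()
        if f_new <= f_old + alpha * sigma * delta:
            return alpha, X_new, L          # L is reusable to form W_{t+1} = X_{t+1}^{-1}
        alpha *= beta
    raise RuntimeError("no feasible step size found")
```

Given the returned Cholesky factor L, W_{t+1} can then be obtained by triangular solves against the identity (e.g. scipy.linalg.cho_solve) rather than a fresh inversion.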
53 3.3 Identifying which variables to update. In this section, we propose a way to select which variables to update that uses the stationary condition of the Gaussian MLE problem. [sent-126, score-0.232]
54 At the start of any outer loop computing the Newton direction, we partition the variables into free and fixed sets based on the value of the gradient. [sent-127, score-0.203]
55 For any Xt and the corresponding fixed and free sets Sfixed , Sfree , the optimized update on the fixed set would not change any of the coordinates. [sent-133, score-0.229]
56 In other words, the solution of the following optimization problem is ∆ = 0: arg min∆ f (Xt + ∆) such that ∆ij = 0 ∀(i, j) ∈ Sfree . [sent-134, score-0.078]
57 Based on the above observation, we perform the inner loop coordinate descent updates restricted to the free set only (to find the Newton direction). [sent-138, score-0.583]
58 This reduces the number of variables over which we perform the coordinate descent from O(p^2) to the number of non-zeros in Xt , which in general is much smaller than p^2 when λ is large and the solution is sparse. [sent-139, score-0.347]
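The partition itself is just an elementwise test on the gradient of the smooth part, ∇g(Xt ) = S − Wt ; a sketch:

```python
import numpy as np

def free_mask(X, S, W, lam):
    """(i, j) is fixed when X_ij = 0 and |(S - W)_ij| < lam, i.e. the coordinatewise
    optimality condition already holds there; everything else is free."""
    G = S - W  # gradient of g(X) = -log det X + tr(SX)
    return (X != 0) | (np.abs(G) >= lam)
```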
59 We have observed huge computational gains from this modification, and indeed in our main theorem we show the superlinear convergence rate for the algorithm that includes this heuristic. [sent-140, score-0.151]
60 The attractive facet of this modification is that it leverages the sparsity of the solution and intermediate iterates in a manner that falls within a block coordinate descent framework. [sent-141, score-0.447]
61 Specifically, suppose, as detailed above, that at any outer-loop Newton iteration we partition the variables into the fixed and free sets, and then first perform a Newton update restricted to the fixed block, followed by a Newton update on the free block. [sent-142, score-0.503]
62 According to Lemma 1 a Newton update restricted to the fixed block does not result in any changes. [sent-143, score-0.188]
63 In other words, performing the inner loop coordinate descent updates restricted to the free set is equivalent to two block Newton steps restricted to the fixed and free sets consecutively. [sent-144, score-0.802]
64 Note further, that the union of the free and fixed sets is the set of all variables, which as we show in the convergence analysis in the appendix, is sufficient to ensure the convergence of the block Newton descent. [sent-145, score-0.341]
65 As the following lemma shows, if the limit of the iterates (the solution of the optimization problem) is sparse, then after a finite number of iterations, the iterates Xt would also have the same sparsity pattern. [sent-148, score-0.158]
66 If for some index pair (i, j), |∇ij g(X∗ )| < λ (so that (X∗ )ij = 0), then there exists a constant t̄ > 0 such that for all t > t̄, the iterates Xt satisfy |∇ij g(Xt )| < λ and (Xt )ij = 0. [sent-151, score-0.188]
67 Note that |∇ij g(X∗ )| < λ implying (X∗ )ij = 0 follows from the optimality condition of (2). [sent-153, score-0.096]
68 1 2 3 4 5 6 7 Algorithm 1: Quadratic Approximation method for Sparse Inverse Covariance Learning (QUIC) Input : Empirical covariance matrix S, scalar λ, initial X0 , inner stopping tolerance Output: Sequence of Xt converging to arg minX 0 f (X), where f (X) = − log det X + tr(SX) + λ X 1 . [sent-160, score-0.308]
Partition the variables into free and fixed sets based on the gradient, see Section 3.3. [sent-166, score-0.147]
Use coordinate descent to find the Newton direction Dt = arg min∆ f̄Xt (Xt + ∆) over the free variable set, see (6) and (9). [sent-168, score-0.531]
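Gluing the pieces together, the outer loop of Algorithm 1 might be organized as below. This is a sketch reusing the helpers from the earlier snippets (objective, free_mask, line_search, coordinate_update); the inner sweep count and the stopping rule are crude choices of ours, not the paper's reference implementation:

```python
import numpy as np

def newton_direction(X, S, W, lam, free, sweeps=10):
    """Inner loop: coordinate descent for (5), restricted to the free set.
    The diagonal update is omitted to keep the sketch short."""
    p = X.shape[0]
    D = np.zeros((p, p))
    U = np.zeros((p, p))                   # maintains U = D @ W
    ii, jj = np.where(np.triu(free, k=1))  # free off-diagonal coordinates, i < j
    for _ in range(sweeps):
        for i, j in zip(ii, jj):
            coordinate_update(i, j, W, S, X, D, U, lam)
    return D

def quic_sketch(S, lam, max_iter=100, tol=1e-6):
    p = S.shape[0]
    X = np.diag(1.0 / np.diag(S))          # a simple positive-definite start
    for _ in range(max_iter):
        W = np.linalg.inv(X)               # W_t = X_t^{-1} (via Cholesky in practice)
        free = free_mask(X, S, W, lam)     # Section 3.3
        D = newton_direction(X, S, W, lam, free)
        alpha, X, _ = line_search(X, D, S, W, lam, objective(X, S, lam))  # Section 3.2
        if np.abs(alpha * D).max() < tol:  # crude stopping rule for the sketch
            break
    return X
```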
4.1 Convergence Guarantee. We build upon the convergence analysis in [17, 21] of the block coordinate gradient descent method applied to composite objectives. [sent-177, score-0.496]
Specifically, [17, 21] consider iterative updates where at each iteration t they update just a block of variables Jt . [sent-178, score-0.266]
73 Note that the condition (12) ensures that each block of variables will be updated at least once every T iterations. [sent-186, score-0.149]
Our Newton steps with the free set modification are a special case of this framework: we set J2t , J2t+1 to be the fixed and free sets respectively. [sent-187, score-0.22]
As shown in Section 3.3, our selection of the fixed sets ensures that a block update restricted to the fixed set would not change any values, since the variables in the fixed set already satisfy the coordinatewise optimality condition. [sent-189, score-0.26]
76 Thus, while our algorithm only explicitly updates the free set block, this is equivalent to updating variables in fixed and free blocks consecutively. [sent-190, score-0.292]
Note that in our case, the smooth component is the log-determinant function g(X) = − log det X + tr(SX), while the non-differentiable separable component is h(X) = λ‖X‖1 . [sent-193, score-0.099]
4.2 Asymptotic Convergence Rate. In addition to convergence, we further show that our algorithm has a quadratic asymptotic convergence rate. [sent-201, score-0.167]
79 From these two assertions, we build on the convergence rate result for constrained Newton methods in [6] to show that our method is quadratically convergent. [sent-208, score-0.121]
• GLASSO: the block coordinate descent method proposed by [8]. [sent-214, score-0.077]
81 • SINCO: the greedy coordinate descent method proposed by [15]. [sent-225, score-0.31]
82 • IPM: An inexact interior point method proposed by [11]. [sent-229, score-0.083]
Since some of the above implementations do not support the generalized regularization term ‖λ ◦ X‖1 , our comparisons use λ‖X‖1 as the regularization term. [sent-236, score-0.122]
Inspection of the available Fortran implementation has revealed that a separate … Table 1: The comparisons on synthetic datasets. [sent-238, score-0.079]
p stands for the dimension, ‖Σ−1 ‖0 indicates the number of nonzeros in the ground truth inverse covariance matrix, ‖X∗ ‖0 is the number of nonzeros in the solution, and ϵ is a specified relative error of the objective value. [sent-239, score-0.327]
Dataset settings (pattern, p, ‖Σ−1 ‖0 ): chain (1000, 2998), (4000, 11998), (10000, 29998); random (1000, 10758), (4000, 41112), (10000, 91410). Parameter settings list λ and ‖X∗ ‖0 ; times (in seconds) are reported for QUIC, ALM, Glasso, PSM and IPM at tolerances 10^{-2} … [sent-243, score-0.096]
5.1 Comparisons on synthetic datasets. We first compare the run times of the different methods on synthetic data. [sent-300, score-0.086]
We generate the two following types of graph structures for the underlying Gaussian Markov Random Fields: • Chain Graphs: The ground truth inverse covariance matrix Σ−1 is set to Σ−1_{i,i−1} = −0.5 and Σ−1_{i,i} = 1.25. [sent-301, score-0.215]
• Graphs with Random Sparsity Structures: We use the procedure mentioned in Example 1 in [11] to generate inverse covariance matrices with random non-zero patterns. [sent-304, score-0.175]
Specifically, we first generate a sparse matrix U with nonzero elements equal to ±1, set Σ−1 to U^T U, and then add a diagonal term to ensure Σ−1 is positive definite. [sent-305, score-0.133]
91 We control the number of nonzeros in U so that the resulting Σ−1 has approximately 10p nonzero elements. [sent-306, score-0.113]
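A sketch of this generator; the ±1 entries and the roughly 10p nonzero target follow the text, while the density knob and the diagonal shift size are our choices:

```python
import numpy as np

def random_sparse_precision(p, seed=0):
    rng = np.random.default_rng(seed)
    U = np.zeros((p, p))
    idx = rng.choice(p * p, size=5 * p, replace=False)    # density knob: tune so that
    U.flat[idx] = rng.choice([-1.0, 1.0], size=idx.size)  # U^T U has ~10p nonzeros
    K = U.T @ U
    K += (0.1 - min(np.linalg.eigvalsh(K)[0], 0.0)) * np.eye(p)  # diagonal shift for PD
    return K
```

Samples as in the next sentence can then be drawn with np.random.multivariate_normal(np.zeros(p), np.linalg.inv(K), size=p // 2).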
Given the inverse covariance matrix Σ−1 , we draw a limited number, n = p/2, of i.i.d. samples from the corresponding Gaussian distribution. [sent-307, score-0.215]
Table 1 shows the timing comparisons on the synthetic datasets. [sent-314, score-0.079]
For chain graphs, we select λ so that the solution has the (approximately) correct number of nonzero elements. [sent-316, score-0.085]
To test the performance of the algorithms under different regularization parameters λ, for the random sparsity pattern we test the speed under two values of λ: one that recovers the correct number of nonzero elements, and one that recovers 5 times that number. [sent-317, score-0.234]
96 Regularization paths for generalized linear models via coordinate descent. [sent-387, score-0.178]
An inexact interior point method for l1-regularized sparse covariance selection. [sent-407, score-0.214]
98 Learning sparse Gaussian Markov networks using a greedy coordinate ascent approach. [sent-429, score-0.218]
99 A coordinate gradient descent method for nonsmooth separable minimization. [sent-443, score-0.31]
100 A coordinate gradient descent method for l1-regularized convex minimization. [sent-471, score-0.31]
wordName wordTfidf (topN-words)
[('newton', 0.371), ('quic', 0.366), ('xt', 0.294), ('ipm', 0.183), ('psm', 0.183), ('alm', 0.181), ('coordinate', 0.178), ('glasso', 0.161), ('sinco', 0.16), ('tr', 0.133), ('descent', 0.132), ('mle', 0.121), ('niteness', 0.121), ('xij', 0.121), ('dwj', 0.114), ('gxt', 0.114), ('wt', 0.111), ('free', 0.11), ('dt', 0.106), ('det', 0.099), ('covariance', 0.091), ('quadratic', 0.09), ('lasso', 0.088), ('inverse', 0.084), ('dij', 0.082), ('update', 0.079), ('block', 0.077), ('convergence', 0.077), ('superlinear', 0.074), ('direction', 0.073), ('regularized', 0.07), ('sx', 0.069), ('resume', 0.069), ('scheinberg', 0.069), ('wjj', 0.069), ('appendix', 0.066), ('ij', 0.064), ('overwhelmingly', 0.06), ('fortran', 0.06), ('nonzeros', 0.06), ('iterates', 0.06), ('loop', 0.056), ('wij', 0.056), ('sf', 0.056), ('wii', 0.056), ('nonzero', 0.053), ('gaussian', 0.051), ('wi', 0.048), ('sec', 0.047), ('inexact', 0.047), ('innovations', 0.047), ('arabidopsis', 0.046), ('estrogen', 0.046), ('fxt', 0.046), ('gmrf', 0.046), ('dw', 0.045), ('quadratically', 0.044), ('discovers', 0.044), ('sij', 0.044), ('synthetic', 0.043), ('regularization', 0.043), ('linearization', 0.042), ('vec', 0.042), ('professor', 0.04), ('costing', 0.04), ('caveats', 0.04), ('ops', 0.04), ('ree', 0.04), ('superlinearly', 0.04), ('matrix', 0.04), ('sparse', 0.04), ('inner', 0.04), ('arrangement', 0.039), ('iterative', 0.038), ('lemma', 0.038), ('ej', 0.038), ('arg', 0.038), ('variables', 0.037), ('thirdly', 0.037), ('interior', 0.036), ('comparisons', 0.036), ('ensures', 0.035), ('updates', 0.035), ('leukemia', 0.035), ('infeasible', 0.034), ('preserves', 0.033), ('barrier', 0.033), ('iterate', 0.033), ('uj', 0.033), ('composite', 0.032), ('secondly', 0.032), ('implying', 0.032), ('chain', 0.032), ('restricted', 0.032), ('objective', 0.032), ('caching', 0.031), ('rates', 0.031), ('code', 0.031), ('rule', 0.03), ('friedman', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The ℓ1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
2 0.18113627 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
Author: Oliver Stegle, Christoph Lippert, Joris M. Mooij, Neil D. Lawrence, Karsten M. Borgwardt
Abstract: Inference in matrix-variate Gaussian models has major applications for multioutput prediction and joint learning of row and column covariances from matrixvariate data. Here, we discuss an approach for efficient inference in such models that explicitly account for iid observation noise. Computational tractability can be retained by exploiting the Kronecker product between row and column covariance matrices. Using this framework, we show how to generalize the Graphical Lasso in order to learn a sparse inverse covariance between features while accounting for a low-rank confounding covariance between samples. We show practical utility on applications to biology, where we model covariances with more than 100,000 dimensions. We find greater accuracy in recovering biological network structures and are able to better reconstruct the confounders. 1
3 0.18041016 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
Author: Inderjit S. Dhillon, Pradeep K. Ravikumar, Ambuj Tewari
Abstract: Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.
4 0.1755666 145 nips-2011-Learning Eigenvectors for Free
Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth
Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret. 1
5 0.17143978 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
6 0.15703684 220 nips-2011-Prediction strategies without loss
7 0.13789849 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.13703927 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
9 0.13497397 271 nips-2011-Statistical Tests for Optimization Efficiency
10 0.13094494 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
11 0.10979495 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
12 0.10885424 258 nips-2011-Sparse Bayesian Multi-Task Learning
13 0.10305926 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
14 0.097635165 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
15 0.095763393 4 nips-2011-A Convergence Analysis of Log-Linear Training
16 0.095743723 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
17 0.091957361 276 nips-2011-Structured sparse coding via lateral inhibition
18 0.085552581 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
19 0.083268613 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
20 0.082520105 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
topicId topicWeight
[(0, 0.235), (1, -0.087), (2, -0.037), (3, -0.198), (4, 0.005), (5, 0.065), (6, 0.056), (7, -0.001), (8, 0.015), (9, 0.022), (10, 0.225), (11, -0.092), (12, 0.02), (13, -0.001), (14, -0.084), (15, 0.113), (16, 0.029), (17, 0.095), (18, 0.079), (19, -0.064), (20, -0.042), (21, 0.096), (22, 0.09), (23, 0.036), (24, -0.086), (25, -0.099), (26, -0.007), (27, -0.072), (28, 0.027), (29, 0.032), (30, -0.049), (31, -0.09), (32, 0.065), (33, 0.002), (34, -0.032), (35, 0.027), (36, -0.046), (37, -0.007), (38, -0.163), (39, 0.062), (40, -0.052), (41, 0.001), (42, -0.023), (43, 0.099), (44, 0.09), (45, -0.058), (46, 0.009), (47, -0.053), (48, 0.16), (49, -0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.95792907 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The ℓ1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
2 0.71377259 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
Author: Tzu-kuo Huang, Jeff G. Schneider
Abstract: Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. 1
3 0.66348106 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
4 0.64599794 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
5 0.64431095 145 nips-2011-Learning Eigenvectors for Free
Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth
Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret. 1
6 0.62848127 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
7 0.59366602 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
8 0.57529503 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
9 0.51938057 220 nips-2011-Prediction strategies without loss
10 0.51650095 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
11 0.51531959 4 nips-2011-A Convergence Analysis of Log-Linear Training
12 0.49553549 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
13 0.4899914 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
14 0.48948434 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
15 0.4878535 271 nips-2011-Statistical Tests for Optimization Efficiency
16 0.47964564 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
17 0.44978726 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
18 0.44797865 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
19 0.44517151 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
20 0.44483551 143 nips-2011-Learning Anchor Planes for Classification
topicId topicWeight
[(0, 0.021), (4, 0.031), (20, 0.024), (26, 0.019), (31, 0.049), (33, 0.015), (43, 0.062), (45, 0.143), (57, 0.019), (74, 0.047), (83, 0.454), (99, 0.045)]
simIndex simValue paperId paperTitle
1 0.94206965 302 nips-2011-Variational Learning for Recurrent Spiking Networks
Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner
Abstract: We derive a plausible learning rule for feedforward, feedback and lateral connections in a recurrent network of spiking neurons. Operating in the context of a generative model for distributions of spike sequences, the learning mechanism is derived from variational inference principles. The synaptic plasticity rules found are interesting in that they are strongly reminiscent of experimental Spike Time Dependent Plasticity, and in that they differ for excitatory and inhibitory neurons. A simulation confirms the method’s applicability to learning both stationary and temporal spike patterns. 1
2 0.93286735 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
Author: Omar Z. Khan, Pascal Poupart, John-mark M. Agosta
Abstract: In this paper, we derive a method to refine a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test’s relative diagnostic value. We demonstrate that consistency with an expert’s test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to find a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario. 1 INTRODUCTION The problem of learning-by-example has the promise to create strong models from a restricted number of cases; certainly humans show the ability to generalize from limited experience. Machine Learning has seen numerous approaches to learning task performance by imitation, going back to some of the approaches to inductive learning from examples [14]. Of particular interest are problemsolving tasks that use a model to infer the source, or cause of a problem from a sequence of investigatory steps or tests. The specific example we adopt is a diagnostic task such as appears in medicine, electro-mechanical fault isolation, customer support and network diagnostics, among others. We define a diagnostic sequence as consisting of the assignment of values to a subset of tests. The diagnostic process embodies the choice of the best next test to execute at each step in the sequence, by measuring the diagnostic value among the set of available tests at each step, that is, the ability of a test to distinguish among the possible causes. One possible implementation with which to carry out this process, the one we apply, is a Bayes network [9]. As with all model-based approaches, provisioning an adequate model can be daunting, resulting in a “knowledge elicitation bottleneck.” A recent approach for easing the bottleneck grew out of the realization that the best time to gain an expert’s insight into the model structure is during the diagnostic process. Recent work in “QueryBased Diagnostics” [1] demonstrated a way to improve model quality by merging model use and model building into a single process. More precisely the expert can take steps to modify the network structure to add or remove nodes or links, interspersed within the diagnostic sequence. In this paper we show how to extend this variety of learning-by-example to include also refinement of model parameters based on the expert’s choice of test, from which we determine constraints. The nature of these constraints, as shown herein, is derived from the value of the tests to distinguish causes, a value referred to informally as value of information [10]. It is the effect of these novel constraints on network parameter learning that is elucidated in this paper. ∗ J. M. Agosta is no longer affiliated with Intel Corporation 1 Conventional statistical learning approaches are not suited to this problem, since the number of cases available from diagnostic sessions is small, and the data from any case is sparse. (Only a fraction of the tests are taken.) But more relevant is that one diagnostic sequence from an expert user represents the true behavior expected of the model, rather than a noisy realization of a case generated by the true model. 
We adopt a Bayesian approach, which offers a principled way to incorporate knowledge (constraints and data, when available) and also consider weakening the constraints, by applying a likelihood to them, so that possibly conflicting constraints can be incorporated consistently. Sec. 2 reviews related work and Sec. 3 provides some background on diagnostic networks and model consistency. Then, Sec. 4 describes an augmented Bayesian network that incorporates constraints implied by an expert’s choice of tests. Some sampling techniques are proposed to find the Maximum a posterior setting of the parameters given the constraints (and any data available). The approach is evaluated in Sec. 5 on synthetic data and a real world manufacturing diagnostic scenario. Finally, Sec. 6 discusses some future work. 2 RELATED WORK Parameter learning for Bayesian networks can be viewed as searching in a high-dimensional space. Adopting constraints on the parameters based on some domain knowledge is a way of pruning this search space and learning the parameters more efficiently, both in terms of data needed and time required. Qualitative probabilistic networks [17] allow qualitative constraints on the parameter space to be specified by experts. For instance, the influence of one variable on another, or the combined influence of multiple variables on another variable [5] leads to linear inequalities on the parameters. Wittig and Jameson [18] explain how to transform the likelihood of violating qualitative constraints into a penalty term to adjust maximum likelihood, which allows gradient ascent and Expectation Maximization (EM) to take into account linear qualitative constraints. Other examples of qualitative constraints include some parameters being larger than others, bounded in a range, within ϵ of each other, etc. Various proposals have been made that exploit such constraints. Altendorf et al. [2] provide an approximate technique based on constrained convex optimization for parameter learning. Niculescu et al. [15] also provide a technique based on constrained optimization with closed form solutions for different classes of constraints. Feelders [6] provides an alternate method based on isotonic regression while Liao and Ji [12] combine gradient descent with EM. de Campos and Ji [4] also use constrained convex optimization, however, they use Dirichlet priors on the parameters to incorporate any additional knowledge. Mao and Lebanon [13] also use Dirichlet priors, but they use probabilistic constraints to allow inaccuracies in the specification of the constraints. A major difference between our technique and previous work is on the type of constraints. Our constraints do not need to be explicitly specified by an expert. Instead, we passively observe the expert and learn from what choices are made and not made [16]. Furthermore, as we shall show later, our constraints are non-convex, preventing the direct application of existing techniques that assume linear or convex functions. We use Beta priors on the parameters, which can easily be extended to Dirichlet priors like previous work. We incorporate constraints in an augmented Bayesian network, similar to Liang et al. [11], though their constraints are on model predictions as opposed to ours which are on the parameters of the network. Finally, we also use the notion of probabilistic constraints to handle potential mistakes made by experts. 
3 3.1 BACKGROUND DIAGNOSTIC BAYES NETWORKS We consider the class of bipartite Bayes networks that are widely used as diagnostic models, though our approach can be used for networks with any structure. The network forms a sparse, directed, causal graph, where arcs go from causes to observable node variables. We use upper case to denote random variables; C for causes, and T for observables (tests). Lower case letters denote values in the domain of a variable, e.g. c ∈ dom(C) = {c, c}, and bold letters denote sets of variables. A ¯ set of marginally independent binary-valued node variables C with distributions Pr(C) represent unobserved causes, and condition the remaining conditionally independent binary-valued test vari2 able nodes T. Each cause conditions one or more tests; likewise each test is conditioned by one or more causes, resulting in a graph with one or more possibly multiply-connected components. The test variable distributions Pr(T |C) incorporate the further modeling assumption of Independence of Causal Influence, the most familiar example being the Noisy-Or model [8]. To keep the exposition simple, we assume that all variables are binary and that conditional distributions are parametrized by the Noisy-Or; however, the algorithms described in the rest of the paper generalize to any discrete non-binary variable models. Conventionally, unobserved tests are ranked in a diagnostic Bayes network by their Value Of Information (VOI) conditioned on tests already observed. To be precise, VOI is the expected gain in utility if the test were to be observed. The complete computation requires a model equivalent to a partially observable Markov decision process. Instead, VOI is commonly approximated by a greedy computation of the Mutual Information between a test and the set of causes [3]. In this case, it is easy to show that Mutual Information is in turn well approximated to second order by the Gini impurity [7] as shown in Equation 1. ] [∑ ∑ GI(C|T ) = Pr(T = t) Pr(C = c|T = t)(1 − Pr(C = c|T = t)) (1) t c We will use the Gini measure as a surrogate for VOI, as a way to rank the best next test in the diagnostic sequence. 3.2 MODEL CONSISTENCY A model that is consistent with an expert would generate Gini impurity rankings consistent with the expert’s diagnostic sequence. We interpret the expert’s test choices as implying constraints on Gini impurity rankings between tests. To that effect, [1] defines the notion of Cause Consistency and Test Consistency, which indicate whether the cause and test orderings induced by the posterior distribution over causes and the VOI of each test agree with an expert’s observed choice. Assuming that the expert greedily chooses the most informative test T ∗ (i.e., test that yields the lowest Gini impurity) at each step, then the model is consistent with the expert’s choices when the following constraints are satisfied: GI(C|T ∗ ) ≤ GI(C|Ti ) ∀i (2) We demonstrate next how to exploit these constraints to refine the Bayes network. 4 MODEL REFINEMENT Consider a simple diagnosis example with two possible causes C1 and C2 and two tests T1 and T2 as shown in Figure 1. To keep the exposition simple, suppose that the priors for each cause are known (generally separate data is available to estimate these), but the conditional distribution of each test is unknown. Using the Noisy-OR parameterizations for the conditional distributions, the number of parameters are linear in the number of parents instead of exponential. 
∏ i i Pr(Ti = true|C) = 1 − (1 − θ0 ) (1 − θj ) (3) j|Cj =true i Here, θ0 = Pr(Ti = true|Cj = f alse ∀j) is the leak probability that Ti will be true when none of i the causes are true and θj = Pr(Ti = true|Cj = true, Ck = f alse ∀k ̸= j) is the link reliability, which indicates the independent contribution of cause Cj to the probability that test Ti will be true. In the rest of this section, we describe how to learn the θ parameters while respecting the constraints implied by test consistency. 4.1 TEST CONSISTENCY CONSTRAINTS Suppose that an expert chooses test T1 instead of test T2 during the diagnostic process. This ordering by the expert implies that the current model (parametrized by the θ’s) must be consistent with the constraint GI(C|T2 ) − GI(C|T1 ) ≥ 0. Using the definition of Gini impurity in Eq. 1, we can rewrite 3 Figure 1: Network with 2 causes and 2 tests Figure 2: Augmented network with parameters and constraints Figure 3: Augmented network extended to handle inaccurate feedback the constraint for the network shown in Fig. 1 as follows: ∑ t1 ( ∑ (Pr(t1 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 Pr(t1 ) − Pr(t1 ) c ,c 1 2 ) ( ) ∑ ∑ (Pr(t2 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 − Pr(t2 ) − ≥0 Pr(t2 ) t c ,c 2 1 2 (4) Furthermore, using the Noisy-Or encoding from Eq. 3, we can rewrite the constraint as a polynomial in the θ’s. This polynomial is non-linear, and in general, not concave. The feasible space may consist of disconnected regions. Fig. 4 shows the surface corresponding to the polynomial for the 2 1 i i case where θ0 = 0 and θ1 = 0.5 for each test i, which leaves θ2 and θ2 as the only free variables. The parameters’ feasible space, satisfying the constraint consists of the two disconnected regions where the surface is positive. 4.2 AUGMENTED BAYES NETWORK Our objective is to learn the θ parameters of diagnostic Bayes networks given test constraints of the form described in Eq. 4. To deal with non-convex constraints and disconnected feasible regions, we pursue a Bayesian approach whereby we explicitly model the parameters and constraints as random variables in an augmented Bayes network (see Fig. 2). This allows us to frame the problem of learning the parameters as an inference problem in a hybrid Bayes network of discrete (T, C, V ) and continuous (Θ) variables. As we will see shortly, this augmented Bayes network provides a unifying framework to simultaneously learn from constraints and data, to deal with possibly inconsistent constraints, and to express preferences over the degree of satisfaction of the constraints. We encode the constraint derived from the expert feedback as a binary random variable V in the Bayes network. If V is true the constraint is satisfied, otherwise it is violated. Thus, if V is true then Θ lies in the positive region of Fig. 4, and if V is f alse then Θ lies in the negative region. We model the CPT for V as Pr(V |Θ) = max(0, π), where π = GI(C|T1 ) − GI(C|T2 ). Note that the value of GI(C|T ) lies in the interval [0,1], so the probability π will always be normalized. The intuition behind this definition of the CPT for V is that a constraint is more likely to be satisfied if the parameters lie in the interior of the constraint region. We place a Beta prior over each Θ parameter. Since the test variables are conditioned on the Θ parameters that are now part of the network, their conditional distributions become known. For instance, the conditional distribution for Ti (given in Eq. 3) is fully defined given the noisy-or parami eters θj . 
Hence the problem of learning the parameters becomes an inference problem to compute posteriors over the parameters given that the constraint is satisfied (and any data). In practice, it is more convenient to obtain a single value for the parameters instead of a posterior distribution since it is easier to make diagnostic predictions based on one Bayes network. We estimate the parameters by computing a maximum a posteriori (MAP) hypothesis given that the constraint is satisfied (and any data): Θ∗ = arg maxΘ Pr(Θ|V = true). 4 Algorithm 1 Pseudo Code for Gibbs Sampling, Stochastic Hill Climbing and Greedy Search 1 Fix observed variables, let V = true and randomly sample feasible starting state S 2 for i = 1 to #samples 3 for j = 1 to #hiddenV ariables 4 acceptSample = f alse; k = 0 5 repeat 6 Sample s′ from conditional of j th hidden variable Sj 7 S′ = S; Sj = s′ 8 if Sj is cause or test, then acceptSample = true 9 elseif S′ obeys constraints V∗ 10 if algo == Gibbs 11 Sample u from uniform distribution, U(0,1) p(S′ 12 if u < M q(S)′ ) where p and q are the true and proposal distributions and M > 1 13 acceptSample = true 14 elseif algo = = StochasticHillClimbing 15 if likelihood(S′ ) > likelihood(S), then acceptSample = true 16 elseif algo = = Greedy, then acceptSample = true 17 elseif algo = = Greedy 18 k = k+1 19 if k = = maxIterations, then s′ = Sj ; acceptSample = true 20 until acceptSample = = true 21 Sj = s′ 4.3 MAP ESTIMATION Previous approaches for parameter learning with domain knowledge include modified versions of EM or some other optimization techniques that account for linear/convex constraints on the parameters. Since our constraints are non-convex, we propose a new approach based on Gibbs sampling to approximate the posterior distribution, from which we compute the MAP estimate. Although the technique converges to the MAP in the limit, it may require excessive time. Hence, we modify Gibbs sampling to obtain more efficient stochastic hill climbing and greedy search algorithms with anytime properties. The pseudo code for our Gibbs sampler is provided in Algorithm 1. The two key steps are sampling the conditional distributions of each variable (line 6) and rejection sampling to ensure that the constraints are satisfied (lines 9 and 12). We sample each variable given the rest according to the following distributions: ti ∼ Pr(Ti |c, θi ) ∀i cj ∼ Pr(Cj |c − cj , t, θ) ∝ ∏ Pr(Cj ) j ∏ (5) Pr(ti |c, θi ) ∀j i i θj ∼ Pr(Θi |Θ − Θi , t, c, v) ∝ Pr(v|t, Θ) j j ∏ Pr(ti |cj , θi ) ∀i, j (6) (7) i The tests and causes are easily sampled from the multinomials as described in the equations above. However, sampling the θ’s is more difficult due to the factor Pr(v|Θ, t) = max(0, π), which is a truncated mixture of Betas. So, instead of sampling θ from its true conditional, we sample it from a proposal distribution that replaces max(0, π) by an un-truncated mixture of Betas equal to π + a where a is a constant that ensures that π + a is always positive. This is equivalent to ignoring the constraints. Then we ensure that the constraints are satisfied by rejecting the samples that violate the constraints. Once Gibbs sampling has been performed, we obtain a sample that approximates the posterior distribution over the parameters given the constraints (and any data). We return a single setting of the parameters by selecting the sampled instance with the highest posterior probability (i.e., MAP estimate). 
Since we will only return the MAP estimate, it is possible to speed up the search by modifying Gibbs sampling. In particular, we obtain a stochastic hill climbing algorithm by accepting a new sample only if its posterior probability improves upon that of the previous sample 5 Posterior Probability 0.1 0.08 Difference in Gini Impurity 0.1 0.05 0 −0.05 0.06 0.04 0.02 −0.1 1 0 1 1 0.8 0.5 0.6 0.8 0.4 Link Reliability of Test 2 and Cause 1 0 0.6 0.2 0 0.4 Link Reliability of Test 2 and Cause 2 Figure 4: Difference in Gini impurity for the network in 1 2 Fig. 1 when θ2 and θ2 are the only parameters allowed to vary. 0.2 Link Reliability of Test 2 and Cause 1 0 0 0.2 0.4 0.6 0.8 1 Link Reliability of Test 2 and Cause 1 Figure 5: Posterior over parameters computed through calculation after discretization. Figure 6: Posterior over parameters calculated through Sampling. (line 15). Thus, each iteration of the stochastic hill climber requires more time, but always improves the solution. As the number of constraints grows and the feasibility region shrinks, the Gibbs sampler and stochastic hill climber will reject most samples. We can mitigate this by using a Greedy sampler that caps the number of rejected samples, after which it abandons the sampling for the current variable to move on to the next variable (line 19). Even though the feasibility region is small overall, it may still be large in some dimensions, so it makes sense to try sampling another variable (that may have a larger range of feasible values) when it is taking too long to find a new feasible value for the current variable. 4.4 MODEL REFINEMENT WITH INCONSISTENT CONSTRAINTS So far, we have assumed that the expert’s actions generate a feasible region as a consequence of consistent constraints. We handle inconsistencies by further extending our augmented diagnostic Bayes network. We treat the observed constraint variable, V , as a probabilistic indicator of the true constraint V ∗ as shown in Figure 3. We can easily extend our techniques for computing the MAP to cater for this new constraint node by sampling an extra variable. 5 EVALUATION AND EXPERIMENTS 5.1 EVALUATION CRITERIA Formally, for M ∗ , the true model that we aim to learn, the diagnostic process determines the choice of best next test as the one with the smallest Gini impurity. If the correct choice for the next test is known (such as demonstrated by an expert), we can use this information to include a constraint on the model. We denote by V+ the set of observed constraints and by V∗ the set of all possible constraints that hold for M ∗ . Having only observed V+ , our technique will consider any M + ∈ M+ as a possible true model, where M+ is the set of all models that obey V + . We denote by M∗ the set of all models that are diagnostically equivalent to M ∗ (i.e., obey V ∗ and would recommend the MAP same steps as M ∗ ) and by MV+ the particular model obtained by MAP estimation based on the MAP constraints V+ . Similarly, when a dataset D is available, we denote by MD the model obtained MAP by MAP estimation based on D and by MDV+ , the model based on D and V+ . Ideally we would like to find the true underlying model M ∗ , hence we will report the KL divergence between the models found and M ∗ . However, other diagnostically equivalent M ∗ may recommend the same tests as M ∗ and thus have similar constraints, so we also report test consistency with M ∗ (i.e., # of recommended tests that are the same). 
5.2 CORRECTNESS OF MODEL REFINEMENT

Given V∗, our technique for model adjustment is guaranteed by construction to choose a model M^MAP ∈ M∗: if any constraint V∗ ∈ V∗ were violated, the rejection sampling step of our technique would reject that set of parameters.

Figure 7: Mean KL divergence and one standard deviation for a 3-cause, 3-test network when learning with data, constraints, and data+constraints.
Figure 8: Test consistency for a 3-cause, 3-test network when learning with data, constraints, and data+constraints.
Figure 9: Convergence rate comparison of Gibbs sampling, stochastic hill climbing, and greedy sampling (negative log likelihood of the MAP estimate against elapsed time, plotted on a log scale from 0 to 1500 seconds).

To illustrate this, consider the network in Fig. 2, which has six parameters (four link reliabilities and two leak parameters). We fix the leak parameters and the link reliability from the first cause to each test. We can then compute the posterior surface over the two remaining parameters by discretizing each parameter in small steps and calculating the posterior probability at each step, as shown in Fig. 5. Comparing this surface with the one obtained by Gibbs sampling with our technique (Fig. 6), we see that our technique recovers the posterior surface from which we can compute the MAP. We obtain the same MAP estimate with the stochastic hill climbing and greedy search algorithms.

5.3 EXPERIMENTAL RESULTS ON SYNTHETIC PROBLEMS

We start by presenting results on a 3-cause by 3-test fully connected bipartite Bayes network. We assume there exists some M∗ ∈ M∗ that we want to learn given V+, and we use our technique to find M^MAP. To evaluate M^MAP, we first compute the constraints V∗ for M∗ to obtain the feasible region associated with the true model. Next, we sample 100 other models from this feasible region that are diagnostically equivalent. We compare these models with M^MAP (after collecting 200 samples with non-informative priors on the parameters) by computing the KL divergence of M^MAP with respect to each sampled model. We expect the KL divergence to decrease as the number of constraints in V+ increases, since the feasible region becomes smaller. Figure 7 confirms this trend and shows that M^MAP_{DV+} has lower mean KL divergence than M^MAP_{V+}, which in turn has lower mean KL divergence than M^MAP_D. The data points in D are limited to the results of the diagnostic sessions needed to obtain V+; as the constraints increase, more data becomes available, so the results of the data-only approach also improve with increasing constraints.

We also compare test consistency when learning from data only, constraints only, or both. Given a fixed number of constraints, we enumerate the unobserved trajectories and compute the highest-ranked test under both the learned model and the sampled true models for each trajectory. Test consistency is reported as a percentage, with 100% consistency indicating that the learned and true models have the same highest-ranked test on every trajectory. Figure 8 presents these percentages for the greedy sampling technique (results are similar for the other techniques).
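For concreteness, here is a sketch of the three quantities driving the evaluation — Gini impurity (the test-selection criterion), KL divergence, and the test-consistency percentage. It assumes each model exposes a hypothetical `rank_tests(trajectory)` method returning tests in ranked order; that interface is an assumption, not part of the paper.

```python
import math

def gini_impurity(probs):
    """Gini impurity 1 - sum(p_k^2) of a discrete outcome distribution."""
    return 1.0 - sum(p * p for p in probs)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two aligned discrete distributions, with a small
    epsilon to guard against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def test_consistency(learned, true_model, trajectories):
    """Percentage of trajectories on which both models agree on the top test."""
    agree = sum(1 for t in trajectories
                if learned.rank_tests(t)[0] == true_model.rank_tests(t)[0])
    return 100.0 * agree / len(trajectories)
```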
Again, learning the parameters with both constraints and data is better than learning with constraints only, which in turn is usually better than learning with data only. Figure 9 compares the convergence rate of each technique in finding the MAP estimate. As expected, stochastic hill climbing and greedy sampling take less time than Gibbs sampling to find parameter settings with high posterior probability.

5.4 EXPERIMENTAL RESULTS ON REAL-WORLD PROBLEMS

We evaluate our technique on a real-world diagnostic network collected and reported by Agosta et al. [1], in which detailed session logs were collected over a period of seven weeks and the entire diagnostic sequence was recorded. The sequences intermingle model-building and querying phases. The network structure was inferred from an expert's sequence of positing causes and tests, and test-ranking constraints were deduced from the expert's test query sequences once the network structure was established. The 157 sessions captured over the seven weeks yielded a Bayes network with 115 tests, 82 root causes, and 188 arcs. The network consists of several disconnected sub-networks, each identified with a symptom represented by the first test in the sequence, with all subsequent tests applied within the same subnet. There were 20 sessions from which we could observe trajectories with at least two tests, giving a total of 32 test constraints. We pruned the diagnostic network to remove the sub-networks with no constraints, leaving a Bayes network with 54 tests, 30 root causes, and 67 parameters divided into 7 sub-networks, as shown in Figure 10; we apply our model refinement technique to learn the parameters of each sub-network separately.

Figure 10: Diagnostic Bayesian network collected from user trials and pruned to retain sub-networks with at least one constraint.
Figure 11: KL divergence comparison as the number of constraints increases for the real-world problem.

Since we do not have the true underlying network or the full set of constraints (more constraints could be observed in future diagnostic sessions), we treated the 32 constraints as if they were V∗ and the corresponding feasible region M∗ as if it contained the models diagnostically equivalent to the unknown true model. Figure 11 reports the KL divergence between the models found by our algorithms and models sampled from M∗ as the number of constraints increases. With such limited constraints, and consequently large feasible regions, it is not surprising that the variation in KL divergence is large. Again, the MAP estimate based on both the constraints and the data has lower KL divergence than constraints only and data only.

6 CONCLUSION AND FUTURE WORK

In summary, we presented an approach that learns the parameters of a Bayes network from constraints implied by test consistency together with any available data. While several approaches exist for incorporating qualitative constraints into learning procedures, our work makes two important contributions: it is the first approach to exploit implicit constraints based on value-of-information assessments, and it is the first approach that can handle non-convex constraints. We demonstrated the approach on synthetic data and on a real-world manufacturing diagnostic problem.
Since data is generally sparse in diagnostics, this work makes an important advance toward mitigating the model acquisition bottleneck, which has so far prevented the widespread application of diagnostic networks. In the future, it would be interesting to generalize this work to reinforcement learning in applications where data is sparse but constraints may be inferred from expert interactions.

Acknowledgments

This work was supported by a grant from Intel Corporation.

References

[1] John Mark Agosta, Omar Zia Khan, and Pascal Poupart. Evaluation results for a query-based diagnostics application. In The Fifth European Workshop on Probabilistic Graphical Models (PGM 10), Helsinki, Finland, September 13–15, 2010.
[2] Eric E. Altendorf, Angelo C. Restificar, and Thomas G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 2005.
[3] Brigham S. Anderson and Andrew W. Moore. Fast information value for graphical models. In Proceedings of the Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 51–58, Vancouver, BC, Canada, December 2005.
[4] Cassio P. de Campos and Qiang Ji. Improving Bayesian network parameter learning using constraints. In International Conference on Pattern Recognition (ICPR), Tampa, FL, USA, 2008.
[5] Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks: Combining qualitative and quantitative information. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 141–148, Montreal, QC, Canada, 1995.
[6] Ad J. Feelders. A new parameter learning method for Bayesian networks with qualitative influences. In Proceedings of the Twenty-Third International Conference on Uncertainty in Artificial Intelligence (UAI), Vancouver, BC, July 2007.
[7] María Ángeles Gil and Pedro Gil. A procedure to test the suitability of a factor for stratification in estimating diversity. Applied Mathematics and Computation, 43(3):221–229, 1991.
[8] David Heckerman and John S. Breese. Causal independence for probability assessment and inference using Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, 26(6):826–831, November 1996.
[9] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–56, 1995.
[10] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, August 1966.
[11] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential families. In Proceedings of the Twenty-Sixth Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009.
[12] Wenhui Liao and Qiang Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42:3046–3056, 2009.
[13] Yi Mao and Guy Lebanon. Domain knowledge uncertainty and probabilistic parameter constraints. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, QC, Canada, 2009.
[14] Ryszard S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111–116, 1984.
[15] Radu Stefan Niculescu, Tom M. Mitchell, and R. Bharat Rao. Bayesian network learning with parameter constraints. Journal of Machine Learning Research, 7:1357–1383, 2006.
[16] Mark A. Peot and Ross D. Shachter. Learning from what you don't observe. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 439–446, Madison, WI, July 1998.
[17] Michael P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44(3):257–303, August 1990.
[18] Frank Wittig and Anthony Jameson. Exploiting qualitative knowledge in the learning of conditional probabilities of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, July 2000.
3 0.92672753 145 nips-2011-Learning Eigenvectors for Free
Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth
Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu⊤ where u is a vector in Rⁿ of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n² parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequences of dyads share a common eigensystem, i.e. the worst-case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.
4 0.90338928 13 nips-2011-A blind sparse deconvolution method for neural spike identification
Author: Chaitanya Ekanadham, Daniel Tranchina, Eero P. Simoncelli
Abstract: We consider the problem of estimating neural spikes from extracellular voltage recordings. Most current methods are based on clustering, which requires substantial human supervision and systematically mishandles temporally overlapping spikes. We formulate the problem as one of statistical inference, in which the recorded voltage is a noisy sum of the spike trains of each neuron convolved with its associated spike waveform. Joint maximum-a-posteriori (MAP) estimation of the waveforms and spikes is then a blind deconvolution problem in which the coefficients are sparse. We develop a block-coordinate descent procedure to approximate the MAP solution, based on our recently developed continuous basis pursuit method. We validate our method on simulated data as well as real data for which ground truth is available via simultaneous intracellular recordings. In both cases, our method substantially reduces the number of missed spikes and false positives when compared to a standard clustering algorithm, primarily by recovering overlapping spikes. The method offers a fully automated alternative to clustering methods that is less susceptible to systematic errors.
same-paper 5 0.87376815 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The 1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
6 0.70711005 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data
7 0.6750145 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
8 0.62130791 249 nips-2011-Sequence learning with hidden units in spiking neural networks
9 0.60874224 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm
10 0.56868458 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models
11 0.56632793 86 nips-2011-Empirical models of spiking in neural populations
12 0.56431103 219 nips-2011-Predicting response time and error rates in visual search
13 0.5621441 102 nips-2011-Generalised Coupled Tensor Factorisation
14 0.56171644 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
15 0.55604327 150 nips-2011-Learning a Distance Metric from a Network
16 0.5539766 217 nips-2011-Practical Variational Inference for Neural Networks
17 0.55082583 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
18 0.54741007 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
19 0.54596841 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data
20 0.54518312 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment