jmlr jmlr2010 jmlr2010-111 knowledge-graph by maker-knowledge-mining

111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes


Source: pdf

Author: Jitkomut Songsiri, Lieven Vandenberghe

Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Electrical Engineering, University of California, Los Angeles, CA 90095-1594, USA. Editor: Martin Wainwright. Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. [sent-5, score-0.474]

2 The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. [sent-6, score-0.521]

3 We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. [sent-8, score-0.321]

4 Keywords: graphical models, time series, topology selection, convex optimization 1. [sent-10, score-0.297]

5 Introduction We consider graphical models of autoregressive (AR) Gaussian processes x(t) = −∑_{k=1}^{p} A_k x(t−k) + w(t), w(t) ∼ N(0, Σ), (1) where x(t) ∈ R^n and w(t) ∈ R^n is Gaussian white noise. [sent-11, score-0.254]
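
To make the recursion concrete, here is a minimal simulation sketch of model (1); the dimensions, coefficients A_k, and noise covariance Σ below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, T = 3, 2, 1000                        # illustrative dimensions

# assumed (small) AR coefficients A_1, ..., A_p and noise covariance Sigma;
# stability of the recursion is not checked here
A = [0.3 * rng.standard_normal((n, n)) for _ in range(p)]
Sigma = np.eye(n)
L = np.linalg.cholesky(Sigma)

x = np.zeros((T + p, n))
for t in range(p, T + p):
    w = L @ rng.standard_normal(n)          # w(t) ~ N(0, Sigma)
    x[t] = -sum(A[k] @ x[t - 1 - k] for k in range(p)) + w
x = x[p:]                                   # drop the zero initial conditions
```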

6 This characterization allows us to include the conditional independence relations in an estimation problem by placing sparsity constraints on the inverse spectral density matrix. [sent-14, score-0.296]

7 based on solving the convex optimization problem: minimize −log det X_{00} + tr(CX) subject to Y_k = ∑_{i=0}^{p−k} X_{i,i+k}, k = 0, 1, . [sent-18, score-0.229]

8 In this paper we consider the more general problem of estimating the model parameters and the topology of the graphical model. [sent-37, score-0.245]

9 The topology selection problem can be solved by enumerating all topologies, solving the ML estimation problem for each topology, and ranking them via information-theoretic criteria such as the Akaike or Bayes information criteria (Eichler, 2006; Songsiri et al. [sent-38, score-0.22]

10 Recently, new heuristic methods for topology selection in large Gaussian graphical models have been developed. [sent-65, score-0.28]

11 (2008) in which the gradient projection method is applied to the dual problem. [sent-87, score-0.269]

12 The main purpose of this paper is to develop an efficient method for topology selection in AR models, based on augmenting the estimation problem (2) with a convex regularization term, similar to the ℓ1 -norm regularization used in (4). [sent-88, score-0.272]

13 In Section 3 we set up the topology selection problem as a regularized ML problem and discuss its properties. [sent-93, score-0.293]

14 We conclude in Section 6 with a discussion of gradient projection algorithms for solving large instances of the regularized ML estimation problem. [sent-95, score-0.229]

15 If X is an r × s block matrix with i, j block X_{ij}, and each block is square of order n, then P_V(X) denotes the r × s block matrix with i, j block (P_V(X))_{ij} = P_V(X_{ij}). [sent-159, score-0.24]

16 If we denote by V the set of index pairs i, j of conditionally independent variables, then we can use the projection operator P = P_V defined in (7) to express the conditional independence relations as P(S(ω)^{−1}) = 0. [sent-170, score-0.223]
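
As a small illustration of this characterization, the sketch below evaluates S(ω)^{−1} = A(ω)^H Σ^{−1} A(ω), with A(ω) = I + ∑_k A_k e^{−jkω}, on a frequency grid and applies a mask-style projection P_V; the toy coefficients, grid, and tolerance are assumptions for the sketch.

```python
import numpy as np

def inverse_spectrum(A_list, Sigma, omegas):
    """S(omega)^{-1} = A(omega)^H Sigma^{-1} A(omega), where
    A(omega) = I + sum_k A_k exp(-1j*k*omega), for the AR model (1)."""
    n = Sigma.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    S_inv = []
    for w in omegas:
        Aw = np.eye(n, dtype=complex)
        for k, Ak in enumerate(A_list, start=1):
            Aw = Aw + Ak * np.exp(-1j * k * w)
        S_inv.append(Aw.conj().T @ Sigma_inv @ Aw)
    return np.array(S_inv)

def P_V(X, V):
    """Projection P_V: keep the entries with index pair in V, zero the rest."""
    out = np.zeros_like(X)
    for i, j in V:
        out[i, j], out[j, i] = X[i, j], X[j, i]
    return out

# Toy check: with diagonal AR coefficients all off-diagonal pairs are
# conditionally independent, so P_V(S(omega)^{-1}) vanishes for V = {(i,j): i<j}.
V = [(0, 1), (0, 2), (1, 2)]
S_inv = inverse_spectrum([0.3 * np.eye(3)], np.eye(3), np.linspace(0, np.pi, 64))
print(max(np.max(np.abs(P_V(S, V))) for S in S_inv))   # approximately 0
```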

17 This method is also known as the covariance method if C is the non-windowed sample covariance (13), and as the correlation method if C is the windowed sample covariance (14) (see Stoica and Moses, 1997). [sent-310, score-0.285]
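
A sketch of the two estimates of C referred to here, built from the stacked vectors (x(t), x(t−1), ..., x(t−p)); the normalizations are plausible choices and may differ from (13)-(14) by constant factors.

```python
import numpy as np

def sample_covariance(X, p, windowed=False):
    """Sample covariance C of the stacked vectors z(t) = (x(t), ..., x(t-p)).

    windowed=False: 'covariance method', only complete windows are used;
    windowed=True:  'correlation method', the series is zero-padded so the
    windows may run off both ends.  Normalizations here are assumptions and
    may differ from (13)-(14) by constants."""
    N, n = X.shape
    if windowed:
        Xp = np.vstack([np.zeros((p, n)), X, np.zeros((p, n))])
        norm = N
    else:
        Xp, norm = X, N - p
    T0 = Xp.shape[0]
    Z = np.hstack([Xp[p - k: T0 - k] for k in range(p + 1)])   # rows z(t)^T
    return Z.T @ Z / norm
```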

18 A convex relaxation is: minimize −log det X_{00} + tr(CX) subject to P(D(X)) = 0, X ⪰ 0, (16) with variable X ∈ S^{n(p+1)}. [sent-313, score-0.229]

19 In many applications the topology is not known, and needs to be discovered from the data. [sent-338, score-0.185]

20 This topology selection method based on information-theoretic criteria is feasible if the number of possible topologies is not too large, but quickly becomes intractable even for small values of n. [sent-343, score-0.363]

21 1 Regularized ML Problem In analogy with the convex heuristic for covariance selection (4), we can formulate a regularized ML problem by adding a nonsmooth ℓ1-type penalty: minimize −log det X_{00} + tr(CX) + γ h(D(X)) subject to X ⪰ 0, (21) where γ > 0 is a weighting parameter. [sent-346, score-0.461]

22 The penalty h : Mn,p → R is a convex function, chosen to encourage a sparse solution X with a common, symmetric sparsity pattern for the p + 1 blocks of D(X). [sent-347, score-0.263]

23 Regularization with a convex sum-of-norms penalty is a popular technique for achieving sparsity of groups of variables. [sent-356, score-0.216]
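
A small CVXPY sketch of the regularized problem (21) with the sum-of-ℓ∞-norms penalty over off-diagonal groups; the dimensions, placeholder covariance C, and value of γ are assumptions for illustration, not the paper's code.

```python
import cvxpy as cp
import numpy as np

n, p, gamma = 3, 1, 0.2                      # illustrative sizes and weight
N = n * (p + 1)

# placeholder covariance of stacked vectors (x(t), ..., x(t-p)) from noise data
rng = np.random.default_rng(0)
Zdata = rng.standard_normal((200, N))
C = Zdata.T @ Zdata / Zdata.shape[0]

X = cp.Variable((N, N), PSD=True)

def blk(i, j):
    return X[i*n:(i+1)*n, j*n:(j+1)*n]

# D(X)_k = sum_i X_{i,i+k}, k = 0, ..., p
Y = [sum(blk(i, i + k) for i in range(p + 1 - k)) for k in range(p + 1)]

# sum-of-ell_infinity penalty (22): one max per off-diagonal group (i, j)
h = 0
for i in range(n):
    for j in range(i + 1, n):
        group = [cp.abs(Y[0][i, j])]
        for k in range(1, p + 1):
            group += [cp.abs(Y[k][i, j]), cp.abs(Y[k][j, i])]
        h = h + cp.max(cp.hstack(group))

X00 = (blk(0, 0) + blk(0, 0).T) / 2          # symmetric by X's PSD structure
objective = -cp.log_det(X00) + cp.trace(C @ X) + gamma * h
cp.Problem(cp.Minimize(objective)).solve()
print(np.round(Y[0].value, 3))               # sparsity appears as gamma grows
```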

24 If we use a multiplier Z ∈ M_{n,p} for the equality constraint Y = D(X) and a multiplier U ∈ S^{n(p+1)} for the inequality X ⪰ 0, the Lagrangian of the problem is L(X, Y, Z, U) = −log det X_{00} + tr(CX) + γ h_∞(Y) − tr(UX) + tr(Z^T (D(X) − Y)) = −log det X_{00} + tr((C + T(Z) − U)X) + γ h_∞(Y) − tr(Z^T Y). [sent-365, score-0.21]

25 (24) The minimization over the off-diagonal part of the blocks Y_k decomposes into independent minimizations of the functions −∑_{k=0}^{p} ((Z_k)_{ij}(Y_k)_{ij} + (Z_k)_{ji}(Y_k)_{ji}) + γ max{ |(Y_0)_{ij}|, max_{k=1,...,p} |(Y_k)_{ij}|, max_{k=1,...,p} |(Y_k)_{ji}| }. [sent-374, score-0.225]

26 This gives a third set of dual feasibility conditions: (C + T(Z) − U)_{00} ≻ 0, (C + T(Z) − U)_{ij} = 0 for (i, j) ≠ (0, 0), (26) and an expression for the dual function: log det(C + T(Z) − U)_{00} + n if (24), (25), (26) hold, and −∞ otherwise. [sent-386, score-0.226]

27 If we add a variable W = C_{00} + Z_0 − U_{00} and eliminate the slack variable U, we can express the dual problem as: maximize log det W + n subject to [[W, 0], [0, 0]] ⪯ C + T(Z), (27) ∑_{k=0}^{p} (|(Z_k)_{ij}| + |(Z_k)_{ji}|) ≤ γ for i ≠ j, and diag(Z_k) = 0, k = 0, . [sent-388, score-0.223]

28 If a sum of ℓα-norms h_α(Y) = ∑_{j>i} ( ∑_{k=0}^{p} (|(Y_k)_{ij}|^α + |(Y_k)_{ji}|^α) )^{1/α} (28) is used as penalty function in (21), the second constraint in the corresponding dual problem (27) is replaced by ( ∑_{k=0}^{p} (|(Z_k)_{ij}|^β + |(Z_k)_{ji}|^β) )^{1/β} ≤ γ for i ≠ j, with β = α/(α − 1). [sent-395, score-0.339]

29 It follows that the primal and dual problems are solvable, have equal optimal values, and that their solutions are characterized by the following set of necessary and sufficient optimality (or KKT) conditions. [sent-399, score-0.204]

30 The Lagrangian evaluated at the primal and dual optimal solutions is equal to the primal objective at the optimal X, Y , and equal to the dual objective evaluated at the optimal W , Z. [sent-408, score-0.408]

31 Equality between the Lagrangian and the dual objective requires that the primal optimal X, Y minimize the Lagrangian evaluated at the dual optimal W , Z. [sent-411, score-0.354]

32 Under these conditions, the optimization problem (21) is equivalent to a regularized (conditional) ML estimation problem for the model parameters B: minimize −2 log det B_0 + tr(C B^T B) + γ h_∞(D(B^T B)). [sent-426, score-0.215]

33 Examples with Randomly Generated Data Our interest in the regularized ML formulation (21) is motivated by the fact that the resulting AR model typically has a sparse inverse spectrum S(ω)−1 . [sent-428, score-0.266]

34 Since the regularized problem is convex, it is interesting as an efficient heuristic for topology selection. [sent-429, score-0.258]

35 1 CHOICE OF REGULARIZATION PARAMETER γ The sparsity in the inverse spectrum of the solution of the regularized ML problem is controlled by the weighting coefficient γ. [sent-437, score-0.307]

36 on a statistical analysis of the probability of errors in the topology (see also Yuan and Lin, 2007; Banerjee et al. [sent-447, score-0.185]

37 In this approach, the convex heuristic is used as a preprocessing step to reduce the number of topologies that are examined using the BIC, and to filter out topologies that are unlikely to be competitive. [sent-456, score-0.26]

38 3 THRESHOLDING With a proper value of γ, the regularized ML problem (21) has a sparse solution Y, resulting in a sparse inverse spectrum S(ω)^{−1}. [sent-472, score-0.313]

39 We will use the following method to determine the topology from the computed solution. [sent-474, score-0.185]

40 The normalized inverse spectrum R(ω) is known as the partial coherence (Brillinger, 1981; Dahlhaus, 2000). [sent-476, score-0.185]

41 To estimate the graph topology we compare the L∞-norms of the entries of R(ω), ρ_{ij} = sup_ω |R(ω)_{ij}|, with a given threshold. [sent-479, score-0.253]

42 This thresholding step is similar to thresholding in other sparse methods, for example the thresholded lasso and Dantzig estimators in Lounici (2008). [sent-480, score-0.183]
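
A sketch of this topology estimate: build S(ω)^{−1} from the blocks Y_k, normalize it to the partial coherence R(ω), take ρ_{ij} as the maximum magnitude over a frequency grid, and threshold. The grid size and threshold value are assumptions.

```python
import numpy as np

def topology_from_Y(Y, n_freq=100, threshold=0.1):
    """Estimate the graph from the blocks Y = [Y_0, ..., Y_p]: form the inverse
    spectrum (10), normalize to the partial coherence R(omega), compute
    rho_ij = max_omega |R(omega)_ij|, and threshold."""
    n = Y[0].shape[0]
    rho = np.zeros((n, n))
    for w in np.linspace(0, np.pi, n_freq):
        S_inv = Y[0].astype(complex).copy()
        for k in range(1, len(Y)):
            S_inv += 0.5 * (np.exp(-1j * k * w) * Y[k] + np.exp(1j * k * w) * Y[k].T)
        d = 1.0 / np.sqrt(np.abs(np.diag(S_inv)))
        R = d[:, None] * S_inv * d[None, :]          # partial coherence R(omega)
        rho = np.maximum(rho, np.abs(R))
    pattern = rho > threshold
    np.fill_diagonal(pattern, False)
    return pattern, rho
```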

43 2 Experiment 1 In the first series of experiments we generate AR models with sparse inverse spectra by setting B0 = I and randomly choosing sparse lower triangular matrices Bk with entries ±0. [sent-483, score-0.287]

44 1 TOPOLOGY SELECTION We first illustrate the basic topology selection method outlined above using the correct model order (p = 2). [sent-490, score-0.22]

45 For each of the nine sparsity patterns, we solve the ML problem subject to sparsity constraints (16). [sent-499, score-0.211]

46 15) and the corresponding topology is shown in Figure 5 (left). [sent-503, score-0.185]

47 The sparsity pattern on the right of Figure 5 is obtained from the covariance selection method with ℓ1 norm regularization (i. [sent-523, score-0.198]

48 Ignoring the model dynamics substantially increased the error in the topology selection. [sent-526, score-0.185]

49 ML estimation with conditional independence constraints determined by thresholding the partial coherence matrix of the least-squares estimate (solution 1). [sent-534, score-0.209]

50 The sparsity pattern from the regularized ML problem for a static model (p = 0). [sent-546, score-0.243]

51 ML estimation with conditional independence constraints determined by thresholding the inverse spectral density for the Tikhonov estimate (solution 3). [sent-557, score-0.276]

52 ML estimation with conditional independence constraints determined by thresholding the inverse spectral density for the h∞ -regularized ML estimate (solution 5). [sent-562, score-0.276]

53 For small and moderate N we also see that model 6 (ML estimate for the topology selected via nonsmooth regularization) is much more accurate than the other methods. [sent-568, score-0.234]

54 As can be seen, the total error in the estimated topology is reduced in the regularized estimates, and the errors decrease more rapidly when we regularize with the sum-of-norms penalty h∞ . [sent-593, score-0.334]

55 3 Experiment 2 In the second experiment we compare different penalty functions h for the regularized ML problem (21): the ‘sum-of-ℓ∞ -norms’ penalty h∞ defined in (22), the ‘sum-of-ℓ2 -norms’ penalty h2 defined in (28) with α = 2, and the ‘sum-of-ℓ1 -norms’ penalty h1 defined in (28) with α = 1. [sent-595, score-0.377]

56 These penalty functions all yield models with a sparse inverse spectrum S(ω)^{−1} = Y_0 + (1/2) ∑_{k=1}^{p} (e^{−jkω} Y_k + e^{jkω} Y_k^T), but have different degrees of sparsity for the entries (Y_k)_{ij} within each group i, j. [sent-596, score-0.425]

57 The data are generated by randomly choosing sparse coefficients Yk of an inverse spectrum (10). [sent-597, score-0.193]

58 Table 1 shows the results of topology selection with the three penalties, for sample size N = 512 and averaged over 50 sample sequences. [sent-605, score-0.22]

59 Applications This section presents two examples of real data sets to demonstrate how topology selection can facilitate studies of relationships in multivariate time series. [sent-612, score-0.22]

60 The combined classification error computed as the sum of the false positives and false negatives divided by the number of upper triangular entries in the pattern. [sent-638, score-0.294]
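
For concreteness, that error measure can be computed as below from boolean adjacency patterns; counting only the strictly upper-triangular entries is an assumption consistent with symmetric sparsity patterns.

```python
import numpy as np

def topology_error(estimated, true):
    """(false positives + false negatives) divided by the number of strictly
    upper-triangular entries, for n x n boolean adjacency patterns."""
    iu = np.triu_indices(true.shape[0], k=1)
    fp = np.sum(estimated[iu] & ~true[iu])
    fn = np.sum(~estimated[iu] & true[iu])
    return (fp + fn) / iu[0].size
```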

61 1 Functional Magnetic Resonance Imaging (fMRI) Data In this section we apply the topology selection method to a functional magnetic resonance imaging (fMRI) time series. [sent-640, score-0.22]

62 Table 1: Accuracy of topology selection methods with penalty h_α for α = 1, 2, ∞. [sent-706, score-0.53]

63 The table shows the average KL divergence with respect to the true model and the average percentage error in the estimated topology (defined as the sum of the false positives and false negatives divided by the number of upper triangular entries in the pattern), averaged over 50 instances. [sent-707, score-0.479]

64 The density is computed as the number of nonzero entries in the estimated inverse spectrum divided by n2 . [sent-743, score-0.214]

65 2 International Stock Market Data We consider a multivariate time series of 17 stock market indices: the S&P 500 composite index (U. [sent-769, score-0.27]

66 Figure 10 (right) shows ρi j , the maximum magnitude of the partial coherence of the model, and compares it with a thresholded nonparametric estimate obtained with Welch’s method (Proakis, 2001) and the constrained ML model with topology obtained by thresholding the least-squares estimate. [sent-785, score-0.328]

67 Middle: Constrained ML estimate with topology determined from the LS solution. [sent-789, score-0.185]

68 Right: Constrained ML estimate with topology determined from the h∞ -regularized ML estimate. [sent-790, score-0.185]

69 Overall the graphical model seems plausible, and the experiment suggests that the topology selection method is quite effective. [sent-799, score-0.28]

70 The regularized ML problem (21) also includes a nondifferentiable term in the objective, and its dual (27) has a differentiable objective but constraints that involve nondifferentiable functions. [sent-803, score-0.186]

71 (For the regularized ML problem (21), these difficulties could be addressed as in the covariance selection method of Banerjee et al. [sent-809, score-0.183]

72 Figure 11: A graphical model of stock market data. [sent-824, score-0.243]

73 the smoothing and bounding parameters, this algorithm was slower than the dual gradient projection method described in this section, so we will not pursue it here. [sent-827, score-0.269]

74 The reformulated dual problems are interesting because they can often be solved by gradient algorithms for unconstrained optimization or gradient projection algorithms for problems with simple constraints. [sent-848, score-0.343]

75 We first describe a version of the classical gradient projection with a backtracking line search (Polyak, 1987; Bertsekas, 1999). [sent-869, score-0.264]

76 1 BASIC GRADIENT PROJECTION The basic gradient projection method starts at x^{(0)} and continues the iteration x^{(k)} = P(x^{(k−1)} − t_k ∇f(x^{(k−1)})) = x^{(k−1)} − t_k G_{t_k}(x^{(k−1)}) (38) until a stopping criterion is satisfied. [sent-879, score-0.518]

77 A classical convergence result states that x^{(k)} converges to an optimal solution if t_k is fixed and equal to 1/L, where L is a constant that satisfies ‖∇f(u) − ∇f(v)‖_2 ≤ L ‖u − v‖_2 ∀ u, v ∈ S, (39) (Polyak, 1987, §7. [sent-880, score-0.181]

78 Although our assumptions (S is closed and bounded, and dom f is open) imply that the Lipschitz condition (39) holds for some constant L > 0, its value is not known in practice, so the fixed step size rule tk = 1/L cannot be used. [sent-883, score-0.22]

79 We therefore determine tk using a backtracking search (Beck and Teboulle, 2009). [sent-884, score-0.289]

80 The step size search algorithm in iteration k starts at a value t_k := t̄_k, where t̄_k = min{ s^T s / (s^T y), t_max }, (40) with s = x^{(k−1)} − x^{(k−2)}, y = ∇f(x^{(k−1)}) − ∇f(x^{(k−2)}), and t_max a positive constant. [sent-885, score-0.588]

81 The search then repeats the update t_k := β t_k (where β ∈ (0, 1) is an algorithm parameter) until x^{(k−1)} − t_k G_{t_k}(x^{(k−1)}) ∈ dom f and f(x^{(k−1)} − t_k G_{t_k}(x^{(k−1)})) ≤ f(x^{(k−1)}) − t_k ∇f(x^{(k−1)})^T G_{t_k}(x^{(k−1)}) + (t_k/2) ‖G_{t_k}(x^{(k−1)})‖_2^2. [sent-887, score-0.989]

82 Note that the trial points x(k−1) − tk Gtk (x(k−1) ) = P x(k−1) − tk ∇ f (x(k−1) ) generated during the step size search are not necessarily on a straight line. [sent-889, score-0.443]
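
A generic sketch of the gradient projection iteration (38) with the Barzilai-Borwein initial step (40) and a backtracking search; f, grad, and project are problem-specific callables, and the sufficient-decrease constant follows the standard Beck and Teboulle (2009) form, which may differ in detail from condition (41).

```python
import numpy as np

def gradient_projection(f, grad, project, x0, beta=0.5, t_max=1.0,
                        max_iter=500, tol=1e-6):
    x = project(x0)
    g = grad(x)
    x_prev = g_prev = None
    t = t_max
    for _ in range(max_iter):
        # Barzilai-Borwein initial step size, as in (40)
        if x_prev is not None:
            s, y = x - x_prev, g - g_prev
            sy = float(s @ y)
            t = min(float(s @ s) / sy, t_max) if sy > 0 else t_max
        # backtrack on t until a sufficient-decrease condition holds and the
        # trial point lies in dom f (checked here via finiteness of f)
        while True:
            x_new = project(x - t * g)
            G = (x - x_new) / t                    # gradient mapping G_t(x)
            if np.isfinite(f(x_new)) and \
               f(x_new) <= f(x) - t * (g @ G) + 0.5 * t * (G @ G):
                break
            t *= beta
        x_prev, g_prev = x, g
        x, g = x_new, grad(x_new)
        if np.linalg.norm(x - x_prev) <= tol * max(1.0, np.linalg.norm(x)):
            break
    return x

# Illustrative use: minimize ||x - c||^2 over the unit box.
c = np.array([2.0, -3.0, 0.5])
sol = gradient_projection(lambda x: float((x - c) @ (x - c)),
                          lambda x: 2 * (x - c),
                          lambda x: np.clip(x, -1, 1),
                          np.zeros(3))
print(sol)    # approximately [1, -1, 0.5]
```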

83 For functions whose gradient is Lipschitz continuous on C, these algorithms have a better complexity than the classical gradient projection method (at most O(√(1/ε)) iterations are needed to reach an accuracy ε, as opposed to O(1/ε) for the gradient projection method). [sent-907, score-0.386]

84 These theoretical complexity results are valid if a constant step size t_k = 1/L is used, where L is the Lipschitz constant for the gradient, or if the step sizes form a nonincreasing sequence (t_{k+1} ≤ t_k) determined by a backtracking line search (Beck and Teboulle, 2009; Tseng, 2008). [sent-908, score-0.47]

85 The assumption that the gradient is Lipschitz continuous on C does not hold for the problem considered here, and it is not clear if the convergence analysis can be extended to the case when the gradient is Lipschitz continuous only on the initial sublevel set. [sent-909, score-0.199]

86 3 IMPLEMENTATION DETAILS The most important steps in the gradient projection algorithms applied to (33) are the evaluations of the gradient of the objective function and the projections on the set defined by the constraints. [sent-913, score-0.23]

87 The duality gap between this primal feasible X and the dual feasible Z, W is η = −log det X_{00} + tr(CX) + γ h(D(X)) − log det W − n = tr(CX) − n + γ h(D(X)) = tr((C + T(Z))X) − n − tr(X T(Z)) + γ h(D(X)) = −tr(X T(Z)) + γ h(D(X)). [sent-932, score-0.437]
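
A sketch of this gap computation in terms of the block-trace operator D and an adjoint T; the particular (non-symmetrized) form of T below, the use of h = h_∞, and the exact scaling conventions are assumptions, chosen so that ∑_k tr(Z_k^T D(X)_k) = tr(T(Z) X) for symmetric X.

```python
import numpy as np

def D(X, n, p):
    """Block-trace operator: D(X)_k = sum_i X_{i,i+k}, k = 0, ..., p."""
    return [sum(X[i*n:(i+1)*n, (i+k)*n:(i+k+1)*n] for i in range(p + 1 - k))
            for k in range(p + 1)]

def T(Z, n):
    """One choice of adjoint of D: place Z_k on the k-th block superdiagonal,
    so that sum_k tr(Z_k^T D(X)_k) = tr(T(Z) X) whenever X = X^T."""
    p = len(Z) - 1
    M = np.zeros((n * (p + 1), n * (p + 1)))
    for k, Zk in enumerate(Z):
        for i in range(p + 1 - k):
            M[i*n:(i+1)*n, (i+k)*n:(i+k+1)*n] += Zk
    return M

def h_inf(Y):
    """Sum-of-ell_infinity penalty (22) over the off-diagonal groups."""
    n = Y[0].shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += max([abs(Y[0][i, j])]
                         + [abs(Yk[i, j]) for Yk in Y[1:]]
                         + [abs(Yk[j, i]) for Yk in Y[1:]])
    return total

def duality_gap(X, Z, gamma, n, p):
    """eta = -tr(X T(Z)) + gamma * h(D(X)), with h = h_inf here, assuming X is
    primal feasible and (W, Z) dual feasible."""
    return -np.trace(X @ T(Z, n)) + gamma * h_inf(D(X, n, p))
```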

88 The true inverse spectrum has 10428 nonzero entries in the upper triangular part (a density of about 12%). [sent-938, score-0.275]

89 We start the gradient projection algorithm at a strictly feasible Z^{(0)} = 0, and terminate when the duality gap is below 10^{−2} (the optimal value is on the order of hundreds). [sent-942, score-0.245]

90 ‘GP with arc search’ refers to the gradient projection method (38) with step size rule (41). [sent-945, score-0.227]

91 The ‘Exact FISTA’ method is the gradient projection algorithm with backtracking line search from Beck and Teboulle (2009) using monotonically decreasing step sizes (tk ≤ tk−1 , as required by the theory in Beck and Teboulle 2009). [sent-953, score-0.264]

92 Although the backtracking steps in the arc search method are more expensive, the gradient projection method with this step size selection required fewer iterations in most cases. [sent-964, score-0.37]

93 Conclusion We have presented a convex optimization method for topology selection in graphical models of autoregressive Gaussian processes. [sent-966, score-0.526]

94 The method is based on augmenting the maximum likelihood estimation problem with an ℓ1 -type penalty function, chosen to promote sparsity in the inverse spectrum. [sent-967, score-0.228]

95 To solve the large, nonsmooth convex optimization problems that result from this formulation, we have investigated a gradient projection method applied to a reformulated dual problem. [sent-970, score-0.37]

96 Experiments with randomly generated examples, and an analysis of an fMRI time series and a time series of international stock market indices were included to confirm the effectiveness of this approach. [sent-971, score-0.183]

97 Maximum-likelihood estimation of multivariate normal graphical models: large-scale numerical implementation and topology selection. [sent-1057, score-0.245]

98 Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. [sent-1111, score-0.193]

99 Adaptive first-order methods for general sparse inverse covariance selection. [sent-1165, score-0.186]

100 SINCO - a greedy coordinate ascent method for sparse inverse covariance selection problem. [sent-1242, score-0.221]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ml', 0.378), ('yk', 0.213), ('ongsiri', 0.203), ('autoregressive', 0.194), ('opology', 0.194), ('topology', 0.185), ('tk', 0.181), ('raphical', 0.173), ('zk', 0.154), ('rocesses', 0.144), ('songsiri', 0.143), ('vandenberghe', 0.142), ('stock', 0.128), ('election', 0.127), ('tikhonov', 0.121), ('pv', 0.119), ('tr', 0.114), ('dual', 0.113), ('odels', 0.109), ('ar', 0.108), ('dahlhaus', 0.107), ('det', 0.105), ('topologies', 0.104), ('sn', 0.102), ('ls', 0.101), ('gp', 0.092), ('primal', 0.091), ('sparsity', 0.088), ('eichler', 0.083), ('spectrum', 0.082), ('projection', 0.082), ('static', 0.082), ('bic', 0.08), ('fmri', 0.076), ('penalty', 0.076), ('ji', 0.075), ('covariance', 0.075), ('gradient', 0.074), ('regularized', 0.073), ('aicc', 0.072), ('fista', 0.072), ('gtk', 0.072), ('arc', 0.071), ('entries', 0.068), ('thresholding', 0.068), ('cx', 0.064), ('inverse', 0.064), ('backtracking', 0.063), ('triangular', 0.061), ('graphical', 0.06), ('windowed', 0.06), ('nesterov', 0.056), ('beck', 0.055), ('market', 0.055), ('conditional', 0.054), ('curve', 0.054), ('convex', 0.052), ('sublevel', 0.051), ('duality', 0.05), ('nonsmooth', 0.049), ('banerjee', 0.049), ('composite', 0.048), ('block', 0.048), ('birgin', 0.048), ('independence', 0.048), ('lagrangian', 0.047), ('incorrectly', 0.047), ('bt', 0.047), ('sparse', 0.047), ('bk', 0.046), ('mle', 0.045), ('search', 0.045), ('false', 0.043), ('teboulle', 0.042), ('spectral', 0.042), ('detw', 0.041), ('positives', 0.04), ('dom', 0.039), ('feasible', 0.039), ('index', 0.039), ('negatives', 0.039), ('coherence', 0.039), ('minimize', 0.037), ('scheinberg', 0.037), ('stimulus', 0.037), ('constrained', 0.036), ('gt', 0.036), ('straight', 0.036), ('barzilai', 0.036), ('brillinger', 0.036), ('cbt', 0.036), ('dahl', 0.036), ('feiler', 0.036), ('jitkomut', 0.036), ('rish', 0.036), ('salvador', 0.036), ('timmer', 0.036), ('lu', 0.035), ('subject', 0.035), ('selection', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

Author: Jitkomut Songsiri, Lieven Vandenberghe

Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization

2 0.19195464 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra

Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets

3 0.14272739 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

4 0.11867193 10 jmlr-2010-An Exponential Model for Infinite Rankings

Author: Marina Meilă, Le Bao

Abstract: This paper presents a statistical model for expressing preferences through rankings, when the number of alternatives (items to rank) is large. A human ranker will then typically rank only the most preferred items, and may not even examine the whole set of items, or know how many they are. Similarly, a user presented with the ranked output of a search engine, will only consider the highest ranked items. We model such situations by introducing a stagewise ranking model that operates with finite ordered lists called top-t orderings over an infinite space of items. We give algorithms to estimate this model from data, and demonstrate that it has sufficient statistics, being thus an exponential family model with continuous and discrete parameters. We describe its conjugate prior and other statistical properties. Then, we extend the estimation problem to multimodal data by introducing an Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties of our model and demonstrate that infinite models over permutations can be simple, elegant and practical. Keywords: permutations, partial orderings, Mallows model, distance based ranking model, exponential family, non-parametric clustering, branch-and-bound

5 0.10908031 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

6 0.10292453 36 jmlr-2010-Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity

7 0.096763581 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

8 0.082772233 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

9 0.079631075 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

10 0.079442836 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes

11 0.077164397 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

12 0.075226188 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

13 0.074562021 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox

14 0.073388547 84 jmlr-2010-On Spectral Learning

15 0.072342075 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

16 0.0690789 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

17 0.067196816 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

18 0.066412427 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

19 0.064461246 109 jmlr-2010-Stochastic Composite Likelihood

20 0.063229054 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.291), (1, -0.069), (2, 0.114), (3, -0.053), (4, -0.11), (5, -0.001), (6, 0.232), (7, -0.025), (8, -0.166), (9, -0.014), (10, -0.165), (11, -0.036), (12, 0.01), (13, -0.219), (14, 0.009), (15, -0.133), (16, 0.289), (17, 0.139), (18, 0.019), (19, -0.039), (20, 0.037), (21, -0.008), (22, 0.011), (23, -0.011), (24, 0.046), (25, 0.178), (26, 0.03), (27, 0.082), (28, 0.085), (29, -0.11), (30, -0.083), (31, -0.042), (32, 0.045), (33, 0.114), (34, 0.049), (35, -0.04), (36, 0.029), (37, -0.005), (38, -0.092), (39, 0.017), (40, -0.02), (41, -0.041), (42, 0.064), (43, -0.108), (44, -0.03), (45, -0.055), (46, 0.004), (47, 0.061), (48, -0.016), (49, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94401491 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

Author: Jitkomut Songsiri, Lieven Vandenberghe

Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization

2 0.61310118 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra

Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets

3 0.58917922 10 jmlr-2010-An Exponential Model for Infinite Rankings

Author: Marina Meilă, Le Bao

Abstract: This paper presents a statistical model for expressing preferences through rankings, when the number of alternatives (items to rank) is large. A human ranker will then typically rank only the most preferred items, and may not even examine the whole set of items, or know how many they are. Similarly, a user presented with the ranked output of a search engine, will only consider the highest ranked items. We model such situations by introducing a stagewise ranking model that operates with finite ordered lists called top-t orderings over an infinite space of items. We give algorithms to estimate this model from data, and demonstrate that it has sufficient statistics, being thus an exponential family model with continuous and discrete parameters. We describe its conjugate prior and other statistical properties. Then, we extend the estimation problem to multimodal data by introducing an Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties of our model and demonstrate that infinite models over permutations can be simple, elegant and practical. Keywords: permutations, partial orderings, Mallows model, distance based ranking model, exponential family, non-parametric clustering, branch-and-bound

4 0.43546852 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

5 0.41652074 36 jmlr-2010-Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity

Author: Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, Patrik O. Hoyer

Abstract: Analysis of causal effects between continuous-valued variables typically uses either autoregressive models or structural equation models with instantaneous effects. Estimation of Gaussian, linear structural equation models poses serious identifiability problems, which is why it was recently proposed to use non-Gaussian models. Here, we show how to combine the non-Gaussian instantaneous model with autoregressive models. This is effectively what is called a structural vector autoregression (SVAR) model, and thus our work contributes to the long-standing problem of how to estimate SVAR’s. We show that such a non-Gaussian model is identifiable without prior knowledge of network structure. We propose computationally efficient methods for estimating the model, as well as methods to assess the significance of the causal influences. The model is successfully applied on financial and brain imaging data. Keywords: structural vector autoregression, structural equation models, independent component analysis, non-Gaussianity, causality

6 0.38619545 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

7 0.36409864 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

8 0.35583904 109 jmlr-2010-Stochastic Composite Likelihood

9 0.34785941 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

10 0.32948396 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

11 0.32796824 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

12 0.31328532 84 jmlr-2010-On Spectral Learning

13 0.30731711 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis

14 0.30480206 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes

15 0.30224505 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

16 0.29871666 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

17 0.28882304 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox

18 0.28268507 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

19 0.27408808 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression

20 0.27208644 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.016), (4, 0.013), (8, 0.035), (14, 0.317), (15, 0.016), (21, 0.024), (32, 0.088), (33, 0.018), (36, 0.048), (37, 0.058), (75, 0.149), (81, 0.029), (85, 0.067), (96, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70460886 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

Author: Jitkomut Songsiri, Lieven Vandenberghe

Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization

2 0.52388966 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily

3 0.52106529 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

4 0.51743513 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

5 0.51652014 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity

6 0.51380688 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

7 0.51261729 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

8 0.51243466 109 jmlr-2010-Stochastic Composite Likelihood

9 0.50970215 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

10 0.50968802 63 jmlr-2010-Learning Instance-Specific Predictive Models

11 0.50882471 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

12 0.50679809 102 jmlr-2010-Semi-Supervised Novelty Detection

13 0.50517201 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

14 0.50434375 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

15 0.50274587 107 jmlr-2010-Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

16 0.50202554 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices

17 0.50126433 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

18 0.50118744 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis

19 0.50019848 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

20 0.49962097 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory