nips nips2010 nips2010-108 knowledge-graph by maker-knowledge-mining

108 nips-2010-Graph-Valued Regression


Source: pdf

Author: Han Liu, Xi Chen, Larry Wasserman, John D. Lafferty

Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y . In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method “Graph-optimized CART”, or GoCART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Graph-Valued Regression. Han Liu, Xi Chen, John Lafferty, Larry Wasserman. Carnegie Mellon University, Pittsburgh, PA 15213. Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y . [sent-1, score-0.149]

2 We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. [sent-3, score-0.145]

3 In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. [sent-4, score-0.354]

4 We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. [sent-6, score-1.036]

5 We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. [sent-7, score-0.146]

6 A common way to study the structure of P is to construct the undirected graph G = (V, E), where the vertex set V corresponds to the p components of the vector Y . [sent-9, score-0.161]

7 In particular, let G(x) be the undirected graph corresponding to P (· | X = x). [sent-14, score-0.161]

8 Then G induces a partition of X , denoted as X1 , . . . , Xm . [sent-17, score-0.328]

9 Two points x1 and x2 lie in the same partition element if and only if G(x1 ) = G(x2 ). [sent-20, score-0.318]

10 Graph-valued regression is thus the problem of estimating the partition and estimating the graph within each partition element. [sent-21, score-0.799]

11 We present three different partition-based graph estimators: two that use global optimization, and one based on a greedy splitting procedure. [sent-22, score-0.238]

12 One of the optimization-based schemes uses penalized empirical risk minimization; the other uses held-out risk minimization. [sent-23, score-0.399]

13 As we show, both methods enjoy strong theoretical properties under relatively weak assumptions; in particular, we establish oracle inequalities on the excess risk of the estimators, and tree partition consistency (under stronger assumptions) in Section 4. [sent-24, score-0.681]

14 In Section 5 we present experimental results on both synthetic data and a meteorological dataset, demonstrating how graph-valued regression can be an effective tool for analyzing high dimensional data with covariates. [sent-27, score-0.146]

15 One way to estimate G from the sample is the graphical lasso or glasso [13, 5, 1], where one assumes that P is Gaussian with mean µ and covariance matrix Σ. [sent-33, score-0.293]

16 Missing edges in the graph correspond to zero elements in the precision matrix Ω = Σ−1 [12, 4, 7]. [sent-34, score-0.216]

17 A sparse estimate of Ω is obtained by solving Ω̂ = argmin_{Ω ≻ 0} { tr(SΩ) − log |Ω| + λ‖Ω‖1 } (1), where S is the sample covariance matrix and ‖Ω‖1 = Σ_{j,k} |Ωjk | is the elementwise ℓ1 -norm of Ω. [sent-35, score-0.164]
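
As a concrete illustration of the objective in (1), here is a minimal sketch using scikit-learn's GraphicalLasso; the data, variable names, and the regularization value are assumptions for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Hypothetical n x p sample standing in for observations of Y.
rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 20))

# Solves argmin_{Omega > 0} tr(S Omega) - log|Omega| + lambda * ||Omega||_1,
# where S is the sample covariance of Y and alpha plays the role of lambda.
model = GraphicalLasso(alpha=0.1).fit(Y)
Omega_hat = model.precision_

# Missing edges of the estimated graph correspond to (near-)zero off-diagonal entries.
edges = np.argwhere(np.triu(np.abs(Omega_hat) > 1e-8, k=1))
```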

18 In practice, it seems that the glasso yields reasonable graph estimators even if Y is not Gaussian; however, proving conditions under which this happens is an open problem. [sent-41, score-0.294]

19 We can estimate ΣX , ΣY , and ΣXY by their corresponding sample quantities Σ̂X , Σ̂Y , and Σ̂XY , and the marginal precision matrix of X, denoted as ΩX , can be estimated using the glasso. [sent-46, score-0.182]

20 In particular, the conditional covariance matrix of Y | X is ΣY |X = ΣY − ΣY X ΩX ΣXY , and a sparse estimate of ΩY |X can be obtained by directly plugging Σ̂Y |X into the glasso. [sent-48, score-0.185]

21 However, the estimated graph does not vary with different values of X. [sent-49, score-0.164]

22 Now we apply the glasso in (1) with S = Σ̂(x) to obtain an estimate of G(x). [sent-56, score-0.176]
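
A hedged sketch of the kernel-smoothing idea behind this step: estimate a locally weighted mean and covariance around a query point and plug the result into the glasso. The Gaussian product kernel, the bandwidth h, and all variable names are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def local_graph(x0, X, Y, h=0.2, alpha=0.1):
    """Estimate G(x0) from kernel-weighted data: X is n x d covariates, Y is n x p responses."""
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2.0 * h ** 2))  # kernel weights around x0
    w = w / w.sum()
    mu = w @ Y                                   # locally weighted mean
    R = Y - mu
    Sigma = (R * w[:, None]).T @ R               # locally weighted covariance Sigma_hat(x0)
    Sigma += 1e-6 * np.eye(Y.shape[1])           # small ridge for numerical stability
    _, Omega = graphical_lasso(Sigma, alpha=alpha)
    return Omega                                 # zeros encode missing edges of G(x0)
```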

23 In this approach, we partition X into finitely many connected regions X1 , . . . , Xm . [sent-60, score-0.288]

24 Within each Xj , we apply the glasso to get an estimated graph Gj . [sent-64, score-0.308]

25 We take the partition elements to be recursively defined hyperrectangles. [sent-67, score-0.288]

26 As is well-known, we can then represent the partition by a tree, where each leaf node corresponds to a single partition element. [sent-68, score-0.691]

27 In CART, the leaves are associated with the means within each partition element, while in our case there will be an estimated undirected graph for each leaf node. [sent-69, score-0.566]

28 The task of graph-valued regression is to find a sparse inverse covariance Ω̂(x) to estimate Ω(x) for any x ∈ X ; in some situations the graph of Ω(x) is of greater interest than the entries of Ω(x) themselves. [sent-90, score-0.294]

29 We partition X into finitely many connected regions X1 , . . . , Xm . [sent-92, score-0.288]

30 Within each Xj we apply the glasso to estimate a graph Gj . [sent-95, score-0.291]

31 To find the partition, we restrict ourselves to dyadic splits, as studied by [11, 2]. [sent-97, score-0.303]

32 The primary reason for such a choice is the computational and theoretical tractability of dyadic-partition-based estimators. [sent-98, score-0.591]

33 Let T denote the set of dyadic partitioning trees (DPTs) defined over X = [0, 1]d , where each DPT T ∈ T is constructed by recursively dividing X by means of axis-orthogonal dyadic splits. [sent-99, score-0.679]

34 If a node is associated with the hyperrectangle A = ∏_{l=1}^{d} [al , bl ], then after being dyadically split along dimension k, the two children are associated with the sub-hyperrectangles A_L^{(k)} = {x ∈ A : xk ≤ (ak + bk )/2} and A_R^{(k)} = A \ A_L^{(k)} . [sent-101, score-0.239]
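
A small sketch of an axis-orthogonal dyadic split, with the hyperrectangle represented as a list of (a_l, b_l) intervals; this representation is an implementation assumption for illustration.

```python
def dyadic_split(rect, k):
    """Split rect = [(a_1, b_1), ..., (a_d, b_d)] at the midpoint of dimension k,
    returning the left and right child hyperrectangles A_L^(k) and A_R^(k)."""
    a_k, b_k = rect[k]
    mid = 0.5 * (a_k + b_k)
    left, right = list(rect), list(rect)
    left[k] = (a_k, mid)
    right[k] = (mid, b_k)
    return left, right
```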

35 We denote by Π(T ) = {X1 , . . . , XmT } the partition of X induced by the leaf nodes of T . [sent-105, score-0.392]

36 For a dyadic integer N = 2^K , we define TN to be the collection of all DPTs such that no partition has a side length smaller than 2^−K . [sent-106, score-0.591]

37 We denote µT (x) and ΩT (x) as the piecewise constant mean and precision functions associated with T : µT (x) = Σ_{j=1}^{mT} µXj · I (x ∈ Xj ) and ΩT (x) = Σ_{j=1}^{mT} ΩXj · I (x ∈ Xj ), where µXj ∈ R^p and ΩXj ∈ R^{p×p} are the mean vector and precision matrix for Xj . [sent-108, score-0.177]
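
A sketch of how the piecewise-constant functions µT(x) and ΩT(x) can be evaluated once per-leaf estimates are available; representing the partition by membership tests is an implementation assumption.

```python
def evaluate_tree(x, leaves, params):
    """Return (mu_T(x), Omega_T(x)) for the leaf X_j that contains x.
    leaves: list of membership functions x -> bool; params: list of (mu_j, Omega_j)."""
    for contains, (mu_j, Omega_j) in zip(leaves, params):
        if contains(x):
            return mu_j, Omega_j
    raise ValueError("x lies outside the partitioned domain [0, 1]^d")
```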

38 The penalized empirical risk minimization Go-CART estimator is defined as (T̂ , {µ̂Xj , Ω̂Xj }_{j=1}^{mT̂}) = argmin_{T ∈ TN , µXj ∈ Mj , ΩXj ∈ Λj} R(T, µT , ΩT ) + pen(T ), where R is defined in (3) and pen(T ) = γn · mT · √( ([[T ]] log 2 + 2 log(np)) / n ) (5). [sent-120, score-0.346]

39 Empirically, we may always set the dyadic integer N to be a reasonably large value; the regularization parameter γn is responsible for selecting a suitable DPT T ∈ TN . [sent-121, score-0.303]

40 The held-out negative log-likelihood risk is then given by Rout (T, µT , ΩT ) = (1/n2) Σ_{i=1}^{n2} Σ_{j=1}^{mT} [ tr( ΩXj (yi − µXj )(yi − µXj )^T ) − log |ΩXj | ] · I (xi ∈ Xj ). [sent-130, score-0.247]
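
A hedged sketch of the held-out risk for a fixed partition and fixed per-leaf estimates; the data-structure choices (membership tests, parameter lists) are assumptions, while the formula follows the expression above.

```python
import numpy as np

def heldout_risk(X2, Y2, leaves, params):
    """R_out(T, mu_T, Omega_T) computed on the held-out split D2 = (X2, Y2)."""
    n2 = len(Y2)
    risk = 0.0
    for contains, (mu, Omega) in zip(leaves, params):
        mask = np.array([contains(x) for x in X2])
        if not mask.any():
            continue
        R = Y2[mask] - mu
        _, logdet = np.linalg.slogdet(Omega)
        quad = np.einsum("ni,ij,nj->", R, Omega, R)   # sum of tr(Omega (y - mu)(y - mu)^T)
        risk += quad - mask.sum() * logdet
    return risk / n2
```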

41 The held-out risk minimization Go-CART estimator is T̂ = argmin_{T ∈ TN} Rout (T, µ̂T , Ω̂T ). [sent-136, score-0.26]

42 The above procedures require us to find an optimal dyadic partitioning tree within TN . [sent-138, score-0.454]

43 We focus on the held-out risk minimization form as in Definition 2, due to its superior empirical performance. [sent-141, score-0.213]

44 But note that our greedy approach is generic and can easily be adapted to the penalized empirical risk minimization form. [sent-142, score-0.331]

45 First, consider the simple case that we are given a dyadic tree structure T which induces a partition Π(T ) = {X1 , . . . , XmT }. [sent-143, score-0.735]

46 For any partition element Xj , we estimate the sample mean using D1 : µ̂Xj = ( Σ_{i=1}^{n1} yi · I (xi ∈ Xj ) ) / ( Σ_{i=1}^{n1} I (xi ∈ Xj ) ). [sent-147, score-0.404]

47 i=1 The glasso is then used to estimate a sparse precision matrix ΩXj . [sent-148, score-0.318]

48 More precisely, let Σ̂Xj be the sample covariance matrix for the partition element Xj , given by Σ̂Xj = ( Σ_{i=1}^{n1} (yi − µ̂Xj )(yi − µ̂Xj )^T · I (xi ∈ Xj ) ) / ( Σ_{i=1}^{n1} I (xi ∈ Xj ) ). [sent-149, score-0.509]
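
A minimal sketch of these per-leaf estimates, computed on the first data split D1; the helper names are assumptions, and the leaf is assumed to contain at least one point.

```python
import numpy as np

def leaf_mean_and_cov(X1, Y1, contains):
    """Sample mean and covariance of Y restricted to one partition element X_j."""
    mask = np.array([contains(x) for x in X1])
    n_j = mask.sum()
    mu = Y1[mask].mean(axis=0)             # hat{mu}_{X_j}
    R = Y1[mask] - mu
    Sigma = (R.T @ R) / n_j                # hat{Sigma}_{X_j}
    return mu, Sigma
```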

49 In practice, we run the full regularization path of the glasso, from large λj , which yields a very sparse graph, to small λj , and select the graph that minimizes the held-out negative log-likelihood risk. [sent-151, score-0.156]
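
A sketch of this path-selection step: sweep the glasso regularization path from large to small lambda for one partition element and keep the graph with the smallest held-out negative log-likelihood. The lambda grid and the heldout_nll callback are assumptions; the latter could be built from the held-out risk sketched earlier.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def select_on_path(Sigma_j, heldout_nll, lambdas=np.logspace(0, -2, 20)):
    """Return the precision matrix along the glasso path minimizing heldout_nll."""
    best_Omega, best_score = None, np.inf
    for lam in lambdas:                            # from large lambda (sparse) to small lambda
        _, Omega = graphical_lasso(Sigma_j, alpha=lam)
        score = heldout_nll(Omega)
        if score < best_score:
            best_Omega, best_score = Omega, score
    return best_Omega
```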

50 To further improve the model selection performance, we refit the parameters of the precision matrix after the graph has been selected. [sent-152, score-0.216]

51 That is, to reduce the bias of the glasso, we first estimate the sparse precision matrix using ℓ1 -regularization, and then we refit the Gaussian model without ℓ1 -regularization, but enforcing the sparsity pattern obtained in the first step. [sent-153, score-0.174]
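
The refitting solver is not spelled out in the extracted text; the following is a naive projected-gradient sketch that maximizes the Gaussian log-likelihood log|Omega| - tr(S Omega) while keeping entries outside the selected sparsity pattern at zero. The solver choice, step size, and iteration count are assumptions.

```python
import numpy as np

def refit_given_pattern(S, support, n_iter=2000, lr=0.05):
    """Refit Omega with the zero pattern fixed to `support` (symmetric boolean p x p mask)."""
    p = S.shape[0]
    mask = support | np.eye(p, dtype=bool)           # the diagonal is always free
    Omega = np.eye(p)
    for _ in range(n_iter):
        grad = np.linalg.inv(Omega) - S               # gradient of log|Omega| - tr(S Omega)
        step = lr * np.where(mask, grad, 0.0)         # restrict the update to allowed entries
        step = 0.5 * (step + step.T)
        candidate = Omega + step
        while np.linalg.eigvalsh(candidate)[0] <= 0:  # backtrack to stay positive definite
            step *= 0.5
            candidate = Omega + step
        Omega = candidate
    return Omega
```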

52 The natural, standard greedy procedure starts from the coarsest partition X = [0, 1]^d and then computes the decrease in the held-out risk by dyadically splitting each hyperrectangle A along dimension k ∈ {1, . . . , d}. [sent-154, score-0.703]

53 The dimension k∗ that results in the largest decrease in held-out risk is selected, where the change in risk is given by ∆Rout^{(k)} (A, µA , ΩA ) = Rout (A, µA , ΩA ) − Rout (A_L^{(k)} , µ_{A_L^{(k)}} , Ω_{A_L^{(k)}} ) − Rout (A_R^{(k)} , µ_{A_R^{(k)}} , Ω_{A_R^{(k)}} ). [sent-158, score-0.381]

54 If splitting any dimension k of A leads to an increase in the held-out risk, the element A should no longer be split and hence becomes a partition element of Π(T ). [sent-159, score-0.475]

55 This greedy partitioning method parallels the classical algorithms for classification and regression that have been used in statistical learning for decades. [sent-161, score-0.16]
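
Putting the pieces together, a hedged sketch of the greedy procedure described above, reusing the dyadic_split helper sketched earlier; risk_of is assumed to fit the per-leaf (mu, Omega) on D1 restricted to a hyperrectangle and return its held-out risk on D2.

```python
def greedy_partition(rect, risk_of, min_side=2 ** -10):
    """Recursively split rect while some dyadic split decreases the held-out risk."""
    base = risk_of(rect)
    best_gain, best_children = 0.0, None
    for k in range(len(rect)):
        a_k, b_k = rect[k]
        if b_k - a_k <= min_side:                 # respect the dyadic resolution limit N = 2^K
            continue
        left, right = dyadic_split(rect, k)
        gain = base - (risk_of(left) + risk_of(right))   # Delta R_out for dimension k
        if gain > best_gain:
            best_gain, best_children = gain, (left, right)
    if best_children is None:                     # every split increases the held-out risk
        return [rect]                             # rect becomes a leaf of Pi(T)
    left, right = best_children
    return greedy_partition(left, risk_of, min_side) + greedy_partition(right, risk_of, min_side)
```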

56 4 Theoretical Properties. We define the oracle risk R∗ over TN as R∗ = R(T ∗ , µ∗_{T ∗} , Ω∗_{T ∗} ) = inf_{T ∈ TN , µXj ∈ Mj , ΩXj ∈ Λj} R(T, µT , ΩT ). [sent-164, score-0.295]

57 Note that T ∗ , µ∗_{T ∗} , and Ω∗_{T ∗} might not be unique, since the finest partition always achieves the oracle risk. [sent-165, score-0.344]

58 Let T ∈ TN be an arbitrary DPT which induces a partition Π(T ) = {X1 , . . . , XmT }. [sent-168, score-0.328]

59 Let T ∈ TN be a DPT that induces a partition Π(T ) = {X1 , . . . , XmT }. [sent-183, score-0.328]

60 For any δ ∈ (0, 1/4), let (T̂ , µ̂T̂ , Ω̂T̂ ) be the estimator obtained using the penalized empirical risk minimization Go-CART in Definition 1, with a penalty term of the form pen(T ) = (C1 + 1) Ln mT √( ([[T ]] log 2 + 2 log p + log(48/δ)) / n ), where C1 = 8√v2 + 8B√v1 + B². [sent-187, score-0.339]

61 Then for sufficiently large n, the excess risk inequality R(T̂ , µ̂T̂ , Ω̂T̂ ) − R∗ ≤ inf_{T ∈ TN} [ 2 pen(T ) + inf_{µXj ∈ Mj , ΩXj ∈ Λj} ( R(T, µT , ΩT ) − R∗ ) ] holds with probability at least 1 − δ. [sent-188, score-0.34]

62 A similar oracle inequality holds when using the held-out risk minimization Go-CART. [sent-189, score-0.269]

63 Let T ∈ TN be a DPT which induces a partition Π(T ) = {X1 , . . . , XmT }. [sent-191, score-0.328]

64 Let (T̂ , µ̂T̂ , Ω̂T̂ ) be the estimator constructed using the held-out risk minimization criterion of Definition 2. [sent-203, score-0.26]

65 Then, for sufficiently large n, the excess risk inequality R(T̂ , µ̂T̂ , Ω̂T̂ ) − R∗ ≤ inf_{T ∈ TN} [ 3 φn (T ) + inf_{µXj ∈ Mj , ΩXj ∈ Λj} ( R(T, µT , ΩT ) − R∗ ) ] + φn (T̂ ) holds with probability at least 1 − δ. [sent-204, score-0.34]

66 We now temporarily make the strong assumption that the model is correct, so that Y given X is conditionally Gaussian, with a partition structure that is given by a dyadic tree. [sent-207, score-0.618]

67 We show that with high probability, the true dyadic partition structure can be correctly recovered. [sent-208, score-0.591]

68 The true model is Y | X = x ∼ Np (µ∗_{T ∗} (x), Ω∗_{T ∗} (x)), where T ∗ ∈ TN is a DPT with induced partition Π(T ∗ ) = {Xj∗ }_{j=1}^{mT ∗} , and µ∗_{T ∗} (x) = Σ_{j=1}^{mT ∗} µ∗_j I(x ∈ Xj∗ ), Ω∗_{T ∗} (x) = Σ_{j=1}^{mT ∗} Ω∗_j I(x ∈ Xj∗ ). [sent-210, score-0.324]

69 A tree estimation procedure T̂ is tree partition consistent in case P( Π(T ∗ ) ⊂ Π(T̂ ) ) → 1 as n → ∞. [sent-214, score-0.496]

70 Note that the estimated partition may be finer than the true partition. [sent-215, score-0.337]

71 Establishing a tree partition consistency result requires further technical assumptions. [sent-216, score-0.392]

72 The following assumption specifies that for arbitrary adjacent subregions of the true dyadic partition, either the means or the variances should be sufficiently different. [sent-217, score-0.508]

73 Let Xi∗ and Xj∗ be adjacent partition elements of T ∗ , so that they have a common parent node within T ∗ . [sent-220, score-0.335]

74 Moreover, the Go-CART estimator, in both the penalized risk minimization and held-out risk minimization forms, is tree partition consistent. [sent-230, score-0.918]

75 inf_{T : Π(T ∗ ) ⊄ Π(T )} inf_{µT , ΩT ∈ MT} R(T, µT , ΩT ) − R(T ∗ , µ∗_{T ∗} , Ω∗_{T ∗} ) > min{ . . . }. This result shows that, with high probability, we obtain a finer partition than T ∗ ; the assumptions do not, however, control the size of the resulting partition. [sent-231, score-0.42]

76 5 Experiments We now present the performance of the greedy partitioning algorithm of Section 3 on both synthetic data and a real meteorological dataset. [sent-233, score-0.21]

77 In the experiment, we always set the dyadic integer N = 2^10 to ensure that we can obtain fine-tuned partitions of the input space X . [sent-234, score-0.342]

78 (a) Estimated dyadic tree structure; (b) Ground truth partition. [sent-289, score-0.407]

79 The number labeled on each subregion corresponds to the leaf node ID of the tree in (a); (c) the held-out negative log-likelihood risk for each split. [sent-293, score-0.518]

80 The order of the splits corresponds to the IDs of the tree nodes (from small to large). [sent-294, score-0.178]

81 The unit hypercube is partitioned into 22 subregions as shown in Figure 1 (b). [sent-295, score-0.21]

82 For the t-th subregion, where 1 ≤ t ≤ 22, we generate an Erdős–Rényi random graph Gt = (V t , E t ) with the number of vertices p = 20, the number of edges |E| = 10, and maximum node degree four. [sent-296, score-0.288]

83 The learned dyadic tree structure and its induced partition are presented in Figure 1. [sent-302, score-0.731]

84 Even though the graphs within each subregion are sparse, the estimated graph obtained by pooling all the data together is highly dense. [sent-306, score-0.361]

85 As the greedy algorithm proceeds, the estimated graphs become sparser and sparser. [sent-307, score-0.212]

86 Out of the 82 simulations where we correctly identify the tree structure, we list the graph estimation performance for subregions 28, 29, 13, 14, 5, 6 in terms of precision, recall, and F1-score in Table 1. [sent-309, score-0.397]

87 Table 1: Graph estimation performance over different subregions, reporting mean Precision, Recall, and F1-score (with standard deviations) over 100 runs for subregions 28, 29, 13, 14, 5, and 6. [sent-310, score-0.611]

88 We also plot the held-out risk in subplot (c). [sent-351, score-0.212]

89 The whole risk curve illustrates a diminishing return behavior. [sent-353, score-0.173]

90 Correctly splitting the large rectangle leads to a significant decrease in the risk; in contrast, splitting the middle rectangles does not reduce the risk as much. [sent-354, score-0.315]

91 (a) Learned partitions for the 100 locations, projected onto the US map, with the estimated graphs for subregions 3, 10, and 33; (b) estimated graph with data pooled from all 100 locations; (c) the re-scaled partition pattern induced by the learned dyadic tree structure. [sent-357, score-1.267]

92 As a baseline, we estimate a sparse graph on the data pooled from all 100 locations, using the glasso algorithm; the estimated graph is shown in Figure 2 (b). [sent-368, score-0.542]

93 This apparently contradicts the basic domain knowledge that CO2 should be correlated with the solar radiation factors (including GLO and DIR), according to the IPCC report [6], one of the most authoritative reports in the field of meteorology. [sent-370, score-0.156]

94 Treating the longitude and latitude of each site as the two-dimensional covariate X, and the meteorological data of the p = 15 factors as the response Y , we estimate a dyadic tree structure using the greedy algorithm. [sent-372, score-0.543]

95 The result is a partition with 66 subregions, shown in Figure 2. [sent-373, score-0.288]

96 The graphs for subregions 3 and 10 (corresponding to the coast of California and Arizona states) are shown in subplot (a) of Figure 2. [sent-374, score-0.288]

97 The graphs for these two adjacent subregions are quite similar, suggesting spatial smoothness of the learned graphs. [sent-375, score-0.249]

98 Moreover, for both graphs, CO2 is connected to the solar radiation factor GLO through CH4 . [sent-376, score-0.156]

99 In contrast, for subregion 33, which corresponds to the north part of Arizona, the estimated graph is quite different. [sent-377, score-0.29]

100 In general, it is found that the graphs corresponding to the locations along the coasts are sparser than those corresponding to the locations in the mainland. [sent-378, score-0.174]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xj', 0.318), ('dyadic', 0.303), ('mt', 0.3), ('partition', 0.288), ('dpt', 0.223), ('tn', 0.197), ('subregions', 0.178), ('risk', 0.173), ('rout', 0.156), ('glasso', 0.144), ('glo', 0.137), ('wet', 0.134), ('subregion', 0.126), ('graph', 0.115), ('cart', 0.112), ('cld', 0.111), ('dtr', 0.111), ('tmn', 0.111), ('tmp', 0.111), ('tmx', 0.111), ('vap', 0.111), ('xmt', 0.111), ('tree', 0.104), ('mj', 0.102), ('meteorological', 0.098), ('dpts', 0.089), ('frs', 0.089), ('co', 0.085), ('dir', 0.085), ('rp', 0.084), ('solar', 0.078), ('climate', 0.078), ('radiation', 0.078), ('precision', 0.076), ('graphs', 0.071), ('leaf', 0.068), ('pen', 0.067), ('inf', 0.066), ('greedy', 0.065), ('pre', 0.064), ('covariance', 0.058), ('splitting', 0.058), ('oracle', 0.056), ('xy', 0.055), ('yi', 0.054), ('penalized', 0.053), ('gj', 0.05), ('estimated', 0.049), ('regression', 0.048), ('estimator', 0.047), ('node', 0.047), ('ln', 0.047), ('partitioning', 0.047), ('undirected', 0.046), ('al', 0.046), ('pooled', 0.046), ('argmint', 0.045), ('hyperrectangle', 0.045), ('ipcc', 0.045), ('yk', 0.044), ('tr', 0.041), ('sparse', 0.041), ('induces', 0.04), ('minimization', 0.04), ('meteorology', 0.039), ('dyadically', 0.039), ('subplot', 0.039), ('temperature', 0.039), ('partitions', 0.039), ('locations', 0.038), ('induced', 0.036), ('arizona', 0.036), ('dimension', 0.035), ('excess', 0.035), ('xi', 0.035), ('estimators', 0.035), ('np', 0.034), ('graphical', 0.034), ('rothman', 0.034), ('split', 0.034), ('log', 0.033), ('yj', 0.032), ('region', 0.032), ('hypercube', 0.032), ('estimate', 0.032), ('estimating', 0.03), ('xm', 0.03), ('element', 0.03), ('bl', 0.029), ('conditional', 0.029), ('sparser', 0.027), ('inc', 0.027), ('splits', 0.027), ('smoothing', 0.027), ('assumption', 0.027), ('rectangle', 0.026), ('trees', 0.026), ('friedman', 0.026), ('matrix', 0.025), ('inequalities', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999887 108 nips-2010-Graph-Valued Regression

Author: Han Liu, Xi Chen, Larry Wasserman, John D. Lafferty

Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y . In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method “Graph-optimized CART”, or GoCART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. 1

2 0.18821783 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems

Author: Han Liu, Xi Chen

Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristics. The superior performance of MDRTs are demonstrated on both synthetic and real datasets. 1

3 0.15327527 138 nips-2010-Large Margin Multi-Task Metric Learning

Author: Shibin Parameswaran, Kilian Q. Weinberger

Abstract: Multi-task learning (MTL) improves the prediction performance on multiple, different but related, learning problems through shared parameters or representations. One of the most prominent multi-task learning algorithms is an extension to support vector machines (svm) by Evgeniou et al. [15]. Although very elegant, multi-task svm is inherently restricted by the fact that support vector machines require each class to be addressed explicitly with its own weight vector which, in a multi-task setting, requires the different learning tasks to share the same set of classes. This paper proposes an alternative formulation for multi-task learning by extending the recently published large margin nearest neighbor (lmnn) algorithm to the MTL paradigm. Instead of relying on separating hyperplanes, its decision function is based on the nearest neighbor rule which inherently extends to many classes and becomes a natural fit for multi-task learning. We evaluate the resulting multi-task lmnn on real-world insurance data and speech classification problems and show that it consistently outperforms single-task kNN under several metrics and state-of-the-art MTL classifiers. 1

4 0.11686841 166 nips-2010-Minimum Average Cost Clustering

Author: Kiyohito Nagano, Yoshinobu Kawahara, Satoru Iwata

Abstract: A number of objective functions in clustering problems can be described with submodular functions. In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. The proposed algorithm does not require the number of clusters in advance, and it will be determined by the property of a given set of data points. The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. Additionally, we evaluate the performance of the proposed algorithm through computational experiments. 1

5 0.10800975 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models

Author: Han Liu, Kathryn Roeder, Larry Wasserman

Abstract: A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.

6 0.10497639 203 nips-2010-Parametric Bandits: The Generalized Linear Case

7 0.095943496 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

8 0.095082313 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

9 0.08865875 243 nips-2010-Smoothness, Low Noise and Fast Rates

10 0.083642751 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

11 0.081124127 117 nips-2010-Identifying graph-structured activation patterns in networks

12 0.080572948 144 nips-2010-Learning Efficient Markov Networks

13 0.074301995 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

14 0.073955722 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior

15 0.072621994 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

16 0.072551921 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data

17 0.072213173 223 nips-2010-Rates of convergence for the cluster tree

18 0.071132421 63 nips-2010-Distributed Dual Averaging In Networks

19 0.070337795 162 nips-2010-Link Discovery using Graph Feature Tracking

20 0.07015489 22 nips-2010-Active Estimation of F-Measures


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.205), (1, 0.06), (2, 0.123), (3, 0.071), (4, -0.036), (5, -0.067), (6, -0.027), (7, -0.004), (8, 0.037), (9, 0.036), (10, -0.08), (11, -0.003), (12, -0.068), (13, 0.001), (14, 0.005), (15, -0.057), (16, 0.052), (17, -0.112), (18, -0.047), (19, -0.035), (20, -0.14), (21, -0.15), (22, -0.052), (23, 0.039), (24, 0.021), (25, 0.062), (26, -0.009), (27, 0.106), (28, 0.043), (29, 0.042), (30, 0.046), (31, 0.021), (32, -0.128), (33, 0.05), (34, 0.036), (35, 0.11), (36, 0.093), (37, -0.178), (38, 0.086), (39, -0.023), (40, -0.037), (41, 0.151), (42, -0.053), (43, 0.065), (44, -0.023), (45, -0.046), (46, -0.085), (47, 0.049), (48, -0.005), (49, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95454222 108 nips-2010-Graph-Valued Regression

Author: Han Liu, Xi Chen, Larry Wasserman, John D. Lafferty

Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y . In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method “Graph-optimized CART”, or GoCART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. 1

2 0.73660058 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems

Author: Han Liu, Xi Chen

Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristics. The superior performance of MDRTs are demonstrated on both synthetic and real datasets. 1

3 0.60459816 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models

Author: Han Liu, Kathryn Roeder, Larry Wasserman

Abstract: A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.

4 0.56925654 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

Author: Rina Foygel, Mathias Drton

Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1

5 0.47711477 144 nips-2010-Learning Efficient Markov Networks

Author: Vibhav Gogate, William Webb, Pedro Domingos

Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1

6 0.45147032 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs

7 0.43347153 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

8 0.43077537 117 nips-2010-Identifying graph-structured activation patterns in networks

9 0.43036374 150 nips-2010-Learning concept graphs from text with stick-breaking priors

10 0.42624488 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance

11 0.4212831 203 nips-2010-Parametric Bandits: The Generalized Linear Case

12 0.42014104 205 nips-2010-Permutation Complexity Bound on Out-Sample Error

13 0.41968626 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

14 0.41262764 121 nips-2010-Improving Human Judgments by Decontaminating Sequential Dependencies

15 0.41105807 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities

16 0.40926537 66 nips-2010-Double Q-learning

17 0.39975727 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems

18 0.39758185 138 nips-2010-Large Margin Multi-Task Metric Learning

19 0.39728808 172 nips-2010-Multi-Stage Dantzig Selector

20 0.39172205 223 nips-2010-Rates of convergence for the cluster tree


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.029), (17, 0.015), (27, 0.026), (30, 0.046), (35, 0.014), (45, 0.646), (50, 0.035), (52, 0.017), (60, 0.02), (77, 0.024), (90, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99821424 235 nips-2010-Self-Paced Learning for Latent Variable Models

Author: M. P. Kumar, Benjamin Packer, Daphne Koller

Abstract: Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. 1

2 0.99813223 187 nips-2010-Occlusion Detection and Motion Estimation with Convex Optimization

Author: Alper Ayvaci, Michalis Raptis, Stefano Soatto

Abstract: We tackle the problem of simultaneously detecting occlusions and estimating optical flow. We show that, under standard assumptions of Lambertian reflection and static illumination, the task can be posed as a convex minimization problem. Therefore, the solution, computed using efficient algorithms, is guaranteed to be globally optimal, for any number of independently moving objects, and any number of occlusion layers. We test the proposed algorithm on benchmark datasets, expanded to enable evaluation of occlusion detection performance. 1

3 0.99795574 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs

Author: Nikos Karampatziakis

Abstract: We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. In general, inference in segmentation models, such as semi-CRFs, can be cubic in the length of the sequence. By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands bytes long. Furthermore, we introduce two loss functions which are appropriate for our problem and show how to use structural SVMs to optimize the learned mapping for these losses. Finally, we present experimental results that demonstrate the advantages of our method against a strong baseline. 1

4 0.99777317 11 nips-2010-A POMDP Extension with Belief-dependent Rewards

Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, Françcois Charpillet

Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, what makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. 1

same-paper 5 0.99734205 108 nips-2010-Graph-Valued Regression

Author: Han Liu, Xi Chen, Larry Wasserman, John D. Lafferty

Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y . In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method “Graph-optimized CART”, or GoCART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. 1

6 0.99683672 133 nips-2010-Kernel Descriptors for Visual Recognition

7 0.99672019 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits

8 0.99591637 167 nips-2010-Mixture of time-warped trajectory models for movement decoding

9 0.99549949 100 nips-2010-Gaussian Process Preference Elicitation

10 0.98161954 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision

11 0.9788177 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs

12 0.97759098 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

13 0.97535294 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering

14 0.97292012 144 nips-2010-Learning Efficient Markov Networks

15 0.97291642 94 nips-2010-Feature Set Embedding for Incomplete Data

16 0.97217101 212 nips-2010-Predictive State Temporal Difference Learning

17 0.97213566 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets

18 0.97147548 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

19 0.97128052 118 nips-2010-Implicit Differentiation by Perturbation

20 0.97072107 61 nips-2010-Direct Loss Minimization for Structured Prediction