nips nips2013 nips2013-154 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ying Liu, Alan Willsky
Abstract: Gaussian Graphical Models (GGMs) or Gauss Markov random fields are widely used in many applications, and the trade-off between the modeling capacity and the efficiency of learning and inference has been an important research problem. In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. Exact inference such as computing the marginal distributions and the partition function has complexity O(k²n) using message-passing algorithms, where k is the size of the FVS, and n is the total number of nodes. We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of highly influential nodes, or hubs, while the rest of the network is modeled by a tree. Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate with complexity O(kn² + n² log n) if the FVS is known or in polynomial time if the FVS is unknown but has bounded size. 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing an inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank corrections with complexity O(kn² + n² log n) per iteration. We perform experiments using both synthetic data and real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. [sent-5, score-0.364]
2 Exact inference such as computing the marginal distributions and the partition function has complexity O(k²n) using message-passing algorithms, where k is the size of the FVS, and n is the total number of nodes. [sent-6, score-0.1]
3 We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of highly influential nodes, or hubs, while the rest of the network is modeled by a tree. [sent-7, score-0.378]
4 Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate with complexity O(kn² + n² log n) if the FVS is known or in polynomial time if the FVS is unknown but has bounded size. [sent-8, score-0.148]
5 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing an inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. [sent-9, score-0.421]
6 By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank corrections with complexity O(kn² + n² log n) per iteration. [sent-10, score-0.157]
7 We perform experiments using both synthetic data as well as real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes. [sent-11, score-0.118]
8 1 Introduction. In undirected graphical models or Markov random fields, each node represents a random variable while the set of edges specifies the conditional independencies of the underlying distribution. [sent-12, score-0.13]
9 When the random variables are jointly Gaussian, the models are called Gaussian graphical models (GGMs) or Gauss Markov random fields. [sent-13, score-0.109]
10 In general, a larger family of graphs represents a larger collection of distributions and thus can better approximate arbitrary empirical distributions. [sent-15, score-0.106]
11 Since trees have limited modeling capacity, many models beyond trees have been proposed [3, 4, 5, 6]. [sent-19, score-0.148]
12 Thin junction trees (graphs with low tree-width) are extensions of trees, where inference can be solved efficiently using the junction tree algorithm [7]. [sent-20, score-0.15]
13 Since graphs with large-degree nodes are important in modeling applications such as social networks, flight networks, and robotic localization, we are interested in finding a family of models that allow arbitrarily large degrees while being tractable for learning. [sent-24, score-0.307]
14 Hence, we are interested in finding a family of models that are not only sparse but also have guaranteed efficient inference algorithms. [sent-30, score-0.086]
15 In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all cycles [13]. [sent-31, score-0.385]
16 They have also presented results showing that for models with larger FVSs, approximate inference (obtained by replacing a full FVS by a pseudo-FVS) can work very well, with empirical evidence indicating that a pseudo-FVS of size O(log n) gives excellent results. [sent-33, score-0.098]
17 In Appendix A we will provide some additional analysis of inference for such models (including the computation of the partition function), but the main focus is maximum likelihood (ML) learning of models with FVSs of modest size, including identifying the nodes to include in the FVS. [sent-34, score-0.258]
18 We provide an algorithm for exact ML estimation that, regardless of the maximum degree, has complexity O(kn² + n² log n) if the FVS nodes are identified in advance and polynomial complexity if the FVS is to be learned and of bounded size. [sent-37, score-0.365]
19 In the second case, the FVS nodes are taken to be latent variables. [sent-39, score-0.284]
20 In this case, the structure learning problem corresponds to the (exact or approximate) decomposition of an inverse covariance matrix into the sum of a tree-structured matrix and a low-rank matrix. [sent-40, score-0.137]
21 By carefully incorporating efficient inference into the learning steps, we can further reduce the complexity to O(kn² + n² log n) per iteration. [sent-43, score-0.098]
22 We also perform experiments using both synthetic data and real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes. [sent-44, score-0.118]
23 We show that empirically the family of GGMs with FVSs of size O(log n) strikes a good balance between the modeling capacity and efficiency. [sent-45, score-0.111]
24 Related Work. In the context of classification, the authors of [15] have proposed the tree augmented naive Bayesian model, where the class label variable itself can be viewed as a size-one observed FVS; however, this model does not naturally extend to include a larger FVS. [sent-46, score-0.102]
25 In [16], a convex optimization framework is proposed to learn GGMs with latent variables, where conditioned on a small number of latent variables, the remaining nodes induce a sparse graph. [sent-47, score-0.417]
26 In our setting with latent FVSs, we further require the sparse subgraph to have tree structure. [sent-48, score-0.225]
27 2 Preliminaries. Each undirected graphical model has an underlying graph G = (V, E), where V denotes the set of vertices (nodes) and E the set of edges. [sent-49, score-0.093]
28 The parameters J and h are related to the mean µ and covariance matrix Σ by $\mu = J^{-1}h$ and $\Sigma = J^{-1}$. [sent-52, score-0.103]
29 For Gaussian distributions, the empirical distribution is $\hat{p}(x) = \mathcal{N}(x; \hat{\mu}, \hat{\Sigma})$, where the empirical mean is $\hat{\mu} = \frac{1}{s}\sum_{i=1}^{s} x_i$ and the empirical covariance matrix is $\hat{\Sigma} = \frac{1}{s}\sum_{i=1}^{s} x_i x_i^T - \hat{\mu}\hat{\mu}^T$. [sent-55, score-0.175]
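As a quick illustration of these empirical quantities, the sketch below (our own, in numpy; the array layout of the samples is an assumption) computes the empirical mean and covariance exactly as defined above.

```python
import numpy as np

def empirical_moments(X):
    """Empirical mean and covariance as defined above.

    X is assumed to be an (s, n) array holding s samples of an n-dimensional vector.
    """
    s = X.shape[0]
    mu_hat = X.sum(axis=0) / s                              # (1/s) * sum_i x_i
    Sigma_hat = (X.T @ X) / s - np.outer(mu_hat, mu_hat)    # (1/s) * sum_i x_i x_i^T - mu mu^T
    return mu_hat, Sigma_hat
```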
30 2 Tree-structured models are models whose underlying graphs do not have cycles. [sent-58, score-0.101]
31 We use $\hat{\Sigma}_{CL} = \mathrm{CL}(\hat{\Sigma})$ and $\hat{E}_{CL} = \mathrm{CL}_E(\hat{\Sigma})$ to denote respectively the covariance matrix and the set of edges learned using the Chow-Liu algorithm where the samples have empirical covariance matrix $\hat{\Sigma}$. [sent-60, score-0.316]
32 3 Gaussian Graphical Models with Known FVSs. In this section we briefly discuss some of the ideas related to GGMs with FVSs of size k, where we will also refer to the nodes in the FVS as feedback nodes. [sent-61, score-0.319]
33 An example of a graph and its FVS is given in Figure 1, where the full graph (Figure 1a) becomes a cycle-free graph (Figure 1b) if nodes 1 and 2 are removed, and thus the set {1, 2} is an FVS. [sent-62, score-0.338]
34 (a) Full graph; (b) Tree-structured subgraph after removing nodes 1 and 2. Graphs with small FVSs have been studied in various contexts. [sent-82, score-0.221]
35 The authors of [17] have characterized the family of graphs with small FVSs and their obstruction sets (sets of forbidden minors). [sent-83, score-0.11]
36 An important point to note is that the complexity of these algorithms depends simply on the size k and the number of nodes n. [sent-88, score-0.235]
37 There is no loss in generality in assuming that the size-k FVS F is fully connected and each of the feedback nodes has edges to every non-feedback node. [sent-89, score-0.323]
38 We will refer to the family of such models with a given FVS F as $Q_F$, and the class of models with some FVS of size at most k as $Q_k$. [sent-91, score-0.106]
39 samples, where the feedback nodes are either observed or latent variables. [sent-99, score-0.427]
40 If all nodes are observed, the empirical distribution $\hat{p}(x_F, x_T)$ is parametrized by the empirical covariance matrix $\hat{\Sigma}$. [sent-100, score-0.256] [sent-105, score-0.127]
41 (Footnote 1: In general a graph does not have a unique FVS. The family of graphs with FVSs of size k includes all graphs where there exists an FVS of size k.) [sent-101, score-0.173]
42 If the feedback nodes are latent variables, the empirical distribution $\hat{p}(x_T)$ has empirical covariance matrix $\hat{\Sigma}_T$. [sent-103, score-0.554]
44 $p_{ML}(x_F, x_T) = \arg\min_{q(x_F, x_T) \in Q_F} D_{KL}(\hat{p}(x_F, x_T) \,\|\, q(x_F, x_T))$. (1) This optimization problem is defined on the highly non-convex set $Q_F$ with combinatorial structure: indeed, there are $(n-k)^{n-k-2}$ possible spanning trees among the subgraph induced by the non-feedback nodes. [sent-111, score-0.206]
45 To obtain a concise expression, we also exploit a property of Gaussian distributions: the conditional information matrix (the information matrix of the conditional distribution) is simply a submatrix of the whole information matrix. [sent-114, score-0.104]
46 In Step 1 of Algorithm 1, we compute the conditional covariance matrix using the Schur complement, and then in Step 2 we use the Chow-Liu algorithm to obtain the best approximate ΣCL (whose inverse is tree-structured). [sent-115, score-0.141]
47 In Step 3, we match exactly the covariance matrix among the feedback nodes and the covariance matrix between the feedback nodes and the non-feedback nodes. [sent-116, score-0.855]
48 For the covariance matrix among the non-feedback nodes, we add the matrix subtracted in Step 1 back to ΣCL . [sent-117, score-0.165]
49 We denote the output covariance matrix of this algorithm as $\mathrm{CCL}(\hat{\Sigma})$. [sent-120, score-0.123]
50 Compute the conditional covariance matrix $\hat{\Sigma}_{T|F} = \hat{\Sigma}_T - \hat{\Sigma}_M \hat{\Sigma}_F^{-1} \hat{\Sigma}_M^T$. [sent-122, score-0.121]
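To make Steps 1-2 concrete, here is a minimal sketch of the Schur-complement and Chow-Liu steps, assuming numpy/networkx and the Gaussian pairwise mutual information I(i;j) = -0.5 log(1 - rho_ij^2); the function name and the networkx-based tree construction are our own, not the authors' implementation.

```python
import numpy as np
import networkx as nx

def conditioned_chow_liu_steps(Sigma_hat, F, T):
    """Steps 1-2 of Algorithm 1 as a sketch: Schur complement, then a
    maximum-weight spanning tree on Gaussian mutual-information weights."""
    S_F = Sigma_hat[np.ix_(F, F)]
    S_M = Sigma_hat[np.ix_(T, F)]                 # cross-covariance (T rows, F columns)
    S_T = Sigma_hat[np.ix_(T, T)]
    # Step 1: conditional covariance of the non-feedback nodes given the FVS
    S_cond = S_T - S_M @ np.linalg.solve(S_F, S_M.T)
    # Step 2: Chow-Liu tree = maximum-weight spanning tree under mutual information
    d = np.sqrt(np.diag(S_cond))
    rho = S_cond / np.outer(d, d)
    G = nx.Graph()
    for a in range(len(T)):
        for b in range(a + 1, len(T)):
            mi = -0.5 * np.log(max(1.0 - rho[a, b] ** 2, 1e-12))
            G.add_edge(T[a], T[b], weight=mi)
    tree_edges = list(nx.maximum_spanning_tree(G).edges())
    return S_cond, tree_edges
```

Step 3 of Algorithm 1 would then reassemble the full covariance by matching the feedback blocks and adding back the matrix subtracted in Step 1, as described above.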
51 That observation notwithstanding, the following greedy algorithm (Algorithm 2), which, at each iteration, selects the single best node to add to the current set of feedback nodes, has polynomial complexity for arbitrarily large FVSs. [sent-143, score-0.25]
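The selection loop itself is simple; below is a hedged sketch where the scoring function is left abstract and is assumed to return, say, the log-likelihood of the conditioned Chow-Liu fit for a candidate FVS (this wrapper is ours, not the paper's pseudocode).

```python
def greedy_fvs_selection(n, k, score_fn):
    """Select k feedback nodes greedily: at each iteration, add the single node
    that most improves the score of the fitted model (cf. Algorithm 2)."""
    F = []
    for _ in range(k):
        remaining = [v for v in range(n) if v not in F]
        best = max(remaining, key=lambda v: score_fn(F + [v]))
        F.append(best)
    return F
```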
52 2 When the FVS Nodes Are Latent Variables. When the feedback nodes are latent variables, the marginal distribution of the observed variables (the non-feedback nodes in the true model) has information matrix $\tilde{J}_T = \Sigma_T^{-1} = J_T - J_M J_F^{-1} J_M^T$. [sent-146, score-0.674]
53 If the exact $\tilde{J}_T$ is known, the learning problem is equivalent to decomposing the given inverse covariance matrix $\tilde{J}_T$ into the sum of a tree-structured matrix $J_T$ and a rank-k matrix $-J_M J_F^{-1} J_M^T$. [sent-147, score-0.191]
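The marginal information matrix above is just a Schur complement; a small helper (our own naming, with the k latent feedback nodes ordered first) makes the expression explicit.

```python
import numpy as np

def observed_information(J, k):
    """Information matrix of the observed block after marginalizing out the
    k latent feedback nodes: J_T - J_M J_F^{-1} J_M^T."""
    J_F = J[:k, :k]
    J_M = J[k:, :k]          # cross block (observed rows, latent columns)
    J_T = J[k:, k:]
    return J_T - J_M @ np.linalg.solve(J_F, J_M.T)
```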
54 In general, we use the ML criterion $q_{ML}(x_F, x_T) = \arg\min_{q(x_F, x_T) \in Q_F} D_{KL}(\hat{p}(x_T) \,\|\, q(x_T))$, (3) where the optimization is over models defined on all nodes (latent and observed), while the K-L divergence in the objective function is defined on the marginal distribution of the observed nodes only. [sent-148, score-0.455]
55 We propose the latent Chow-Liu algorithm, an alternating projection algorithm that is a variation of the EM algorithm and can be viewed as an instance of the majorization-minimization algorithm. [sent-149, score-0.192]
56 In the first projection, we obtain a distribution (on both observed and latent variables) whose marginal (on the observed variables) matches exactly the empirical distribution while maintaining the conditional distribution (of the latent variables given the observed ones). [sent-154, score-0.383]
57 Therefore, we are able to compute the second projection exactly even though the graph structure is unknown (which allows any tree structure among the observed nodes). [sent-159, score-0.237]
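As an illustration of the first projection, the sketch below (our own reconstruction in numpy, with the k latent feedback nodes ordered first; not the authors' code) builds the joint covariance whose observed marginal equals the empirical covariance while the conditional of the latent nodes given the observed ones is kept from the current model. The second projection would then refit a tree-plus-low-rank model from this joint covariance, e.g. via the conditioned Chow-Liu step sketched earlier.

```python
import numpy as np

def p1_match_observed_marginal(J, Sigma_T_hat, k):
    """First projection (P1): keep p(x_F | x_T) implied by the current J and
    replace the observed marginal covariance with Sigma_T_hat.
    Block convention: J = [[J_F, J_M^T], [J_M, J_T]] with x = [x_F; x_T]."""
    J_F = J[:k, :k]
    J_M = J[k:, :k]                                  # (observed rows, latent columns)
    JF_inv = np.linalg.inv(J_F)
    C = -JF_inv @ J_M.T                              # E[x_F | x_T] = C x_T (zero-mean case)
    cross_FT = C @ Sigma_T_hat                       # Cov(x_F, x_T)
    cov_F = JF_inv + C @ Sigma_T_hat @ C.T           # Cov(x_F)
    n = Sigma_T_hat.shape[0]
    Sigma = np.empty((k + n, k + n))
    Sigma[:k, :k] = cov_F
    Sigma[:k, k:] = cross_FT
    Sigma[k:, :k] = cross_FT.T
    Sigma[k:, k:] = Sigma_T_hat                      # observed marginal matches exactly
    return Sigma
```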
58 Note that when the feedback nodes are latent, we do not need to select the FVS since it is simply the set of latent nodes. [sent-160, score-0.358] [sent-161, score-0.105]
59 (Footnote 3: It is easy to see that different models having the same $J_M J_F^{-1} J_M^T$ cannot be distinguished using the samples, and thus without loss of generality we can assume $J_F$ is normalized to be the identity matrix in the final solution.)
60 This is the source of the simplification when we use latent nodes for the FVS: we do not need to search over sets of observed variables to include in the FVS. [sent-162, score-0.325]
61 Algorithm 3: The latent Chow-Liu algorithm. Input: the empirical covariance matrix $\hat{\Sigma}_T$. Output: information matrix $J = \begin{bmatrix} J_F & J_M^T \\ J_M & J_T \end{bmatrix}$, where $J_T$ is tree-structured. [sent-163, score-0.286]
62 In Algorithm 3 we summarize the latent Chow-Liu algorithm specialized for our family of GGMs, where both projections have exact closed-form solutions and exhibit complementary structure—one using the covariance and the other using the information parametrization. [sent-173, score-0.247]
63 In projection P1, three blocks of the information matrix remain the same; in projection P2, three blocks of the covariance matrix remain the same. [sent-174, score-0.195]
64 By carefully incorporating the inference algorithms into the projection steps, we are able to further exploit the power of the models and reduce the per-iteration complexity to O(kn² + n² log n), which is the same as the complexity of the conditioned Chow-Liu algorithm alone. [sent-177, score-0.236]
65 Due to the page limit, we defer the description of the accelerated version (the accelerated latent Chow-Liu algorithm) and the proof of Proposition 2 to Appendix C. [sent-184, score-0.155]
66 In fact, we never need to explicitly invert the empirical covariance matrix $\hat{\Sigma}_T$ in the accelerated version. [sent-185, score-0.152]
67 As a rule of thumb, we often use the spanning tree obtained by the standard Chow-Liu algorithm as an initial tree among the observed nodes. [sent-186, score-0.313]
68 But note that P2 involves solving a combinatorial problem exactly, so the algorithm is able to jump among different graph structures, which reduces the chance of getting stuck at a local optimum. [sent-187, score-0.123]
69 [Figure 2 panel annotations: fBM true model: KL=0; best spanning tree: KL=4.881] Figure 2: From left to right: 1) The true model (fBM with 64 time samples); 2) The best spanning tree; 3) The latent tree learned using the CLRG algorithm in [21]; 4) The latent tree learned using the NJ algorithm in [21]; 5) The model with a size-one latent FVS learned using Algorithm 3. [sent-191, score-0.719]
70 [Figure 3 plot area omitted: panels (a) 32 nodes, (b) 64 nodes, (c) 128 nodes, (d) 256 nodes; x-axis: Size of Latent FVS, y-axis: K-L Divergence] [sent-194, score-0.118]
71 Figure 3: The relationship between the K-L divergence and the latent FVS size. [sent-195, score-0.519]
72 Figure 2 shows the covariance matrices of approximate models using spanning trees (learned by the Chow-Liu algorithm), latent trees (learned by the CLRG and NJ algorithms in [21]) and our latent FVS model (learned by Algorithm 3) using 64 time samples (nodes). [sent-203, score-0.492]
73 We can see that in the spanning tree the correlation decays quickly (in fact exponentially) with distance, which models the fBM poorly. [sent-204, score-0.189]
74 The latent trees that are learned exhibit blocky artifacts and have little or no improvement over the spanning tree measured in the K-L divergence. [sent-205, score-0.36]
75 In Figure 3, we plot the K-L divergence (between the true model and the learned models using Algorithm 3) versus the size of the latent FVSs for models with 32, 64, 128, and 256 time samples respectively. [sent-206, score-0.275]
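For reference, the K-L divergence between two zero-mean Gaussians, used here as the evaluation metric, can be computed with the standard formula below (the helper itself is ours, not part of the paper's code).

```python
import numpy as np

def gaussian_kl(Sigma_p, Sigma_q):
    """D_KL( N(0, Sigma_p) || N(0, Sigma_q) ) for the zero-mean case."""
    n = Sigma_p.shape[0]
    M = np.linalg.solve(Sigma_q, Sigma_p)            # Sigma_q^{-1} Sigma_p
    sign, logdet = np.linalg.slogdet(M)              # sign should be +1 for PD inputs
    return 0.5 * (np.trace(M) - n - logdet)
```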
76 For these models, we need about 1, 3, 5, and 7 feedback nodes respectively to reduce the K-L divergence to 25% of that achieved by the best spanning tree model. [sent-207, score-0.517]
77 Hence, we speculate that empirically k = O(log n) is a proper choice of the size of the latent FVS. [sent-208, score-0.126]
78 In our experiments, for different initial structures, Algorithm 3 converges to the same graph structures (that give the K-L divergence as shown in Figure 3) within three iterations. [sent-210, score-0.131]
79 Performance of the Greedy Algorithm: Observed FVS. In this experiment, we examine the performance of the greedy algorithm (Algorithm 2) when the FVS nodes are observed. [sent-211, score-0.235]
80 For each run, we construct a GGM that has 20 nodes and an FVS of size three as the true model. [sent-212, score-0.2]
81 We first generate a random spanning tree among the non-feedback nodes. [sent-213, score-0.191]
82 Figure 4 shows the graphs (and the K-L divergence) obtained using the greedy algorithm for a typical run. [sent-220, score-0.105]
83 The thicker blue lines represent the edges among the non-feedback nodes and the thinner red lines represent other edges. [sent-230, score-0.232]
84 The red dots denote selected feedback nodes and the blue lines represent edges among the non-feedback nodes (other edges involving the feedback nodes are omitted for clarity). [sent-233, score-0.853]
85 Flight Delay Model: Observed FVS. In this experiment, we model the relationships among airports for flight delays. [sent-234, score-0.118]
86 from 1987 to 2008, including information such as scheduled departure time, scheduled arrival time, departure delay, arrival delay, cancellation, and reasons for cancellation for all domestic flights in the U.S. [sent-238, score-0.214]
87 We want to model how the flight delays at different airports are related to each other using GGMs. [sent-240, score-0.151]
88 First, we compute the average departure delay for each day and each airport (of the top 200 busiest airports) using data from the year 2008. [sent-241, score-0.099]
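A rough preprocessing sketch of that step follows, assuming the records sit in a CSV and using hypothetical column names ("Origin", "DepDelay", "FlightDate") that may not match the actual dataset schema.

```python
import pandas as pd

flights = pd.read_csv("flights_2008.csv")             # hypothetical file name
top_airports = flights["Origin"].value_counts().head(200).index
daily_delay = (flights[flights["Origin"].isin(top_airports)]
               .groupby(["FlightDate", "Origin"])["DepDelay"]
               .mean()
               .unstack("Origin"))                     # rows: days, columns: airports
Sigma_hat = daily_delay.cov().to_numpy()               # empirical covariance of average delays
```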
89 Note that the average departure delay does not directly indicate whether an airport is one of the major airports with heavy traffic. [sent-242, score-0.099]
90 It is interesting to see whether major airports (especially those notorious for delays) correspond to feedback nodes in the learned models. [sent-243, score-0.429]
91 Figure 5a shows the best tree-structured graph obtained by the Chow-Liu algorithm (with input being the covariance matrix of the average delay). [sent-244, score-0.156]
92 Starting with the next node selected in our greedy algorithm, we begin to see hubs being chosen. [sent-248, score-0.106]
93 In particular, the first ten airports selected in order are: BNA, Chicago, Atlanta, Oakland, Newark, Dallas, San Francisco, Seattle, Washington DC, Salt Lake City. [sent-249, score-0.09]
94 Some other major airports (e.g., Los Angeles and JFK) are not selected, as their influence on delays at other domestic airports is well-captured with a tree structure. [sent-252, score-0.252]
95 Willsky, “Exploiting sparse Markov and covariance structure in multiresolution models,” in Proc. [sent-277, score-0.09]
96 Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. [sent-338, score-0.109]
97 Fellows, “Forbidden minors to graphs with small feedback sets,” Discrete Mathematics, vol. [sent-379, score-0.191]
98 Fujito, “A 2-approximation algorithm for the undirected feedback vertex set problem,” SIAM J. [sent-393, score-0.156]
99 Robertson, “Conditional Chow-Liu tree structures for modeling discrete-valued vector time series,” in Proceedings of the 20th conference on Uncertainty in artificial intelligence. [sent-401, score-0.12]
100 Willsky, “Learning latent tree graphical models,” Journal of Machine Learning Research, vol. [sent-411, score-0.223]
wordName wordTfidf (topN-words)
[('fvs', 0.748), ('fvss', 0.278), ('xf', 0.181), ('nodes', 0.179), ('jm', 0.174), ('ggms', 0.17), ('jf', 0.167), ('ggm', 0.124), ('feedback', 0.119), ('latent', 0.105), ('jt', 0.102), ('ight', 0.096), ('airports', 0.09), ('ml', 0.086), ('spanning', 0.085), ('tree', 0.078), ('cl', 0.074), ('qf', 0.074), ('covariance', 0.069), ('xt', 0.061), ('delays', 0.061), ('willsky', 0.058), ('divergence', 0.056), ('bna', 0.056), ('graph', 0.053), ('dkl', 0.051), ('trees', 0.051), ('fbm', 0.049), ('hubs', 0.049), ('graphs', 0.049), ('subgraph', 0.042), ('clrg', 0.042), ('ecl', 0.042), ('learned', 0.041), ('kl', 0.041), ('departure', 0.04), ('graphical', 0.04), ('delay', 0.038), ('eml', 0.037), ('capacity', 0.037), ('greedy', 0.036), ('complexity', 0.035), ('matrix', 0.034), ('family', 0.033), ('projection', 0.029), ('among', 0.028), ('conditioned', 0.028), ('ccl', 0.028), ('cle', 0.028), ('forbidden', 0.028), ('qml', 0.028), ('tournaments', 0.028), ('inference', 0.027), ('junction', 0.026), ('models', 0.026), ('exactly', 0.025), ('accelerated', 0.025), ('edges', 0.025), ('scheduled', 0.025), ('treestructured', 0.025), ('observed', 0.024), ('chandrasekaran', 0.024), ('ft', 0.024), ('empirical', 0.024), ('gauss', 0.023), ('minors', 0.023), ('cancellation', 0.023), ('domestic', 0.023), ('structures', 0.022), ('node', 0.021), ('corrections', 0.021), ('airport', 0.021), ('multiresolution', 0.021), ('cycles', 0.021), ('nj', 0.021), ('size', 0.021), ('exact', 0.02), ('incorporating', 0.02), ('modeling', 0.02), ('algorithm', 0.02), ('brownian', 0.019), ('allerton', 0.019), ('polynomial', 0.019), ('anandkumar', 0.019), ('arrival', 0.019), ('conditional', 0.018), ('alternating', 0.018), ('marginal', 0.017), ('appendix', 0.017), ('vertex', 0.017), ('variables', 0.017), ('xv', 0.016), ('breaks', 0.016), ('choi', 0.016), ('fractional', 0.016), ('eecs', 0.016), ('proposition', 0.016), ('log', 0.016), ('gaussian', 0.015), ('bp', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
Author: Ying Liu, Alan Willsky
Abstract: Gaussian Graphical Models (GGMs) or Gauss Markov random fields are widely used in many applications, and the trade-off between the modeling capacity and the efficiency of learning and inference has been an important research problem. In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. Exact inference such as computing the marginal distributions and the partition function has complexity O(k 2 n) using message-passing algorithms, where k is the size of the FVS, and n is the total number of nodes. We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of highly influential nodes, or hubs, while the rest of the networks is modeled by a tree. Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate with complexity O(kn2 + n2 log n) if the FVS is known or in polynomial time if the FVS is unknown but has bounded size. 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing an inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank corrections with complexity O(kn2 + n2 log n) per iteration. We perform experiments using both synthetic data as well as real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes. 1
2 0.20385279 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
Author: Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
Abstract: As massively parallel computations have become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity. Discriminatively trained convolutional neural networks, in particular, were recently shown to yield state-of-the-art performance in challenging image classification benchmarks such as ImageNet. However, elements of these architectures are similar to standard hand-crafted representations used in computer vision. In this paper, we explore the extent of this analogy, proposing a version of the stateof-the-art Fisher vector image encoding that can be stacked in multiple layers. This architecture significantly improves on standard Fisher vectors, and obtains competitive results with deep convolutional networks at a smaller computational learning cost. Our hybrid architecture allows us to assess how the performance of a conventional hand-crafted image classification pipeline changes with increased depth. We also show that convolutional networks and Fisher vector encodings are complementary in the sense that their combination further improves the accuracy. 1
3 0.092047244 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1
4 0.08266025 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
Author: Daniel S. Levine, Jonathan P. How
Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1
5 0.062999927 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
6 0.062862433 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
7 0.061906353 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning
8 0.060791072 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
9 0.060733244 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
10 0.058739346 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
11 0.05764043 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
12 0.057408068 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
13 0.053774983 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
14 0.052194022 289 nips-2013-Scalable kernels for graphs with continuous attributes
15 0.051887266 203 nips-2013-Multilinear Dynamical Systems for Tensor Time Series
16 0.048006125 355 nips-2013-Which Space Partitioning Tree to Use for Search?
17 0.047497783 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation
18 0.047469672 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
19 0.046661999 70 nips-2013-Contrastive Learning Using Spectral Methods
20 0.046609607 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
topicId topicWeight
[(0, 0.133), (1, 0.032), (2, -0.015), (3, -0.01), (4, 0.047), (5, 0.028), (6, 0.046), (7, -0.015), (8, 0.022), (9, -0.043), (10, -0.004), (11, -0.1), (12, 0.103), (13, -0.019), (14, -0.038), (15, -0.012), (16, -0.018), (17, 0.026), (18, -0.011), (19, -0.012), (20, 0.087), (21, 0.036), (22, 0.07), (23, -0.048), (24, 0.026), (25, -0.007), (26, 0.052), (27, 0.047), (28, 0.037), (29, 0.006), (30, 0.059), (31, -0.087), (32, -0.028), (33, 0.005), (34, 0.038), (35, 0.126), (36, -0.005), (37, -0.053), (38, 0.046), (39, -0.064), (40, 0.085), (41, 0.017), (42, 0.0), (43, -0.082), (44, 0.095), (45, 0.045), (46, -0.114), (47, 0.095), (48, -0.154), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.87570107 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
Author: Ying Liu, Alan Willsky
Abstract: Gaussian Graphical Models (GGMs) or Gauss Markov random fields are widely used in many applications, and the trade-off between the modeling capacity and the efficiency of learning and inference has been an important research problem. In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. Exact inference such as computing the marginal distributions and the partition function has complexity O(k 2 n) using message-passing algorithms, where k is the size of the FVS, and n is the total number of nodes. We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of highly influential nodes, or hubs, while the rest of the networks is modeled by a tree. Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate with complexity O(kn2 + n2 log n) if the FVS is known or in polynomial time if the FVS is unknown but has bounded size. 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing an inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank corrections with complexity O(kn2 + n2 log n) per iteration. We perform experiments using both synthetic data as well as real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes. 1
2 0.60476404 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1
3 0.58725613 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
Author: Daniel S. Levine, Jonathan P. How
Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1
4 0.46895093 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests
Author: Yacine Jernite, Yonatan Halpern, David Sontag
Abstract: We give a polynomial-time algorithm for provably learning the structure and parameters of bipartite noisy-or Bayesian networks of binary variables where the top layer is completely hidden. Unsupervised learning of these models is a form of discrete factor analysis, enabling the discovery of hidden variables and their causal relationships with observed data. We obtain an efficient learning algorithm for a family of Bayesian networks that we call quartet-learnable. For each latent variable, the existence of a singly-coupled quartet allows us to uniquely identify and learn all parameters involving that latent variable. We give a proof of the polynomial sample complexity of our learning algorithm, and experimentally compare it to variational EM. 1
5 0.45078439 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
Author: Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
Abstract: As massively parallel computations have become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity. Discriminatively trained convolutional neural networks, in particular, were recently shown to yield state-of-the-art performance in challenging image classification benchmarks such as ImageNet. However, elements of these architectures are similar to standard hand-crafted representations used in computer vision. In this paper, we explore the extent of this analogy, proposing a version of the stateof-the-art Fisher vector image encoding that can be stacked in multiple layers. This architecture significantly improves on standard Fisher vectors, and obtains competitive results with deep convolutional networks at a smaller computational learning cost. Our hybrid architecture allows us to assess how the performance of a conventional hand-crafted image classification pipeline changes with increased depth. We also show that convolutional networks and Fisher vector encodings are complementary in the sense that their combination further improves the accuracy. 1
6 0.44143137 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
7 0.43946201 340 nips-2013-Understanding variable importances in forests of randomized trees
8 0.43709114 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation
9 0.43673864 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
10 0.43368831 7 nips-2013-A Gang of Bandits
11 0.42224637 47 nips-2013-Bayesian Hierarchical Community Discovery
12 0.42125076 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles
13 0.4179599 85 nips-2013-Deep content-based music recommendation
14 0.39823669 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
15 0.3908852 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
16 0.37675178 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
17 0.37505108 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
18 0.36905125 184 nips-2013-Marginals-to-Models Reducibility
19 0.36733037 289 nips-2013-Scalable kernels for graphs with continuous attributes
20 0.36321425 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
topicId topicWeight
[(16, 0.052), (33, 0.124), (34, 0.075), (41, 0.03), (49, 0.021), (56, 0.09), (70, 0.021), (85, 0.054), (89, 0.02), (93, 0.03), (95, 0.375)]
simIndex simValue paperId paperTitle
1 0.85690862 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference
Author: Guy van den Broeck, Adnan Darwiche
Abstract: Lifted inference algorithms exploit symmetries in probabilistic models to speed up inference. They show impressive performance when calculating unconditional probabilities in relational models, but often resort to non-lifted inference when computing conditional probabilities. The reason is that conditioning on evidence breaks many of the model’s symmetries, which can preempt standard lifting techniques. Recent theoretical results show, for example, that conditioning on evidence which corresponds to binary relations is #P-hard, suggesting that no lifting is to be expected in the worst case. In this paper, we balance this negative result by identifying the Boolean rank of the evidence as a key parameter for characterizing the complexity of conditioning in lifted inference. In particular, we show that conditioning on binary evidence with bounded Boolean rank is efficient. This opens up the possibility of approximating evidence by a low-rank Boolean matrix factorization, which we investigate both theoretically and empirically. 1
2 0.80135882 268 nips-2013-Reflection methods for user-friendly submodular optimization
Author: Stefanie Jegelka, Francis Bach, Suvrit Sra
Abstract: Recently, it has become evident that submodularity naturally captures widely occurring concepts in machine learning, signal processing and computer vision. Consequently, there is need for efficient optimization procedures for submodular functions, especially for minimization problems. While general submodular minimization is challenging, we propose a new method that exploits existing decomposability of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize. A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution. This method solves both the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruction. In our experiments, we illustrate the benefits of our method on two image segmentation tasks. 1
3 0.75321889 309 nips-2013-Statistical Active Learning Algorithms
Author: Maria-Florina Balcan, Vitaly Feldman
Abstract: We describe a framework for designing efficient active learning algorithms that are tolerant to random classification noise and differentially-private. The framework is based on active learning algorithms that are statistical in the sense that they rely on estimates of expectations of functions of filtered random examples. It builds on the powerful statistical query framework of Kearns [30]. We show that any efficient active statistical learning algorithm can be automatically converted to an efficient active learning algorithm which is tolerant to random classification noise as well as other forms of “uncorrelated” noise. We show that commonly studied concept classes including thresholds, rectangles, and linear separators can be efficiently actively learned in our framework. These results combined with our generic conversion lead to the first computationally-efficient algorithms for actively learning some of these concept classes in the presence of random classification noise that provide exponential improvement in the dependence on the error over their passive counterparts. In addition, we show that our algorithms can be automatically converted to efficient active differentially-private algorithms. This leads to the first differentially-private active learning algorithms with exponential label savings over the passive case. 1
4 0.71672857 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
Author: Po-Ling Loh, Martin J. Wainwright
Abstract: We establish theoretical results concerning local optima of regularized M estimators, where both loss and penalty functions are allowed to be nonconvex. Our results show that as long as the loss satisfies restricted strong convexity and the penalty satisfies suitable regularity conditions, any local optimum of the composite objective lies within statistical precision of the true parameter vector. Our theory covers a broad class of nonconvex objective functions, including corrected versions of the Lasso for errors-in-variables linear models and regression in generalized linear models using nonconvex regularizers such as SCAD and MCP. On the optimization side, we show that a simple adaptation of composite gradient descent may be used to compute a global optimum up to the statistical precision stat in log(1/ stat ) iterations, the fastest possible rate for any first-order method. We provide simulations to illustrate the sharpness of our theoretical predictions. 1
same-paper 5 0.69362295 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
Author: Ying Liu, Alan Willsky
Abstract: Gaussian Graphical Models (GGMs) or Gauss Markov random fields are widely used in many applications, and the trade-off between the modeling capacity and the efficiency of learning and inference has been an important research problem. In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. Exact inference such as computing the marginal distributions and the partition function has complexity O(k 2 n) using message-passing algorithms, where k is the size of the FVS, and n is the total number of nodes. We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of highly influential nodes, or hubs, while the rest of the networks is modeled by a tree. Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate with complexity O(kn2 + n2 log n) if the FVS is known or in polynomial time if the FVS is unknown but has bounded size. 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing an inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank corrections with complexity O(kn2 + n2 log n) per iteration. We perform experiments using both synthetic data as well as real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes. 1
6 0.64081788 185 nips-2013-Matrix Completion From any Given Set of Observations
7 0.59144294 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries
8 0.58604842 42 nips-2013-Auditing: Active Learning with Outcome-Dependent Query Costs
9 0.58338553 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions
10 0.55975604 149 nips-2013-Latent Structured Active Learning
11 0.55875707 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
12 0.5468657 184 nips-2013-Marginals-to-Models Reducibility
13 0.5458684 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
14 0.54262489 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation
15 0.53719032 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
16 0.53312892 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
17 0.53160769 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
18 0.53063732 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints
19 0.52755231 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
20 0.52045363 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent