nips nips2010 nips2010-144 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. [sent-4, score-0.144]
2 This is made possible by exploiting context-specific independence and determinism in the domain. [sent-5, score-0.133]
3 The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc. [sent-6, score-0.379]
4 Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. [sent-8, score-0.537]
5 We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. [sent-9, score-0.475]
6 Inference in Markov networks is intractable [25], and approximate inference schemes can be unreliable, and often require much hand-crafting. [sent-15, score-0.144]
7 Structure learning – the problem of finding the features of the Markov network – is also intractable [15], and has weight learning and inference as subroutines. [sent-18, score-0.187]
8 Intractable inference and weight optimization can be avoided if we restrict ourselves to decomposable Markov networks [22]. [sent-19, score-0.189]
9 A decomposable model can be expressed as a product of distributions over the cliques in the graph divided by the product of the distributions of their intersections. [sent-20, score-0.138]
10 More recently, a series of papers has proposed methods for directly learning junction trees of bounded treewidth ([2, 21, 8], etc.). [sent-25, score-0.506]
11 Unfortunately, since the complexity of inference (and typically of learning) is exponential in the treewidth, only models of very low treewidth (typically 2 or 3) are feasible in practice, and thin junction trees have not found wide applicability. [sent-27, score-0.667]
12 Models can have high treewidth and still allow tractable inference and closed-form weight learning from a reasonable number of samples, by exploiting context-specific independence [6] and determinism [7]. [sent-29, score-0.364]
13 Decomposable models can be expressed as both Markov networks and Bayesian networks, and state-of-the-art Bayesian network learners extensively exploit context-specific independence [9]. [sent-36, score-0.222]
14 Lowd and Domingos [18] learned tractable high-treewidth Bayesian networks by penalizing inference complexity along with model complexity in a standard Bayesian network learner. [sent-38, score-0.33]
15 (The treewidth of the resulting model can still be as large as the number of variables. [sent-46, score-0.126]
16 ) We then propose greedy heuristics for more efficient learning, and show empirically that the Markov networks learned in this way are more accurate than thin junction trees as well as networks learned using the algorithm of Della Pietra et al. [sent-47, score-0.618]
17 An atomic feature or literal is an assignment of a value to a variable. [sent-54, score-0.365]
18 x denotes the assignment x = 1 while ¬x denotes x = 0 (note that the distinction between an atomic feature x and the variable which is also denoted by x is usually clear from context). [sent-55, score-0.396]
19 A feature, denoted by F , defined over a subset of variables V (F ), is formed by conjoining atomic features or literals, e.g. [sent-56, score-0.436]
20 x1 ∧ ¬x2 is a feature formed by conjoining two atomic features x1 and ¬x2 . [sent-58, score-0.538]
21 A feature that is not satisfied is said to be assigned the value 0. [sent-60, score-0.178]
22 Often, given a feature F , we will abuse notation and write V (F ) as F . [sent-61, score-0.178]
23 A Markov network or a log-linear model is defined as a set of pairs (Fi , wi ) where Fi is a feature and wi is its weight. [sent-62, score-0.242]
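For reference, the standard log-linear form being referred to here (a well-known identity, not quoted from the extracted sentences) is:

```latex
P(x) \;=\; \frac{1}{Z} \exp\!\Big(\sum_i w_i \, F_i(x)\Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big(\sum_i w_i \, F_i(x')\Big),
```

where F_i(x) is 1 if the assignment x satisfies feature F_i and 0 otherwise, and Z is the partition function.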
24 Let C = {C1 , . . . , Cm } be a collection of subsets of V such that: (a) C1 ∪ · · · ∪ Cm = V and (b) for each feature Fj , there exists a Ci ∈ C such that all variables of Fj are contained in Ci . [sent-68, score-0.223]
25 A leaf feature is formed by conjoining the feature assignments along the path from the leaf to the root. [sent-71, score-0.869]
26 For example, the feature corresponding to the rightmost leaf node is: (x1 ∧ x2 ) ∧ (x4 ∧ x6 ). [sent-72, score-0.337]
27 For the feature tree, ovals denote F-nodes and rectangles denote A-nodes. [sent-73, score-0.218]
28 For the junction tree, ovals denote cliques and rectangles denote separators. [sent-74, score-0.392]
29 Notice that each F-node in the feature tree has a feature of size bounded by 2 while the maximum clique in the junction tree is of size 5. [sent-75, score-1.192]
30 A tree T = (C, E) is a junction tree iff it satisfies the running intersection property, i.e., for any two cliques Ci and Cj in C, every clique on the path between them contains Ci ∩ Cj . [sent-78, score-0.705]
31 The treewidth of T , denoted by w, is the size of the largest clique in C minus one. [sent-81, score-0.229]
32 The space complexity of representing a junction tree is O( Σ_{i=1}^{m} 2^{|Ci|} ) ≡ O(n × 2^{w+1}). [sent-83, score-0.548]
33 Our goal is to exploit context-specific and deterministic dependencies that are not explicitly represented in junction trees. [sent-84, score-0.303]
34 We will use a more convenient form for our purposes, which we call feature graphs. [sent-86, score-0.178]
35 Inference in feature graphs is linear in the size of the graph. [sent-87, score-0.242]
36 For readers familiar with AND/OR graphs [11], a feature tree (or graph) is simply an AND/OR tree (or graph) with OR nodes corresponding to features and AND nodes corresponding to feature assignments. [sent-88, score-0.845]
37 A feature tree denoted by ST is a rooted-tree that consists of alternating levels of feature nodes or F-nodes and feature assignment nodes or A-nodes. [sent-90, score-0.839]
38 Each F-node F is labeled by a feature F and has two child A-nodes labeled by 1 and 0, corresponding to the true and the false assignments of F respectively. [sent-91, score-0.344]
39 Let {FA,1 , . . . , FA,k } be the set of child F-nodes of A and let D(FA,i ) be the union of all variables involved in the features associated with FA,i and all its descendants; then ∀i, j ∈ {1, . . . , k} with i ≠ j, D(FA,i ) ∩ D(FA,j ) = ∅. [sent-96, score-0.188]
40 The space complexity of representing a feature tree is the number of its A-nodes. [sent-101, score-0.423]
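As a concrete (hypothetical) rendering of the alternating F-node/A-node structure described in sentences 37–40, the following Python sketch defines the two node types; all class and field names are our own, not the authors'.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ANode:
    """Assignment node: records whether the parent feature was assigned 1 (true) or 0 (false)."""
    value: int                                             # 0 or 1
    weight: float = 1.0                                    # w(A), used when this A-node is a leaf
    children: List["FNode"] = field(default_factory=list)  # child F-nodes over disjoint variable sets

@dataclass
class FNode:
    """Feature node: labeled by a conjunction of literals, e.g. {"x1": 1, "x2": 0} for x1 ∧ ¬x2."""
    feature: Dict[str, int]
    true_child: Optional[ANode] = None                     # A-node for the feature being satisfied
    false_child: Optional[ANode] = None                    # A-node for the feature being falsified
```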
41 A feature graph denoted by SG is formed by merging identical subtrees of a feature tree ST . [sent-102, score-0.721]
42 It is easy to show that a feature graph generalizes a junction tree and in fact any model that can be represented using a junction tree having treewidth k can also be represented by a feature graph that uses only O(n × 2^k) space [11]. [sent-103, score-1.578]
43 In some cases, a feature graph can be exponentially smaller than a junction tree because it can capture context-specific independence [6]. [sent-104, score-0.811]
44 A feature tree can be easily converted to a Markov network. [sent-105, score-0.379]
45 The corresponding Markov network has one feature for each leaf node, formed by conjoining all feature assignments from the root to the leaf. [sent-106, score-0.852]
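Continuing the sketch above (same hypothetical FNode/ANode classes), the conversion to a Markov network can be illustrated by enumerating one conjoined feature per leaf A-node:

```python
def leaf_features(fnode, path=()):
    """Yield one (conjoined_feature, weight) pair per leaf A-node reachable from `fnode`.

    Each yielded feature is the sequence of (feature, assignment) pairs collected on the
    path from the root to that leaf, matching the construction described in the text.
    """
    for assignment, anode in ((1, fnode.true_child), (0, fnode.false_child)):
        if anode is None:
            continue
        new_path = path + ((fnode.feature, assignment),)
        if not anode.children:                  # leaf A-node -> one Markov network feature
            yield new_path, anode.weight
        else:
            for child in anode.children:        # recurse into the independent sub-problems
                yield from leaf_features(child, new_path)
```

For the tree of Figure 1(a), the rightmost leaf would yield the path ((x1 ∧ x2, 1), (x4 ∧ x6, 1)), i.e. the Markov network feature (x1 ∧ x2) ∧ (x4 ∧ x6) from sentence 26.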
46 The following example demonstrates the relationship between a feature tree, a Markov network and a junction tree. [sent-107, score-0.545]
47 Figure 1(b) shows the Markov network corresponding to the leaf features of the feature tree given in Figure 1(a). [sent-110, score-0.617]
48 Figure 1(c) shows the junction tree for the Markov network given in Figure 1(b). [sent-111, score-0.568]
49 Notice that because the feature tree uses context-specific independence, all the F -nodes in the feature tree have a feature of size bounded by 2 while the maximum clique size of the junction tree is 5. [sent-112, score-1.571]
50 The junction tree given in Figure 1(c) requires 2^5 × 2 = 64 potential values while the feature tree given in Figure 1(a) requires only 10 A-nodes. [sent-113, score-0.883]
51 In this paper, we will present structure learning algorithms to learn feature trees only. [sent-114, score-0.225]
52 We can do this without loss of generality, because a feature graph can be constructed by caching information and merging identical nodes, while learning (constructing) a feature tree. [sent-115, score-0.435]
53 The distribution represented by a feature tree ST can be defined procedurally as follows (for more details see [11]). [sent-116, score-0.379]
54 The value of the leaf A-node Al is w(Al ) × #(M (Al )) where #(M (Al )) is the number of (full) variable assignments that satisfy the constraint M (Al ) formed by conjoining the feature-assignments from the root to Al . [sent-121, score-0.432]
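The count #(M(Al)) can be made concrete with a naive enumerator (exponential in the number of variables and used purely for illustration; the paper does not compute it this way):

```python
from itertools import product

def count_models(path, all_vars):
    """#(M(A_l)): number of full variable assignments satisfying the constraint obtained by
    conjoining the feature-assignments on `path`, where `path` is a sequence of
    (feature, value) pairs, `feature` maps variables to required literal values
    (e.g. {"x1": 1, "x2": 0} for x1 ∧ ¬x2), and `value` is 1 for the feature, 0 for its negation.
    """
    def holds(full, feature):
        return all(full[v] == required for v, required in feature.items())

    count = 0
    for bits in product((0, 1), repeat=len(all_vars)):
        full = dict(zip(all_vars, bits))
        if all(holds(full, feat) == bool(val) for feat, val in path):
            count += 1
    return count

# e.g. count_models([({"x1": 1, "x2": 1}, 1), ({"x4": 1, "x6": 1}, 1)],
#                   ["x1", "x2", "x3", "x4", "x5", "x6"]) == 4   (x3 and x5 are free)
```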
55 3 Learning Efficient Structure. Algorithm 1 (LMIP: Low Mutual Information Partitioning). Input: a variable set V , sample data D, a mutual information subroutine I, a feature assignment F , a threshold δ, and a max set size q. [sent-126, score-0.494]
56 Output: return QF . We propose a feature-based structure learning algorithm that searches for a feature that divides the configuration space into subspaces. [sent-133, score-0.251]
57 We will assume that the selected feature or its negation divides the (remaining) variables into conditionally independent partitions (we don’t require this assumption to be always satisfied, as we explain in the section on greedy heuristics and implementation details). [sent-134, score-0.338]
58 Therefore, as in previous work [21, 8], we instead use conditional mutual information, denoted by I, to partition the set of variables. [sent-136, score-0.166]
59 For this we use the LMIP subroutine (see Algorithm 1), a variant of Chechetka and Guestrin’s [8] LTCI algorithm that outputs a partitioning of V . [sent-137, score-0.178]
60 The runtime guarantees of LMIP follow from those of LTCI and correctness guarantees follow in an analogous fashion. [sent-138, score-0.156]
61 In general, estimating mutual information between sets of random variables has time and sample complexity exponential in the number of variables considered. [sent-139, score-0.228]
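As an illustration of why the cost grows with set size, here is a plug-in estimator of conditional mutual information from binary samples; it is a generic stand-in for the subroutine I, not the estimator used in the paper.

```python
from collections import Counter
from math import log

def empirical_cmi(samples, A, B, condition):
    """Plug-in estimate (in nats) of I(A; B | F) from binary data.

    samples:   list of dicts mapping variable name -> 0/1.
    A, B:      disjoint lists of variable names (kept small, e.g. |A| + |B| <= q).
    condition: predicate on a sample, True iff the sample is consistent with the
               conditioning feature assignment (F = 1 or F = 0).
    """
    rows = [s for s in samples if condition(s)]
    n = len(rows)
    if n == 0:
        return 0.0
    joint = Counter((tuple(s[v] for v in A), tuple(s[v] for v in B)) for s in rows)
    count_a = Counter(a for a, _ in joint.elements())
    count_b = Counter(b for _, b in joint.elements())
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), with empirical frequencies
        mi += (c / n) * log(c * n / (count_a[a] * count_b[b]))
    return mi
```

The joint table has up to 2^(|A|+|B|) cells, which is the exponential dependence on set size mentioned above; LMIP therefore only ever asks for mutual information over sets of combined cardinality at most q.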
62 Given a feature assignment F , a distribution P (V ) is (j, ǫ, F )-coverable if there exists a set of cliques C such that for every Ci ∈ C, |Ci | ≤ j and I(Ci , V \ Ci |F ) ≤ ǫ. [sent-143, score-0.3]
63 Similarly, given a feature F , a distribution P (V ) is (j, ǫ, F )-coverable if it is both (j, ǫ, F = 0)-coverable and (j, ǫ, F = 1)-coverable. [sent-144, score-0.178]
64 The time and space complexity of LMIP is O( (n choose q) × n × J_q^I ) where J_q^I is the time complexity of estimating the mutual information between two disjoint sets which have combined cardinality q. [sent-157, score-0.182]
65 Note that our actual algorithm will use a subroutine that estimates mutual information from data, and the time complexity of this routine will be described in the section on sample complexity and probabilistic performance guarantees. [sent-158, score-0.335]
66 Algorithm 2 (LEM: Learning Efficient Markov Networks). Input: a variable set V , sample data S, a mutual information subroutine I, a feature length k, a set size parameter q, a threshold δ, and an A-node A. [sent-159, score-0.421]
67 Next, we present our structure learning algorithm called LEM (see Algorithm 2) which utilizes the LMIP subroutine to learn feature trees from data. [sent-163, score-0.345]
68 First, it runs the LMIP subroutine on all possible features of length k constructible from V . [sent-167, score-0.212]
69 Recall that given a feature assignment F , the LMIP sub-routine partitions the variables into (approximately) conditionally independent components. [sent-168, score-0.296]
70 It then selects a feature G having the highest score. [sent-169, score-0.178]
71 Intuitively, to reduce the inference time and the size of the model, we should try to balance the trade-off between increasing the number of partitions and maintaining partition size uniformity (namely, we would want the partition sizes to be almost equal). [sent-170, score-0.211]
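The extracted sentences do not give the exact scoring function, so the following is only a hypothetical scoring rule reflecting the stated intuition (more parts are good, badly unbalanced parts are not):

```python
def partition_score(partition):
    """Illustrative score for a partition QF = [S1, ..., Sr] of the remaining variables.
    NOT the paper's score; it simply rewards many parts and penalizes size imbalance."""
    if len(partition) <= 1:
        return 0.0                                # no useful split
    sizes = [len(S) for S in partition]
    imbalance = max(sizes) - min(sizes)
    return len(partition) - 0.1 * imbalance       # the 0.1 trade-off weight is arbitrary
```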
72 After selecting a feature G, the algorithm creates an F-node corresponding to G and two child A-nodes corresponding to the true and the false assignments of G. [sent-176, score-0.344]
73 Then, corresponding to each element of QG=1 , it recursively creates a child node for G = 1 (and similarly for G = 0 using QG=0 ). [sent-177, score-0.164]
74 In this case, no partitioning of V exists for either or both of the value assignments of G, and therefore we return a feature tree which has 2^{|V|} leaf A-nodes corresponding to all possible instantiations of the remaining variables. [sent-179, score-0.058]
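Putting the pieces together, a control-flow sketch of the LEM recursion could look like the following; the callables are hypothetical stand-ins supplied by the caller, and none of this is the authors' code.

```python
def lem(variables, data, lmip, candidate_features, score):
    """Sketch of the Algorithm 2 (LEM) control flow. Stand-in callables:
      candidate_features(variables)        -> iterable of candidate features G
      lmip(variables, data, (G, value))    -> a partition of `variables` (list of subsets)
      score(partition)                     -> float, larger is better
    Restricting the data to samples consistent with each assignment of G is elided here.
    Returns a nested tuple: ("LEAVES", vars) or (G, subtrees_for_G1, subtrees_for_G0).
    """
    best = None
    for G in candidate_features(variables):
        q1 = lmip(variables, data, (G, 1))
        q0 = lmip(variables, data, (G, 0))
        s = score(q1) + score(q0)
        if best is None or s > best[0]:
            best = (s, G, q1, q0)

    if best is None:
        return ("LEAVES", tuple(variables))

    _, G, q1, q0 = best
    if len(q1) <= 1 or len(q0) <= 1:
        # Base case from the text: no useful partition for some assignment of G,
        # so the remaining variables are expanded into leaf A-nodes.
        return ("LEAVES", tuple(variables))

    subtrees_g1 = [lem(part, data, lmip, candidate_features, score) for part in q1]
    subtrees_g0 = [lem(part, data, lmip, candidate_features, score) for part in q0]
    return (G, subtrees_g1, subtrees_g0)
```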
75 Intuitively, if there exists a feature F such that the distribution P (V ) at each recursive call to LEM is (j, ǫ, F )-coverable, then the LMIP sub-routine is guaranteed to return at least a two-way partitioning of V . [sent-185, score-0.312]
76 Then, LEM is guaranteed to find this unique feature tree. [sent-187, score-0.178]
77 Therefore, we want this coverability requirement to hold not only recursively but also for each candidate feature (at each recursive call). [sent-189, score-0.286]
78 For every feature F , and each assignment F of F , such that |V (F )| ≤ m, P (V ) is (j, ǫ, F )-coverable and for any partitioning S1 , . [sent-193, score-0.309]
79 Given a distribution P (V ) that satisfies the (j, ǫ, m, true)-assumption and a perfect mutual information oracle I, LEM(V , S, I, k, j + 1, δ) returns a feature tree ST such that each leaf feature of ST satisfies the nested context independence condition for (ǫ, j × δ). [sent-209, score-0.89]
80 3.1 Sample Complexity and Probabilistic Performance Guarantees. The foregoing analysis relies on a perfect, deterministic mutual information subroutine I. [sent-212, score-0.214]
81 In reality, all we have is sample data and probabilistic mutual information subroutines. [sent-213, score-0.127]
82 Given a mutual information subroutine I implied by Lemma 4, LEM(V , S, I, m, j + 1, ǫ + ∆) returns a feature tree, the leaves of which satisfy the nested context independence condition for (ǫ, j × (ǫ + ∆)), with probability 1 − γ. [sent-225, score-0.509]
83 The most expensive step in LEM is the LMIP sub-routine which is called O(n^k) times at each A-node of the feature graph. [sent-227, score-0.178]
84 The greedy heuristic is able to split on arbitrarily long features by only calling LMIP k × n times instead of O(n^k) times, but does not have any guarantees. [sent-240, score-0.129]
85 It starts with the atomic features (i.e., just the variables in the domain), runs LMIP on each, and selects the (best) feature with the highest score. [sent-243, score-0.223]
86 Then, it creates candidate features by conjoining this best feature from the previous step with each atomic feature, runs LMIP on each, and then selects a best feature for the next iteration. [sent-244, score-0.692]
87 Here, given a set of features with similar scores, we select a feature F such that the difference between the scores of F = 0 and F = 1 is the smallest. [sent-249, score-0.23]
88 The intuition behind this heuristic is that by maintaining balance we reduce the height of the feature graph and thus its size. [sent-250, score-0.253]
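Below is a sketch of the greedy feature growth just described, reusing the hypothetical `lmip` and `score` callables from the LEM sketch; the tie-breaking here applies the balance rule literally, whereas the text applies it only among features with similar scores.

```python
def greedy_feature(variables, data, lmip, score, max_len):
    """Grow a single feature one literal at a time: roughly k*n LMIP calls instead of O(n^k)."""
    def evaluate(feature):
        s0 = score(lmip(variables, data, (feature, 0)))
        s1 = score(lmip(variables, data, (feature, 1)))
        return s0 + s1, abs(s0 - s1)             # total score, imbalance for tie-breaking

    atoms = [{v: val} for v in variables for val in (0, 1)]   # literals x and ¬x
    best = max(atoms, key=lambda f: (evaluate(f)[0], -evaluate(f)[1]))

    for _ in range(max_len - 1):
        candidates = []
        for atom in atoms:
            (v, val), = atom.items()
            if v in best:                        # do not reuse a variable already in the feature
                continue
            cand = dict(best)
            cand[v] = val
            candidates.append(cand)
        if not candidates:
            break
        best = max(candidates, key=lambda f: (evaluate(f)[0], -evaluate(f)[1]))
    return best
```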
89 Finally, in our implementation, we do not return all possible instantiations of the variables when a feature assignment yields only one partition, unless the number of remaining variables is smaller than 5. [sent-251, score-0.412]
90 This is because even though a feature may not partition the set of variables, it may still partition the data, thereby reducing complexity. [sent-252, score-0.26]
91 Figure 2(f) lists the five data sets and the number of atomic features in each. [sent-254, score-0.166]
92 We compare against the algorithm of Della Pietra et al. [24] and the lazy thin-junction tree algorithm (LPACJT) of Chechetka and Guestrin [8]. [sent-260, score-0.201]
93 We were unable to compute the entropies in the required format because they use proprietary software to which we did not have access, and therefore we use the results provided by the authors for the temperature, traffic and alarm domains. [sent-267, score-0.129]
94 For the sensor networks, traffic and alarm domains, we use the test set sizes provided in Chechetka and Guestrin [8]. [sent-278, score-0.129]
95 The size of the feature graphs learned by LEM ranged from O(n^2) to O(n^3), comparable to those generated by LPACJT. [sent-291, score-0.242]
96 Exact inference on the learned feature graphs was a matter of milliseconds. [sent-292, score-0.284]
97 6 Conclusions We have presented an algorithm for learning a class of high-treewidth Markov networks that admit tractable inference and closed-form parameter learning. [sent-300, score-0.178]
98 This class is much richer than thin junction trees because it exploits context-specific independence and determinism. [sent-301, score-0.511]
99 Although learning may be slow, inference always has quick and predictable runtime, which is linear in the size of the feature graph. [sent-305, score-0.278]
100 The alarm monitoring system: A case study with two probabilistic inference techniques for belief networks. [sent-326, score-0.2]
wordName wordTfidf (topN-words)
[('lem', 0.42), ('lmip', 0.339), ('junction', 0.303), ('tree', 0.201), ('dl', 0.195), ('qf', 0.181), ('feature', 0.178), ('conjoining', 0.14), ('lpacjt', 0.14), ('qg', 0.129), ('alarm', 0.129), ('treewidth', 0.126), ('leaf', 0.122), ('subroutine', 0.12), ('efinition', 0.12), ('atomic', 0.114), ('markov', 0.104), ('msnbc', 0.1), ('mutual', 0.094), ('ci', 0.092), ('child', 0.091), ('della', 0.087), ('pietra', 0.087), ('independence', 0.085), ('al', 0.082), ('emma', 0.08), ('guarantees', 0.078), ('thin', 0.076), ('assignments', 0.075), ('qi', 0.074), ('assignment', 0.073), ('networks', 0.073), ('traf', 0.071), ('inference', 0.071), ('fi', 0.068), ('lowd', 0.064), ('network', 0.064), ('adult', 0.059), ('partitioning', 0.058), ('guestrin', 0.056), ('satis', 0.055), ('domingos', 0.055), ('formed', 0.054), ('gi', 0.053), ('chechetka', 0.052), ('features', 0.052), ('si', 0.051), ('cliques', 0.049), ('determinism', 0.048), ('trees', 0.047), ('greedy', 0.046), ('temperature', 0.046), ('decomposable', 0.045), ('fr', 0.045), ('variables', 0.045), ('complexity', 0.044), ('graph', 0.044), ('clique', 0.043), ('recursive', 0.042), ('root', 0.041), ('partition', 0.041), ('st', 0.041), ('chechtka', 0.04), ('constructible', 0.04), ('heorem', 0.04), ('ovals', 0.04), ('divides', 0.039), ('cj', 0.038), ('twentieth', 0.038), ('node', 0.037), ('instantiations', 0.037), ('minx', 0.037), ('sc', 0.037), ('domains', 0.036), ('recursively', 0.036), ('recursion', 0.036), ('literals', 0.035), ('jq', 0.035), ('ltci', 0.035), ('graphs', 0.035), ('merging', 0.035), ('score', 0.034), ('intelligence', 0.034), ('return', 0.034), ('tractable', 0.034), ('probabilistic', 0.033), ('capital', 0.032), ('queyranne', 0.032), ('repository', 0.032), ('nested', 0.032), ('bayesian', 0.031), ('heuristic', 0.031), ('cial', 0.031), ('denoted', 0.031), ('candidate', 0.03), ('negation', 0.03), ('bounded', 0.03), ('arti', 0.029), ('size', 0.029), ('irvine', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
2 0.16050199 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
3 0.13262792 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
Author: Daniel Lowd, Pedro Domingos
Abstract: Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2 -F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2 -G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines. 1
4 0.12462141 118 nips-2010-Implicit Differentiation by Perturbation
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
5 0.12025722 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
6 0.11879946 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
7 0.1168851 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
8 0.10538703 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
9 0.093376845 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
10 0.083698168 63 nips-2010-Distributed Dual Averaging In Networks
11 0.080572948 108 nips-2010-Graph-Valued Regression
12 0.080364265 159 nips-2010-Lifted Inference Seen from the Other Side : The Tractable Features
13 0.071823105 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
14 0.068952687 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points
15 0.063750498 283 nips-2010-Variational Inference over Combinatorial Spaces
16 0.063297816 94 nips-2010-Feature Set Embedding for Incomplete Data
17 0.060162418 150 nips-2010-Learning concept graphs from text with stick-breaking priors
18 0.060114421 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
19 0.059008118 223 nips-2010-Rates of convergence for the cluster tree
20 0.057089809 155 nips-2010-Learning the context of a category
topicId topicWeight
[(0, 0.194), (1, 0.037), (2, 0.051), (3, 0.034), (4, -0.128), (5, -0.048), (6, -0.025), (7, 0.024), (8, 0.087), (9, 0.014), (10, -0.125), (11, 0.008), (12, -0.091), (13, 0.001), (14, -0.012), (15, -0.132), (16, -0.026), (17, -0.085), (18, -0.088), (19, -0.071), (20, -0.101), (21, -0.013), (22, 0.11), (23, -0.195), (24, -0.045), (25, 0.034), (26, 0.014), (27, 0.055), (28, -0.035), (29, -0.0), (30, -0.017), (31, -0.062), (32, -0.072), (33, -0.026), (34, 0.058), (35, 0.098), (36, -0.029), (37, 0.003), (38, -0.051), (39, 0.024), (40, 0.037), (41, 0.028), (42, 0.118), (43, -0.01), (44, -0.03), (45, -0.193), (46, -0.091), (47, -0.007), (48, 0.078), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.92932445 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
2 0.70956361 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
Author: Nebojsa Jojic, Chris Meek, Jim C. Huang
Abstract: Many problem domains including climatology and epidemiology require models that can capture both heavy-tailed statistics and local dependencies. Specifying such distributions using graphical models for probability density functions (PDFs) generally leads to intractable inference and learning. Cumulative distribution networks (CDNs) provide a means to tractably specify multivariate heavy-tailed models as a product of cumulative distribution functions (CDFs). Existing algorithms for inference and learning in CDNs are limited to those with tree-structured (nonloopy) graphs. In this paper, we develop inference and learning algorithms for CDNs with arbitrary topology. Our approach to inference and learning relies on recursively decomposing the computation of mixed derivatives based on a junction tree over the cumulative distribution functions. We demonstrate that our systematic approach to utilizing the sparsity represented by the junction tree yields significant performance improvements over the general symbolic differentiation programs Mathematica and D*. Using two real-world datasets, we demonstrate that non-tree structured (loopy) CDNs are able to provide significantly better fits to the data as compared to tree-structured and unstructured CDNs and other heavy-tailed multivariate distributions such as the multivariate copula and logistic models.
3 0.70160317 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
Author: Daniel Lowd, Pedro Domingos
Abstract: Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2 -F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2 -G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines. 1
4 0.62757891 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
5 0.60484481 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
Author: Anton Chechetka, Carlos Guestrin
Abstract: We present a simple and effective approach to learning tractable conditional random fields with structure that depends on the evidence. Our approach retains the advantages of tractable discriminative models, namely efficient exact inference and arbitrarily accurate parameter learning in polynomial time. At the same time, our algorithm does not suffer a large expressive power penalty inherent to fixed tractable structures. On real-life relational datasets, our approach matches or exceeds state of the art accuracy of the dense models, and at the same time provides an order of magnitude speedup. 1
6 0.60258734 159 nips-2010-Lifted Inference Seen from the Other Side : The Tractable Features
7 0.57063621 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
8 0.55820388 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
9 0.52595907 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
10 0.52524555 283 nips-2010-Variational Inference over Combinatorial Spaces
11 0.47989711 108 nips-2010-Graph-Valued Regression
12 0.47571844 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
13 0.45673698 118 nips-2010-Implicit Differentiation by Perturbation
14 0.45531255 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
15 0.44940642 1 nips-2010-(RF)^2 -- Random Forest Random Field
16 0.44549453 71 nips-2010-Efficient Relational Learning with Hidden Variable Detection
17 0.43545189 168 nips-2010-Monte-Carlo Planning in Large POMDPs
18 0.38573924 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
19 0.3760705 190 nips-2010-On the Convexity of Latent Social Network Inference
20 0.37517038 63 nips-2010-Distributed Dual Averaging In Networks
topicId topicWeight
[(13, 0.042), (17, 0.014), (18, 0.238), (24, 0.01), (27, 0.058), (30, 0.076), (35, 0.023), (45, 0.281), (50, 0.037), (52, 0.02), (60, 0.03), (77, 0.053), (78, 0.024), (90, 0.018)]
simIndex simValue paperId paperTitle
1 0.91032457 267 nips-2010-The Multidimensional Wisdom of Crowds
Author: Peter Welinder, Steve Branson, Pietro Perona, Serge J. Belongie
Abstract: Distributing labeling tasks among hundreds or thousands of annotators is an increasingly important method for annotating large datasets. We present a method for estimating the underlying value (e.g. the class) of each image from (noisy) annotations provided by multiple annotators. Our method is based on a model of the image formation and annotation process. Each image has different characteristics that are represented in an abstract Euclidean space. Each annotator is modeled as a multidimensional entity with variables representing competence, expertise and bias. This allows the model to discover and represent groups of annotators that have different sets of skills and knowledge, as well as groups of images that differ qualitatively. We find that our model predicts ground truth labels on both synthetic and real data more accurately than state of the art methods. Experiments also show that our model, starting from a set of binary labels, may discover rich information, such as different “schools of thought” amongst the annotators, and can group together images belonging to separate categories. 1
same-paper 2 0.85663295 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
3 0.85140568 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
Author: Ling Huang, Jinzhu Jia, Bin Yu, Byung-gon Chun, Petros Maniatis, Mayur Naik
Abstract: Predicting the execution time of computer programs is an important but challenging problem in the community of computer systems. Existing methods require experts to perform detailed analysis of program code in order to construct predictors or select important features. We recently developed a new system to automatically extract a large number of features from program execution on sample inputs, on which prediction models can be constructed without expert knowledge. In this paper we study the construction of predictive models for this problem. We propose the SPORE (Sparse POlynomial REgression) methodology to build accurate prediction models of program performance using feature data collected from program execution on sample inputs. Our two SPORE algorithms are able to build relationships between responses (e.g., the execution time of a computer program) and features, and select a few from hundreds of the retrieved features to construct an explicitly sparse and non-linear model to predict the response variable. The compact and explicitly polynomial form of the estimated model could reveal important insights into the computer program (e.g., features and their non-linear combinations that dominate the execution time), enabling a better understanding of the program’s behavior. Our evaluation on three widely used computer programs shows that SPORE methods can give accurate prediction with relative error less than 7% by using a moderate number of training data samples. In addition, we compare SPORE algorithms to state-of-the-art sparse regression algorithms, and show that SPORE methods, motivated by real applications, outperform the other methods in terms of both interpretability and prediction accuracy.
4 0.83171612 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
Author: Fuxin Li, Cristian Sminchisescu
Abstract: We propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. This is casted as joint estimation of both a likelihood ratio predictor and the target (likelihood ratio variable) for instances. Theoretically, we prove a quantitative relationship between the risk estimated under the 0-1 classification loss, and under a loss function for likelihood ratio. It is shown that likelihood ratio estimation is generally a good surrogate for the 0-1 loss, and separates positive and negative instances well. The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. Experiments on synthetic and real datasets demonstrate the competitiveness of the approach.
5 0.78559476 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
6 0.78418583 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
7 0.78415388 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
8 0.78325129 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
9 0.78297359 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
10 0.7828806 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
11 0.7825197 243 nips-2010-Smoothness, Low Noise and Fast Rates
12 0.78238922 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
13 0.7821483 27 nips-2010-Agnostic Active Learning Without Constraints
14 0.78214604 155 nips-2010-Learning the context of a category
15 0.78207999 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
16 0.78207755 290 nips-2010-t-logistic regression
17 0.78153074 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
18 0.78121823 1 nips-2010-(RF)^2 -- Random Forest Random Field
19 0.78109092 61 nips-2010-Direct Loss Minimization for Structured Prediction
20 0.78086364 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls