nips nips2002 nips2002-203 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
Reference: text
sentIndex sentText sentNum sentScore
1 It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. [sent-6, score-0.621]
2 However, for large data-sets it is the construction of the correlation matrix that dominates the running time. [sent-7, score-0.147]
3 We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. [sent-8, score-0.532]
4 The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. [sent-9, score-0.373]
5 The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. [sent-10, score-0.352]
6 However, the problem of constructing Bayes’ nets from data remains a hard one, requiring search in a super-exponential space of possible graph structures. [sent-14, score-0.19]
7 Namely, we look at dependency trees, which are belief networks that satisfy the additional constraint that each node has at most one parent. [sent-17, score-0.162]
8 In this simple case it has been shown [2] that finding the tree that maximizes the data likelihood is equivalent to finding a minimum-weight spanning tree in the attribute graph, where edge weights are derived from the mutual information of the corresponding attribute pairs. [sent-18, score-1.447]
9 Once the weight matrix is constructed, executing a minimum spanning tree (MST) algorithm is fast. [sent-21, score-0.501]
10 The time-consuming part is the population of the weight matrix, which takes time O(Rn²) for R records and n attributes. [sent-22, score-0.289]
11 This becomes expensive when considering datasets with hundreds of thousands of records and hundreds of attributes. [sent-23, score-0.382]
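For concreteness, a minimal Python sketch of this naive baseline for the jointly Gaussian case discussed later is given below. It is illustrative code, not the authors' implementation, and the function name is hypothetical; it populates every pairwise weight from all R records before handing the graph to an off-the-shelf spanning-tree routine.

    import networkx as nx
    import numpy as np

    def naive_dependency_tree(data):
        """Naive O(R n^2) construction.  `data` is an (R, n) array of R records
        over n numeric attributes; the edge weight is |correlation| and the
        maximum spanning tree of the resulting graph is the dependency tree."""
        R, n = data.shape
        G = nx.Graph()
        for i in range(n):
            for j in range(i + 1, n):
                rho = np.corrcoef(data[:, i], data[:, j])[0, 1]  # reads all R records
                G.add_edge(i, j, weight=abs(rho))
        return nx.maximum_spanning_tree(G)

Every one of the n(n − 1)/2 entries touches all R records; avoiding most of that work is the point of the interleaved algorithm described next.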
12 To overcome this problem, we propose a new way of interleaving the spanning tree construction with the operations needed to compute the mutual information coefficients. [sent-24, score-0.579]
13 This algorithm is capable of using partial knowledge about edge weights and of signaling the need for more accurate information regarding a particular edge. [sent-26, score-0.586]
14 The partial information we maintain is in the form of probabilistic confidence intervals on the edge weights; an interval is derived by looking at a sub-sample of the data for a particular attribute pair. [sent-27, score-0.862]
15 Whenever the algorithm signals that a currently-known interval is too wide, we inspect more data records in order to shrink it. [sent-28, score-0.61]
16 Once the interval is small enough, we may be able to prove that the corresponding edge is not a part of the tree. [sent-29, score-0.505]
17 Whenever such an edge can be eliminated without looking at the full data-set, the work associated with the remainder of the data is saved. [sent-30, score-0.464]
18 We have implemented the algorithm for numeric and categorical data and tested it on real and synthetic data-sets containing hundreds of attributes and millions of records. [sent-32, score-0.6]
19 The resulting trees are, in most cases, of near-identical quality to the ones grown by the naive algorithm. [sent-34, score-0.124]
20 In the context of dependency trees, Meila [11] discusses the discrete case that frequently comes up in text-mining applications, where the attributes are sparse in the sense that only a small fraction of them is true for any record. [sent-38, score-0.343]
21 The number of data records is R, the number of attributes n. [sent-41, score-0.48]
22 We denote by ρ_xy the correlation coefficient between attributes x and y, and omit the subscript when it is clear from the context. [sent-43, score-0.234]
23 2 A slow minimum-spanning tree algorithm. We begin by describing our MST algorithm. [sent-44, score-0.35]
24 We then proceed to describe its use in the case where some edge weights are not known exactly, but only to lie within a given interval. [sent-46, score-0.391]
25 In the following discussion we assume we are given a complete graph with n nodes, and the task is to find a tree connecting all of its nodes such that the total tree weight (defined to be the sum of the weights of its edges) is minimized. [sent-48, score-0.682]
26 We start with a rule to eliminate edges from consideration for the output tree. [sent-50, score-0.334]
27 Following [5], we state the so-called “red-edge” rule. Theorem 1: The heaviest edge in any cycle in the graph is not part of the minimum spanning tree. To be precise, we will use it as a maximum spanning tree algorithm. [sent-51, score-0.974]
28 The two are interchangeable, requiring just a reversal of the edge weight comparison operator. [sent-52, score-0.41]
29 While |L̄| > n − 1 do: Pick an arbitrary edge e ∈ L̄ \ T. [sent-57, score-0.337]
30 Let e′ be the heaviest edge on the path in T between the endpoints of e. [sent-58, score-0.504]
31 L contains the set of edges that have been proven not to be in the MST, and L̄ contains the set of edges that still have some chance of being in the MST. [sent-63, score-0.298]
32 Traditionally, MST algorithms use this rule in conjunction with a greedy “blue-edge” rule, which chooses edges for inclusion in the tree. [sent-66, score-0.213]
33 In contrast, we will repeatedly use the red-edge rule until all but n − 1 edges have been eliminated. [sent-67, score-0.213]
34 The proof that this results in a minimum-spanning tree follows from [5]. [sent-68, score-0.27]
35 Denote by L the set of edges that have already been eliminated, and let L̄ = E \ L. [sent-70, score-0.149]
36 As a way to guide our search for edges to eliminate, we maintain the following invariant. Invariant 2: At any point there is a spanning tree T, which is composed of edges in L̄. [sent-71, score-0.956]
37 In each step, we arbitrarily choose some edge e in L̄ \ T and try to eliminate it using the red-edge rule. [sent-72, score-0.458]
38 It is clear we only need to compare e with the heaviest edge on the tree path P between its endpoints. [sent-75, score-0.459]
39 If e is heavier, we can eliminate it by the red-edge rule. [sent-76, score-0.121]
40 However, if it is lighter, then we can eliminate the heaviest path edge e′ by the same rule. [sent-77, score-0.728]
41 We do so and add e to the tree to preserve Invariant 2. [sent-78, score-0.27]
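A minimal Python sketch of this elimination loop with exact weights might look as follows; the helper names are hypothetical and this is not the authors' implementation. It is written for the minimum-weight formulation used in the text, and the paper's maximum variant simply flips the two comparisons.

    import itertools
    import networkx as nx

    def mist(nodes, weight):
        """Minimum spanning tree using only the red-edge rule.
        `weight(u, v)` returns the exact weight of edge (u, v).  Illustrative
        sketch, not the authors' implementation."""
        all_edges = set(frozenset(e) for e in itertools.combinations(nodes, 2))
        eliminated = set()                     # L: edges proven not to be in the tree
        # Invariant 2: T is always a spanning tree made of surviving edges only.
        T = nx.Graph()
        T.add_nodes_from(nodes)
        chain = list(nodes)
        T.add_edges_from(zip(chain, chain[1:]))    # any initial spanning tree will do
        while len(all_edges) - len(eliminated) > len(nodes) - 1:
            tree_edges = set(frozenset(t) for t in T.edges())
            e = next(iter(all_edges - eliminated - tree_edges))  # arbitrary edge in L-bar \ T
            u, v = tuple(e)
            path = nx.shortest_path(T, u, v)                     # tree path between its endpoints
            path_edges = list(zip(path, path[1:]))
            e_prime = max(path_edges, key=lambda t: weight(*t))  # heaviest path edge e'
            if weight(u, v) > weight(*e_prime):
                eliminated.add(e)                    # red-edge rule: the candidate is heaviest
            else:
                eliminated.add(frozenset(e_prime))   # the tree edge is heaviest: swap it out
                T.remove_edge(*e_prime)
                T.add_edge(u, v)                     # preserves Invariant 2
        return T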
42 The MIST algorithm can be applied directly to a graph where the edge weights are known exactly. [sent-80, score-0.522]
43 And like many other MST algorithms, it can also be used in the case where just the relative order of the edge weights is given. [sent-81, score-0.436]
44 Now imagine a different setup, where edge weights are not given, and instead an oracle exists that knows the exact values of the edge weights. [sent-82, score-0.847]
45 In this setup, MIST is still suited for finding a spanning tree while minimizing the number of queries issued. [sent-85, score-0.464]
46 Otherwise, it just ignores the “if” clause altogether and iterates (possibly with a different edge e). [sent-88, score-0.337]
47 For the moment, this setup may seem contrived, but in Section 4, we go back to the MIST algorithm and put it in a context very similar to the one described here. [sent-89, score-0.117]
48 3 Probabilistic bounds on mutual information. We now concentrate once again on the specific problem of determining the mutual information between a pair of attributes. [sent-90, score-0.154]
49 We show how to compute it given the complete data, and how to derive probabilistic confidence intervals for it, given just a sample of the data. [sent-91, score-0.167]
50 As shown in [12], the mutual information for two jointly Gaussian numeric attributes X and Y is I(X; Y) = −(1/2) ln(1 − ρ²), where the correlation coefficient ρ = ρ_XY = (Σ_{i=1..R} (x_i − x̄)(y_i − ȳ)) / (R σ̂_X σ̂_Y), with x̄, ȳ and σ̂²_X, σ̂²_Y being the sample means and variances for attributes X and Y. [sent-92, score-0.646]
51 Since I(X; Y) is monotonic in |ρ|, this is a sufficient condition for the use of |ρ| as the edge weight in an MST algorithm. [sent-94, score-0.374]
52 Consequently, the sample correlation can be used in a straightforward manner when the complete data is available. [sent-95, score-0.139]
53 We are trying to estimate Σ_{i=1..R} x_i · y_i given the partial sum Σ_{i=1..r} x_i · y_i for some r < R. [sent-98, score-0.25]
54 We can thus derive a two-sided confidence interval for Σ_i Z_i = Σ_i x_i · y_i that holds with probability 1 − δ, for some user-specified δ, typically 1%. [sent-103, score-0.302]
55 Given this interval, computing an interval for ρ is straightforward. [sent-104, score-0.168]
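A minimal sketch of this estimation step is given below. It assumes the attributes have been standardized and that the products x_i·y_i are bounded (clipped here so a Hoeffding bound applies); these simplifications, the constants, and the function names are illustrative and not taken from the paper.

    import math
    import numpy as np

    def rho_interval(x, y, r, delta=0.01, clip=3.0):
        """Two-sided confidence interval on the correlation coefficient rho from
        the first r of R records, via a Hoeffding bound on the partial sum of
        products.  Assumes x and y are standardized (zero mean, unit variance)."""
        z = np.clip(x[:r] * y[:r], -clip, clip)      # bounded terms Z_i = x_i * y_i
        # Hoeffding: with prob. 1 - delta the sample mean of r terms in
        # [-clip, clip] lies within eps of its expectation.
        eps = 2.0 * clip * math.sqrt(math.log(2.0 / delta) / (2.0 * r))
        mean = z.sum() / r                           # for standardized data this estimates rho
        return max(mean - eps, -1.0), min(mean + eps, 1.0)

    def mi_interval(rho_lo, rho_hi):
        """Map a rho interval to an I(X;Y) interval; I = -0.5*ln(1 - rho^2) is
        monotone in |rho|, so only the interval endpoints matter."""
        lo_abs = 0.0 if rho_lo <= 0.0 <= rho_hi else min(abs(rho_lo), abs(rho_hi))
        hi_abs = max(abs(rho_lo), abs(rho_hi))
        return -0.5 * math.log1p(-lo_abs ** 2), -0.5 * math.log1p(-hi_abs ** 2)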
56 4 The full algorithm. As we argued, the MIST algorithm is capable of using partial information about edge weights. [sent-106, score-0.612]
57 We have also shown how to derive confidence intervals on edge weights. [sent-107, score-0.445]
58 We initialize the tree T in the following heuristic way: first we take a small sub-sample of the data, and derive point estimates for the edge weights from it. [sent-110, score-0.739]
59 Then we feed the point estimates to a MST algorithm and obtain a tree T . [sent-111, score-0.35]
60 When we come to compare edge weights, we generally need to deal with two intervals. [sent-112, score-0.337]
61 We apply this logic to all comparisons, where the goal is to determine the heaviest path edge e′ and to compare it to the candidate e. [sent-114, score-0.571]
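The pairwise comparison can be sketched as a three-way primitive over confidence intervals; this is illustrative code, and the optional epsilon trim anticipates the adjustment described a few sentences below.

    def compare_intervals(a, b, epsilon=0.0):
        """Compare two edge-weight intervals a = (a_lo, a_hi) and b = (b_lo, b_hi).
        Each interval is first trimmed by epsilon, as described in the text.
        Returns 'heavier', 'lighter', or 'inconclusive'."""
        a_lo, a_hi = a[0] + epsilon, a[1] - epsilon
        b_lo, b_hi = b[0] + epsilon, b[1] - epsilon
        if a_lo > b_hi:
            return 'heavier'        # a is certainly heavier than b
        if a_hi < b_lo:
            return 'lighter'        # a is certainly lighter than b
        return 'inconclusive'       # the intervals overlap: more data is needed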
62 If we are lucky enough that all of these comparisons are conclusive, the amount of work we save is related to how much data was used in computing the confidence intervals — the rest of the data for the attribute-pair that is represented by the eliminated edge can be ignored. [sent-115, score-0.595]
63 The price we need to pay here is looking at more data to shrink the confidence intervals. [sent-119, score-0.152]
64 We do this by choosing one edge — either a tree-path edge or the candidate edge — for “promotion”, and doubling the sample size used to compute the sufficient statistics for it. [sent-120, score-1.137]
65 After doing so we try to eliminate again (since we can do this at no additional cost). [sent-121, score-0.121]
66 If we fail to eliminate we iterate, possibly choosing a different candidate edge (and the corresponding tree path) this time. [sent-122, score-0.795]
67 The choice of which edge to promote is heuristic, and depends on the expected success of resolution once the interval has shrunk. [sent-123, score-0.505]
68 Consider the comparison of the path-heaviest edge to an estimate of a candidate edge. [sent-126, score-0.404]
69 The candidate edge’s confidence interval may be very small, and yet still intersect the interval that is the heavy edge’s weight (this would happen if, for example, both attribute-pairs have the same distribution). [sent-127, score-0.493]
70 We may be able to reduce the amount of work by pretending the interval is narrower than it really is. [sent-128, score-0.236]
71 We therefore trim the interval by a constant, parameterized by the user as ε, before performing the comparison. [sent-129, score-0.168]
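Putting these pieces together, one attempted elimination round might be sketched as follows, using compare_intervals from the earlier sketch; the interval store, the promotion rule, and all helper names are illustrative stand-ins for the scheme described in the text, not the authors' implementation.

    def try_eliminate(e, path_edges, intervals, more_data, epsilon, max_rounds=20):
        """Attempt to apply the red-edge rule to candidate edge e against the
        tree path, with interval-valued weights.  `intervals[edge]` holds the
        current (lo, hi) interval and `more_data(edge)` doubles that edge's
        sample and returns a tighter interval."""
        for _ in range(max_rounds):
            comp = lambda a, b: compare_intervals(intervals[a], intervals[b], epsilon)
            # candidate certainly heavier than every path edge: drop the candidate
            if all(comp(e, t) == 'heavier' for t in path_edges):
                return 'drop_candidate', e
            # some tree edge certainly heavier than the candidate and its peers:
            # eliminate that tree edge and swap the candidate in
            for t in path_edges:
                if comp(t, e) == 'heavier' and all(
                        comp(t, u) == 'heavier' for u in path_edges if u != t):
                    return 'swap', t
            # inconclusive: promote the widest interval (a simple stand-in for the
            # promotion heuristic in the text) and try again
            widest = max([e] + list(path_edges),
                         key=lambda t: intervals[t][1] - intervals[t][0])
            intervals[widest] = more_data(widest)
        return 'give_up', None      # move on, possibly to a different candidate edge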
72 To construct the correlation matrix from the full data, each of the R records needs to be considered for each of the n(n − 1)/2 attribute pairs. [sent-142, score-0.399]
73 We evaluate the performance of our algorithm by adding up the number of records that were actually scanned over all attribute pairs, and dividing the total by R · n(n − 1)/2. [sent-143, score-0.332]
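In code, this normalized data-usage measure is simply the following; `scanned` is a hypothetical mapping from attribute pairs to the number of records actually read.

    def data_usage(scanned, R, n):
        """Total records read over all attribute pairs, divided by the cost of a
        full scan, R * n*(n-1)/2."""
        return sum(scanned.values()) / (R * n * (n - 1) / 2)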
74 Figure 2 shows that the amount of data the algorithm examines is a constant that does not depend on the size of the data-set. [sent-147, score-0.246]
75 Note that the reported usage is an average over the number of attributes. [sent-152, score-0.122]
76 However, this does not mean that the same amount of data was inspected for every attribute-pair — the algorithm determines how much effort to invest in each edge separately. [sent-153, score-0.557]
77 The running time is plotted against the number of data attributes in Figure 3. [sent-155, score-0.294]
78 A linear relation is clearly seen, meaning that (at least for this particular data-generation scheme) the algorithm is successful in doing work that is proportional to the number of tree edges. [sent-156, score-0.35]
79 For our algorithm the risk is making the wrong decision about which edges to include in the resulting tree. [sent-158, score-0.264]
80 In particular, we can just run the original algorithm on a small sample of the data, and use the generated tree. [sent-161, score-0.139]
81 Figure 2: Data usage (indicative of absolute running time), in attribute-pair units per attribute. Figure 3: Running time as a function of the number of attributes. [sent-164, score-0.631]
82 Figure 4: Relative log-likelihood vs. data usage. [sent-176, score-0.159]
83 To examine this effect we have generated data as above, then ran a 30-fold cross-validation test for the trees our algorithm generated. [sent-181, score-0.252]
84 We also ran a sample-based algorithm on each of the folds. [sent-182, score-0.13]
85 This variant behaves just like the full-data algorithm, but instead examines just the fraction of it that adds up to the total amount of data used by our algorithm. [sent-183, score-0.218]
86 We see that our algorithm outperforms the sample-based algorithm, even though they are both using the same total amount of data. [sent-185, score-0.148]
87 The reason is that using the same amount of data for all edges assumes all attribute-pairs have the same variance. [sent-186, score-0.254]
88 This is in contrast to our algorithm, which determines the amount of data for each edge independently. [sent-187, score-0.442]
89 Apparently for some edges this decision is very easy, requiring just a small sample. [sent-188, score-0.185]
90 The sample-based algorithm would not put more effort into those high-variance edges, eventually making the wrong decision. [sent-190, score-0.185]
91 The baseline (0) is the log-likelihood of the tree grown by the original algorithm using the full data. [sent-193, score-0.389]
92 Keep in mind that the sample-based algorithm has been given an unfair advantage, compared with MIST: it knows how much data it needs to look at. [sent-195, score-0.188]
93 The alternative is to use a fixed amount (specified either as a fraction or as an absolute count), which is likely to be too much or too little. [sent-198, score-0.12]
94 For each training fold, we ran our algorithm, followed by a sample-based algorithm that uses as much data as our algorithm did. [sent-218, score-0.247]
95 Table 1 shows whether the 99% confidence interval for the log-likelihood difference indicates that either of the algorithms outperforms the other. [sent-220, score-0.168]
96 6 Conclusion and future work. We have presented an algorithm that applies a “probably approximately correct” approach to dependency-tree construction for numeric and categorical data. [sent-224, score-0.3]
97 Experiments on data-sets with up to millions of records and hundreds of attributes show it is capable of processing massive data-sets in time that is constant in the number of records, with just a minor loss in output quality. [sent-225, score-0.645]
98 While we have derived formulas for both numeric and categorical data, we currently do not allow both types of attributes to be present in a single network. [sent-230, score-0.373]
99 An accelerated Chow and Liu algorithm: fitting tree distributions to high dimensional sparse data. [sent-285, score-0.27]
100 Using Tarjan’s red rule for fast dependency tree construction. [sent-291, score-0.504]
wordName wordTfidf (topN-words)
[('edge', 0.337), ('mist', 0.336), ('tree', 0.27), ('records', 0.252), ('spanning', 0.194), ('attributes', 0.191), ('mst', 0.186), ('interval', 0.168), ('edges', 0.149), ('dence', 0.137), ('heaviest', 0.122), ('tarjan', 0.122), ('usage', 0.122), ('eliminate', 0.121), ('domingos', 0.106), ('attribute', 0.104), ('dependency', 0.1), ('categorical', 0.097), ('heavier', 0.092), ('pedro', 0.092), ('numeric', 0.085), ('trees', 0.085), ('algorithm', 0.08), ('oracle', 0.08), ('geoff', 0.08), ('mutual', 0.077), ('shrink', 0.073), ('intervals', 0.068), ('amount', 0.068), ('candidate', 0.067), ('andrew', 0.067), ('running', 0.066), ('hundreds', 0.065), ('rule', 0.064), ('partial', 0.062), ('examines', 0.061), ('inconclusive', 0.061), ('olor', 0.061), ('pelleg', 0.061), ('con', 0.059), ('sample', 0.059), ('translates', 0.058), ('zi', 0.056), ('weights', 0.054), ('capable', 0.053), ('yi', 0.053), ('conclusive', 0.053), ('intersect', 0.053), ('chow', 0.053), ('fraction', 0.052), ('graph', 0.051), ('ran', 0.05), ('census', 0.048), ('hoeffding', 0.048), ('bayes', 0.048), ('eliminated', 0.048), ('millions', 0.045), ('path', 0.045), ('coef', 0.045), ('relative', 0.045), ('maintain', 0.044), ('net', 0.044), ('correlation', 0.043), ('marina', 0.043), ('looking', 0.042), ('xi', 0.041), ('derive', 0.04), ('knows', 0.039), ('red', 0.039), ('grown', 0.039), ('dan', 0.039), ('massive', 0.039), ('heuristic', 0.038), ('construction', 0.038), ('listed', 0.037), ('nets', 0.037), ('data', 0.037), ('setup', 0.037), ('weight', 0.037), ('requiring', 0.036), ('cient', 0.036), ('moore', 0.036), ('monotonic', 0.036), ('effort', 0.035), ('wrong', 0.035), ('eventually', 0.035), ('slower', 0.034), ('xy', 0.034), ('invariant', 0.033), ('morgan', 0.033), ('look', 0.032), ('mining', 0.032), ('fast', 0.031), ('probably', 0.031), ('weaker', 0.031), ('ef', 0.03), ('friedman', 0.03), ('node', 0.03), ('speed', 0.03), ('search', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
2 0.19836813 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
3 0.10699195 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
Author: Clayton Scott, Robert Nowak
Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
4 0.10229916 85 nips-2002-Fast Kernels for String and Tree Matching
Author: Alex J. Smola, S.v.n. Vishwanathan
Abstract: In this paper we present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynarrtic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner.
5 0.093061149 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
Author: Dan Klein, Christopher D. Manning
Abstract: We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.
6 0.091855668 107 nips-2002-Identity Uncertainty and Citation Matching
7 0.090890348 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
8 0.087735668 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
9 0.08753711 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
10 0.071500875 124 nips-2002-Learning Graphical Models with Mercer Kernels
11 0.068305835 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
12 0.06423869 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
13 0.063654177 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
14 0.062272355 57 nips-2002-Concurrent Object Recognition and Segmentation by Graph Partitioning
15 0.061451133 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
16 0.060805175 152 nips-2002-Nash Propagation for Loopy Graphical Games
17 0.059680577 74 nips-2002-Dynamic Structure Super-Resolution
18 0.057693057 10 nips-2002-A Model for Learning Variance Components of Natural Images
19 0.05752651 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
20 0.057386573 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
topicId topicWeight
[(0, -0.203), (1, -0.054), (2, -0.039), (3, 0.03), (4, -0.003), (5, 0.07), (6, -0.043), (7, -0.007), (8, -0.09), (9, -0.024), (10, 0.04), (11, 0.031), (12, 0.009), (13, -0.008), (14, -0.066), (15, -0.082), (16, 0.122), (17, 0.047), (18, 0.006), (19, -0.171), (20, -0.097), (21, 0.311), (22, -0.12), (23, 0.061), (24, -0.041), (25, -0.094), (26, -0.006), (27, -0.012), (28, 0.063), (29, -0.125), (30, 0.049), (31, -0.052), (32, 0.05), (33, 0.126), (34, -0.038), (35, -0.117), (36, -0.139), (37, -0.011), (38, -0.083), (39, 0.026), (40, -0.039), (41, -0.055), (42, -0.046), (43, 0.064), (44, -0.056), (45, 0.1), (46, 0.156), (47, 0.042), (48, -0.083), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.95658076 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
2 0.85331804 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
3 0.74565041 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
Author: Dan Klein, Christopher D. Manning
Abstract: We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.
4 0.61901039 85 nips-2002-Fast Kernels for String and Tree Matching
Author: Alex J. Smola, S.v.n. Vishwanathan
Abstract: In this paper we present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynarrtic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner.
5 0.5121389 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
Author: Clayton Scott, Robert Nowak
Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
6 0.44136479 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
7 0.42165086 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
8 0.39039305 107 nips-2002-Identity Uncertainty and Citation Matching
9 0.38974053 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
10 0.36522484 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
11 0.35374895 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
12 0.32237399 152 nips-2002-Nash Propagation for Loopy Graphical Games
13 0.29834747 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures
14 0.29124853 57 nips-2002-Concurrent Object Recognition and Segmentation by Graph Partitioning
15 0.28944859 4 nips-2002-A Differential Semantics for Jointree Algorithms
16 0.28799734 174 nips-2002-Regularized Greedy Importance Sampling
17 0.28508195 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
18 0.28288364 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
20 0.27229062 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
topicId topicWeight
[(3, 0.011), (11, 0.014), (23, 0.018), (42, 0.077), (54, 0.117), (55, 0.044), (57, 0.019), (68, 0.025), (74, 0.112), (81, 0.262), (87, 0.013), (92, 0.074), (98, 0.139)]
simIndex simValue paperId paperTitle
same-paper 1 0.79849601 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
2 0.66324562 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
3 0.65615547 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks Statistical Models. Externally, AUTO BAYES has the look and feel of 2 const int n_points as ’nr. of data points’ a compiler. Users specify their model 3 with 0 < n_points; 4 const int n_classes := 3 as ’nr. classes’ of interest in a high-level specification 5 with 0 < n_classes language (as opposed to a program6 with n_classes << n_points; ming language). The figure shows the 7 double phi(1..n_classes) as ’weights’ specification of the mixture of Gaus8 with 1 = sum(I := 1..n_classes, phi(I)); 9 double mu(1..n_classes); sians example used throughout this 9 double sigma(1..n_classes); paper.2 Note the constraint that the 10 int c(1..n_points) as ’class labels’; sum of the class probabilities must 11 c ˜ disc(vec(I := 1..n_classes, phi(I))); equal one (line 8) along with others 12 data double x(1..n_points) as ’data’; (lines 3 and 5) that make optimization 13 x(I) ˜ gauss(mu(c(I)), sigma(c(I))); of the model well-defined. Also note 14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ; the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters , , and . Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem. 1 model mog as ’Mixture of Gaussians’; ¡ £ £ £ §¤¢ £ © ¨ ¦ ¥ © ¡ ¡ £ £ £ ¨ Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nead-Melder simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code’s comments. Also, n classes has been set to three (line 4), while n points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.3 Schema-based synthesis. 
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian netNclasses works. A first-order logic representation of Bayesian µ σ networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances φ x c of i.i.d. vectors) are represented as functor arguments. discrete gauss Nclasses Since unknown index values can be represented by Npoints implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: § ¥ ¨¦¡ ¡ ¢© Lemma 1. Let be sets of variables over a Bayesian network with . Then descendents and parents hold 4 in the corresponding dependency graph iff the following probability statement holds: £ ¤ ¡ parents B % % 9 C0A@ ! 9 @8 § ¥ ¢ 2 ' % % 310 parents ©¢ £ ¡ ! ' % #!
4 0.65231156 135 nips-2002-Learning with Multiple Labels
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if classes labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naIve methods degrade rapidly as more ambiguity is introduced into the labels. 1
5 0.6522724 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
6 0.65116316 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
7 0.65051293 10 nips-2002-A Model for Learning Variance Components of Natural Images
8 0.64915013 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
9 0.64765269 53 nips-2002-Clustering with the Fisher Score
10 0.64759606 110 nips-2002-Incremental Gaussian Processes
11 0.64601851 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
12 0.64514244 27 nips-2002-An Impossibility Theorem for Clustering
13 0.64375901 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
14 0.64298141 3 nips-2002-A Convergent Form of Approximate Policy Iteration
15 0.64249372 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
16 0.64222986 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
17 0.64153761 2 nips-2002-A Bilinear Model for Sparse Coding
18 0.64097023 124 nips-2002-Learning Graphical Models with Mercer Kernels
19 0.64006335 89 nips-2002-Feature Selection by Maximum Marginal Diversity
20 0.63974106 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs