nips nips2002 nips2002-80 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. [sent-7, score-0.159]
2 The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. [sent-8, score-0.184]
3 A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. [sent-11, score-0.391]
4 We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. [sent-12, score-0.697]
5 Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. [sent-14, score-0.333]
6 An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions. [sent-16, score-0.151]
7 In previous work [2], we have shown how to use convex combinations of tree-structured distributions in order to upper bound the log partition function. [sent-21, score-0.37]
8 In this paper, we apply similar ideas to upper bound the log probability of the MAP configuration. [sent-22, score-0.219]
9 As we show, this upper bound is met with equality whenever there is a configuration that is optimal for all trees, in which case it must also be a MAP configuration for the original problem. [sent-23, score-0.423]
10 This analysis is related to the max-product algorithm [3, 4, 5], a well-known method for attempting to compute the MAP configuration, one which is exact for trees but approximate for graphs with cycles. [sent-26, score-0.265]
11 One contribution of this paper is to develop a tree-reweighted max-product algorithm that attempts to find a collection of tree-structured problems that share a common optimum. [sent-28, score-0.242]
12 This algorithm, though similar to both the standard and attenuated max-product updates [4], differs in key ways. [sent-29, score-0.133]
13 The next two subsections provide background on exponential families and convex combinations. [sent-31, score-0.189]
14 In Section 2, we introduce the basic form of the upper bounds on the log probability of the MAP assignment, and then develop necessary and sufficient conditions for it to be tight (i.e., met with equality). [sent-32, score-0.298]
15 In Section 3, we develop tree-reweighted max-product algorithms for attempting to find a convex combination of trees that yields a tight bound. [sent-35, score-0.448]
16 We prove that for positive compatibility functions, the algorithm always has at least one fixed point; moreover, if a key uniqueness condition is satisfied, the configuration specified by a fixed point must be MAP optimal. [sent-36, score-0.39]
17 We also illustrate how the algorithm, like the standard max-product algorithm [5], can fail if the uniqueness condition is not satisfied. [sent-37, score-0.281]
18 We make use of the following exponential representation of a graph-structured distribution p(x; θ). [sent-49, score-0.111]
19 For some index set I, we let φ = {φ_α, α ∈ I} denote a collection of potential functions defined on the cliques of G, and let θ be a vector of real-valued weights on these potential functions. [sent-50, score-0.177]
20 The exponential family determined by φ is the collection of distributions p(x; θ) ∝ exp{⟨θ, φ(x)⟩}. [sent-51, score-0.177]
21 Here the indicator function is equal to one if its argument holds; in this case, the index set consists of the union of node and edge indices. Of interest to us is the maximum a posteriori configuration x_MAP = arg max_x p(x; θ̄). [sent-60, score-0.202]
22 Equivalently, we can express this MAP configuration as the solution of the integer program max_x ⟨θ̄, φ(x)⟩, where Φ(θ) := max_x ⟨θ, φ(x)⟩ (2). Note that the function Φ is the maximum of a collection of linear functions, and hence is convex [6] as a function of θ, which is a key property for our subsequent development. [sent-61, score-0.252]
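The convexity property invoked above is easy to check numerically. The sketch below is our own illustration (the chain graph, the indicator feature map, and all variable names are assumptions, not taken from the paper): it evaluates the MAP value Φ(θ) = max_x ⟨θ, φ(x)⟩ by brute force and confirms that it behaves as a convex function of θ, being a pointwise maximum of linear functions.

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2)]                      # a small chain; any graph would do
n_nodes, n_states = 3, 2

def features(x):
    """Overcomplete indicator features phi(x): one per (node, state) and (edge, state pair)."""
    f_node = np.zeros((n_nodes, n_states))
    f_edge = np.zeros((len(edges), n_states, n_states))
    for s in range(n_nodes):
        f_node[s, x[s]] = 1.0
    for e, (s, t) in enumerate(edges):
        f_edge[e, x[s], x[t]] = 1.0
    return np.concatenate([f_node.ravel(), f_edge.ravel()])

def map_value(theta):
    """Phi(theta) = max_x <theta, phi(x)>, by brute force over all configurations."""
    return max(theta @ features(x)
               for x in itertools.product(range(n_states), repeat=n_nodes))

rng = np.random.default_rng(0)
dim = n_nodes * n_states + len(edges) * n_states ** 2
theta1, theta2 = rng.normal(size=dim), rng.normal(size=dim)
lam = 0.3

# Phi is a pointwise maximum of the linear functions theta -> <theta, phi(x)>,
# hence convex: Phi(lam*t1 + (1-lam)*t2) <= lam*Phi(t1) + (1-lam)*Phi(t2).
lhs = map_value(lam * theta1 + (1.0 - lam) * theta2)
rhs = lam * map_value(theta1) + (1.0 - lam) * map_value(theta2)
assert lhs <= rhs + 1e-12
```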
23 2 Convex combinations of trees. Let θ̄ be a particular parameter vector for which we are interested in computing the MAP value; in this section, we show how to derive upper bounds via the convexity of Φ. [sent-63, score-0.697]
24 Let T denote a particular spanning tree of G, and let 𝒯 denote the set of all spanning trees. [sent-65, score-0.651]
25 For each spanning tree T, let θ(T) be an exponential parameter vector of the same dimension as θ̄ that respects the structure of T. [sent-66, score-0.522]
26 To be explicit, if T is defined by an edge set E(T), then θ(T) must have zeros in all elements corresponding to edges not in E(T). [sent-67, score-0.202]
27 However, given an edge (s, t) belonging to two trees T1 and T2, the quantity θ_st(T1) can be different than θ_st(T2). [sent-68, score-0.297]
28 For compactness, let θ ≡ {θ(T), T ∈ 𝒯} denote the full collection, where the notation θ(T) specifies those subelements of θ corresponding to spanning tree T. [sent-69, score-0.448]
30 In order to define a convex combination, we require a probability distribution μ over the set of spanning trees — that is, a vector μ ≡ {μ(T), T ∈ 𝒯} such that μ(T) ≥ 0 and Σ_T μ(T) = 1. [sent-72, score-0.52]
31 For any distribution μ, we define its support, denoted by supp(μ), to be the set of trees to which it assigns strictly positive probability. [sent-73, score-0.202]
32 In the sequel, we will also be interested in the probability μ_e(e) that a given edge e appears in a spanning tree T chosen randomly under μ. [sent-74, score-0.543]
33 We let μ_e ≡ {μ_e(e), e ∈ E} represent a vector of edge appearance probabilities, which must belong to the spanning tree polytope [see 2]. [sent-75, score-0.656]
34 We say that a distribution μ (or the vector μ_e) is valid if μ_e(e) > 0 for every edge e. [sent-76, score-0.169]
35 A convex combination of exponential parameter vectors is defined via the weighted sum Σ_T μ(T) θ(T), which we denote compactly as E_μ[θ(T)]. [sent-77, score-0.224]
36 Of particular importance are collections of exponential parameters for which there exists a convex combination that is equal to θ̄. [sent-78, score-0.224]
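To make the preceding definitions concrete, here is a minimal sketch (ours, not the authors'; the edge weights and the particular admissible split of θ̄ are assumed example values) for a single-cycle graph: it computes the edge appearance probabilities μ_e under a uniform distribution over the four spanning trees and checks the admissibility condition Σ_T μ(T) θ(T) = θ̄.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]      # single cycle on four nodes
theta_bar = np.array([1.0, 1.0, 1.0, 1.0])     # target edge parameters (example values)

# Each of the four spanning trees omits exactly one edge of the cycle.
trees = [[e for e in range(4) if e != omitted] for omitted in range(4)]
mu = np.full(4, 0.25)                          # uniform distribution over the four trees

# Edge appearance probabilities: mu_e(e) = Pr{edge e belongs to a tree drawn from mu}.
mu_e = np.array([sum(mu[i] for i, T in enumerate(trees) if e in T) for e in range(4)])
print(mu_e)                                    # -> [0.75 0.75 0.75 0.75]

# One admissible choice: share theta_bar only among the trees that contain each edge,
# theta(T)_e = theta_bar_e / mu_e(e) if e is in T, and 0 otherwise.
theta_T = np.zeros((4, 4))
for i, T in enumerate(trees):
    for e in T:
        theta_T[i, e] = theta_bar[e] / mu_e[e]

# Admissibility: the mu-weighted convex combination recovers the target parameter.
assert np.allclose(mu @ theta_T, theta_bar)
```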
37 Consider a target distribution in the minimal Ising form; otherwise stated, the target distribution is specified by a minimal parameter θ̄, where the zeros represent the fact that θ̄_s = 0 for all nodes s. [sent-83, score-0.105]
38 Figure 1. [sent-84, score-0.384]
39 A convex combination of four distributions, each defined by a spanning tree, is used to approximate the target distribution on the single-cycle graph. [sent-85, score-0.478]
40 The four possible spanning trees on a single cycle on four nodes are illustrated in Figure 1. [sent-86, score-0.442]
41 We define a set of associated exponential parameters θ(T) as follows. [sent-87, score-0.359]
42 Finally, we choose μ(T) = 1/4 for all four trees, so that we have the same edge appearance probability μ_e(e) = 3/4 for each edge. 2 Optimal upper bounds. With the set-up of the previous section, the basic form of the upper bounds follows by applying Jensen's inequality [6]. [sent-90, score-0.354]
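In our notation (the symbols Φ, φ, θ̄ and μ are assumptions on our part, since the extraction dropped the original formula), the bound referred to as equation (3) follows from convexity and admissibility as:

```latex
% Our reading of the bound (3); Phi, phi, and theta-bar are our notation.
\Phi(\bar{\theta})
  \;=\; \Phi\Bigl(\textstyle\sum_{T} \mu(T)\,\theta(T)\Bigr)
  \;\le\; \sum_{T} \mu(T)\,\Phi\bigl(\theta(T)\bigr),
\qquad\text{where}\quad
\Phi(\theta) \;:=\; \max_{x}\,\langle \theta, \phi(x)\rangle .
```

Equality requires a configuration that simultaneously maximizes every tree problem, which is exactly the agreement condition analyzed next.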
43 In this case, it is clear that the bound (3) is met with equality. [sent-95, score-0.168]
44 An important implication is that such a configuration also attains the maximum defining Φ(θ̄), so that it is an optimal solution to the original problem. [sent-96, score-0.158]
45 More formally, for any exponential parameter vector θ, let OPT(θ) be the collection of configurations that attain the maximum defining Φ(θ), defined as follows: OPT(θ) := {x | ⟨θ, φ(x)⟩ = Φ(θ)} (4). With this notation, the critical property is that the intersection ∩_T OPT(θ(T)) of configurations optimal for all tree-structured problems is non-empty. [sent-98, score-0.355]
46 The bound of equation (3) is tight if and only if there exists a configuration x* that, for each tree T, achieves the maximum defining Φ(θ(T)). [sent-100, score-0.194]
47 Therefore, the bound is met with equality if and only if some configuration achieves the maximum defining Φ(θ(T)) for all trees T in the support of μ, i.e., belongs to OPT(θ(T)) for every such tree. [sent-106, score-0.431]
48 Proposition 1 motivates the following strategy: given a spanning tree distribution μ, find a collection of exponential parameters θ ≡ {θ(T)} such that the following holds: (a) Admissibility: The pair (μ, θ) satisfies Σ_T μ(T) θ(T) = θ̄. [sent-107, score-0.625]
49 If (for a fixed μ) we are able to find a collection θ satisfying these two properties, then Proposition 1 guarantees that all configurations in the (non-empty) intersection ∩_T OPT(θ(T)) achieve the maximum defining Φ(θ̄). [sent-109, score-0.212]
50 As discussed above, assuming that μ assigns strictly positive probability to every edge in the graph, satisfying the admissibility condition is not difficult. [sent-110, score-0.38]
51 It is the second condition of mutual optimality on all trees that poses the challenge. [sent-111, score-0.368]
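As a hedged illustration of the two conditions (the couplings, the admissible choice of θ(T), and the spin-style features are all assumed example choices, and brute-force enumeration is only viable for toy problems), the sketch below verifies mutual agreement by intersecting the optimum sets OPT(θ(T)) of the four tree problems from the single-cycle example:

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # single cycle on four nodes
trees = [[e for e in range(4) if e != omit] for omit in range(4)]
mu_e = 0.75                                   # each edge appears in 3 of the 4 trees
theta_bar = np.ones(4)                        # attractive couplings (example values)
theta_T = np.array([[theta_bar[e] / mu_e if e in T else 0.0 for e in range(4)]
                    for T in trees])          # admissible split: sum_T (1/4) theta(T) = theta_bar

def value(theta_edges, x):
    """<theta, phi(x)> with edge features equal to +1 if x_s = x_t and -1 otherwise."""
    return sum(theta_edges[e] * (1.0 if x[s] == x[t] else -1.0)
               for e, (s, t) in enumerate(edges))

def opt_set(theta_edges):
    """OPT(theta): all configurations attaining the maximum defining Phi(theta)."""
    configs = list(itertools.product((0, 1), repeat=4))
    vals = [value(theta_edges, x) for x in configs]
    best = max(vals)
    return {x for x, v in zip(configs, vals) if np.isclose(v, best)}

# Mutual agreement: the four tree optima share a configuration, so the bound is
# tight and any shared configuration is MAP for the original problem on the cycle.
shared = set.intersection(*(opt_set(theta_T[i]) for i in range(4)))
print(shared)    # {(0, 0, 0, 0), (1, 1, 1, 1)}: the two all-equal configurations
```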
52 3 Mutual agreement via equal max-marginals. We now develop an algorithm that attempts to find, for a given spanning tree distribution μ, a collection θ satisfying both of these properties. [sent-112, score-0.725]
53 Interestingly, this algorithm is related to the ordinary max-product algorithm [3, 5], but differs in several key ways. [sent-113, score-0.178]
54 For each node s, the single-node max-marginal is ν_s(x_s) ∝ max_{x' : x'_s = x_s} p(x'; θ); for each edge (s, t), the pairwise max-marginal is defined analogously as ν_st(x_s, x_t) ∝ max_{x' : (x'_s, x'_t) = (x_s, x_t)} p(x'; θ). [sent-121, score-0.132]
55 With these definitions, the max-marginal tree factorization [1] is given by p(x; θ) ∝ Π_s ν_s(x_s) Π_{(s,t)} [ν_st(x_s, x_t) / (ν_s(x_s) ν_t(x_t))] (6). One interpretation of the ordinary max-product algorithm for trees, as shown in our related work [5], is as computing this alternative representation. [sent-122, score-0.354]
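The tree factorization just stated can be checked directly on a small example. The following sketch is our own construction with assumed potentials (not the paper's code): it computes max-marginals by brute force on a three-node chain and confirms that the product form reproduces the distribution up to a single constant.

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2)]
rng = np.random.default_rng(1)
psi_node = rng.uniform(0.5, 2.0, size=(3, 2))              # positive node potentials
psi_edge = rng.uniform(0.5, 2.0, size=(len(edges), 2, 2))  # positive edge potentials

def p(x):
    """Unnormalised tree-structured distribution on the chain."""
    val = np.prod([psi_node[s, x[s]] for s in range(3)])
    for e, (s, t) in enumerate(edges):
        val *= psi_edge[e, x[s], x[t]]
    return val

configs = list(itertools.product((0, 1), repeat=3))
nu_node = np.array([[max(p(x) for x in configs if x[s] == j) for j in (0, 1)]
                    for s in range(3)])
nu_edge = np.array([[[max(p(x) for x in configs if (x[s], x[t]) == (j, k))
                      for k in (0, 1)] for j in (0, 1)]
                    for (s, t) in edges])

def factorized(x):
    """prod_s nu_s(x_s) * prod_(s,t) nu_st(x_s,x_t) / (nu_s(x_s) nu_t(x_t))."""
    val = np.prod([nu_node[s, x[s]] for s in range(3)])
    for e, (s, t) in enumerate(edges):
        val *= nu_edge[e, x[s], x[t]] / (nu_node[s, x[s]] * nu_node[t, x[t]])
    return val

# On a tree, p(x) and the factorized form agree up to one global constant.
ratios = [factorized(x) / p(x) for x in configs]
assert np.allclose(ratios, ratios[0])
```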
56 Suppose moreover that for each node s, the following uniqueness condition holds: the max-marginal ν_s has a unique optimum x*_s. [sent-124, score-0.321]
57 2 Tree-reweighted max-product The tree-reweighted max-product method is a message-passing algorithm, with fixed points that specify a collection of tree exponential parameters satisfying the admissibility condition. [sent-127, score-0.547]
58 The defining feature of this collection is that the associated tree distributions all share a common set of max-marginals. [sent-128, score-0.247]
59 In particular, for a given tree T with edge set E(T), the distribution p(x; θ(T)) is specified compactly by the subcollection of max-marginals on the nodes and edges of T, as in equation (7). [sent-129, score-0.451]
60 This mutual agreement on trees, in conjunction with the admissibility of the collection, implies that the shared configuration is also the MAP configuration for p(x; θ̄). [sent-133, score-0.262]
61 For each edge (s, t), let M_ts be the message passed from node t to node s. [sent-135, score-0.457]
62 We use the messages M as shorthand to specify a set of functions: for each tree, the relevant subcollection of messages can be used to define a tree-structured distribution, in a manner analogous to equation (7). [sent-145, score-0.2]
63 Given any collection of tree parameters defined from the messages as in equations (8a) and (8b), the convex combination Σ_T μ(T) θ(T) is equal to θ̄ up to an additive constant. [sent-147, score-0.253]
64 It is sufficient [1, 5] to impose, for each edge (s, t), the edgewise consistency condition max_{x'_t} ν_st(x_s, x'_t) ∝ ν_s(x_s) on each tree distribution. [sent-149, score-0.531]
65 In order to enforce this condition, we update the messages in the following manner. Algorithm 1 (Tree-reweighted max-product). [sent-150, score-0.129]
66 For each iteration and each edge, update the messages according to equation (9). [sent-153, score-0.129]
67 The message update equation (9) is similar to the standard max-product algorithm [3, 5]. [sent-159, score-0.251]
68 Indeed, if G is actually a tree, then we must have μ_e(e) = 1 for every edge e, in which case equation (9) is precisely equivalent to the ordinary max-product update. [sent-160, score-0.321]
69 However, if G has cycles, then it is impossible to have μ_e(e) = 1 for every edge, so that the updates in equation (9) differ from ordinary max-product in some key ways. [sent-161, score-0.342]
70 First of all, the weight on the potential function is scaled by the (inverse of the) edge appearance probability μ_e(s, t). [sent-162, score-0.169]
71 Secondly, for each neighbor v, the incoming message M_vt is scaled by the corresponding edge appearance probability μ_e(v, t). [sent-163, score-0.313]
72 Third of all, in sharp contrast to standard [3] and attenuated [4] max-product updates, the update of the message M_ts — that is, from t to s along edge (s, t) — depends on the message M_st in the reverse direction, from s to t along the same edge. [sent-164, score-0.532]
73 Despite these differences, the messages can be updated synchronously as in ordinary max-product. [sent-165, score-0.204]
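The three differences just listed can be made concrete in a toy implementation. The sketch below is our reading of the reweighted update (the symbols rho, theta and M, the example graph, and the exact form of the update are assumptions on our part, since equation (9) is not legible in this extraction): the pairwise potential is rescaled by the inverse edge appearance probability, incoming messages are weighted by their edge appearance probabilities, and the reverse-direction message enters each update.

```python
import numpy as np

n_states = 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                  # single cycle on four nodes
rho = {e: 0.75 for e in edges}                             # edge appearance probabilities
rng = np.random.default_rng(2)
theta_node = rng.normal(size=(4, n_states))                # node parameters (example values)
theta_edge = {e: rng.normal(size=(n_states, n_states)) for e in edges}

def neighbors(t):
    return [a if b == t else b for (a, b) in edges if t in (a, b)]

def edge_params(s, t):
    """Edge parameter table indexed as [x_s, x_t], plus its edge appearance probability."""
    if (s, t) in theta_edge:
        return theta_edge[(s, t)], rho[(s, t)]
    return theta_edge[(t, s)].T, rho[(t, s)]

# M[(t, s)][x_s] is the message passed from node t to node s; start from all-ones.
M = {}
for (s, t) in edges:
    M[(s, t)] = np.ones(n_states)
    M[(t, s)] = np.ones(n_states)

def trw_update(M):
    """One synchronous round of the reweighted update (our reading, not the authors' code)."""
    M_new = {}
    for (t, s) in M:
        th, r = edge_params(s, t)                          # th indexed as [x_s, x_t]
        out = np.empty(n_states)
        for xs in range(n_states):
            best = -np.inf
            for xt in range(n_states):
                val = th[xs, xt] / r + theta_node[t, xt]   # reweighted potential + node term
                for v in neighbors(t):
                    if v != s:
                        _, r_vt = edge_params(v, t)
                        val += r_vt * np.log(M[(v, t)][xt])    # incoming messages, scaled
                val -= (1.0 - r) * np.log(M[(s, t)][xt])        # reverse-direction message
                best = max(best, val)
            out[xs] = best
        M_new[(t, s)] = np.exp(out - out.max())            # normalise for numerical stability
    return M_new

for _ in range(50):
    M = trw_update(M)
```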
74 It is also possible to perform reparameterization updates over spanning trees, analogous to but distinct from those for ordinary max-product [5]. [sent-166, score-0.413]
75 Such tree-based updates can be terminated once the trees agree on a common configuration, which may happen prior to message convergence [7]. [sent-167, score-0.369]
76 3 Analysis of fixed points In related work [5], we established the existence of fixed points for the ordinary max-product algorithm for positive compatibility functions on an arbitrary graph. [sent-170, score-0.221]
77 For each spanning tree T, the fixed point defines a tree-structured distribution p(x; θ(T)) via equation (7). [sent-175, score-0.484]
78 By Lemma 2, the elements of this collection are edgewise consistent. [sent-176, score-0.246]
79 By the equivalence of edgewise and global consistency for trees [1], the subcollection of pseudo-max-marginals are exact max-marginals for the tree-structured distribution p(x; θ(T)). [sent-177, score-0.646]
80 As a consequence, the shared configuration must belong to OPT(θ(T)) for each tree T, so that mutual agreement is satisfied. [sent-178, score-0.381]
81 By Lemma 1, the convex combination Σ_T μ(T) θ(T) is equal to θ̄, so that admissibility is satisfied. [sent-179, score-0.278]
82 4 Failures of tree-reweighted max-product In all of our experiments so far, the message updates of equation (9), if suitably relaxed, have always converged. [sent-182, score-0.24]
83 If this assumption is not satisfied, we are no longer guaranteed that the mutual agreement condition holds (the intersection ∩_T OPT(θ(T)) may be empty). [sent-184, score-0.22]
84 Observe that the symmetry of this construction ensures that it satisfies the edgewise consistency condition (Lemma 2) for any choice of the parameters. [sent-194, score-0.399]
85 For each of the three spanning trees of this graph, the collection defines a tree-structured distribution as in equation (7). [sent-195, score-0.544]
86 On the other hand, in the second regime, as illustrated in panel (c), no single configuration can be edgewise optimal for all three edges simultaneously. [sent-199, score-0.387]
87 It should also be noted that, as shown in our related work [5], the standard max-... Footnote 2: In a relaxed message update, we take an α-step towards the new (log) message, where α is the step size parameter. [sent-202, score-0.191]
88 To date, we have not been able to prove that relaxed updates will always converge. [sent-203, score-0.141]
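A minimal sketch of what we take the relaxed update to mean (the names old, new and alpha are ours): step only a fraction alpha towards the newly computed messages in the log domain, with alpha = 1 recovering the plain update.

```python
import numpy as np

def relax(old, new, alpha=0.5):
    """Damped update: log M <- (1 - alpha) * log M_old + alpha * log M_new."""
    return {key: np.exp((1.0 - alpha) * np.log(old[key]) + alpha * np.log(new[key]))
            for key in old}

# Usage with the message tables from the previous sketch:
#   M = relax(M, trw_update(M), alpha=0.5)
```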
89 (b) For one range of the coupling, two configurations are both node and edgewise optimal. [sent-209, score-0.209]
90 (c) For the other range, no configurations are node and edgewise optimal on the full graph. [sent-210, score-0.352]
91 4 Discussion This paper demonstrated the utility of convex combinations of tree-structured distributions in upper bounding the log probability of the MAP configuration. [sent-212, score-0.3]
92 We developed a family of tree-reweighted max-product algorithms for computing optimal upper bounds. [sent-213, score-0.152]
93 In certain cases, the optimal upper bound is met with equality, and hence yields an exact MAP configuration for the original problem on the graph with cycles. [sent-214, score-0.411]
94 An important open question is to characterize the range of problems for which the upper bound is tight. [sent-215, score-0.22]
95 For problems involving a binary-valued random vector, we have isolated a class of problems for which the upper bound is guaranteed to be tight. [sent-216, score-0.252]
96 We have also investigated the Lagrangian dual associated with the upper bound (3). [sent-217, score-0.188]
97 Finally, the analysis and upper bounds of this paper can be extended in a straightforward manner to hypertrees of higher width. [sent-219, score-0.177]
98 In this context, hypertree-reweighted forms of generalized max-product updates [see 5] can again be used to find optimal upper bounds, which (when they are tight) again yield exact MAP configurations. [sent-220, score-0.269]
99 A new class of upper bounds on the log partition function. [sent-240, score-0.208]
100 Tree consistency and bounds on the max-product algorithm and its generalizations. [sent-265, score-0.197]
wordName wordTfidf (topN-words)
[('guration', 0.324), ('cbf', 0.295), ('ye', 0.285), ('edgewise', 0.246), ('tree', 0.208), ('spanning', 0.203), ('con', 0.172), ('hv', 0.171), ('trees', 0.165), ('uniqueness', 0.163), ('wv', 0.153), ('message', 0.144), ('edge', 0.132), ('admissibility', 0.128), ('upper', 0.118), ('convex', 0.115), ('ordinary', 0.114), ('map', 0.111), ('collection', 0.103), ('vv', 0.103), ('gurations', 0.102), ('fge', 0.098), ('sye', 0.098), ('met', 0.098), ('satis', 0.092), ('messages', 0.09), ('condition', 0.086), ('ning', 0.083), ('exponential', 0.074), ('subcollection', 0.074), ('wainwright', 0.073), ('attenuated', 0.073), ('agreement', 0.072), ('node', 0.072), ('bound', 0.07), ('ff', 0.069), ('consistency', 0.067), ('tn', 0.065), ('equality', 0.064), ('mutual', 0.062), ('sn', 0.06), ('updates', 0.06), ('bounds', 0.059), ('exact', 0.057), ('optimality', 0.055), ('tight', 0.054), ('attains', 0.051), ('proposition', 0.05), ('duts', 0.049), ('ihf', 0.049), ('martinw', 0.049), ('nitions', 0.048), ('xed', 0.048), ('de', 0.047), ('relaxed', 0.047), ('ne', 0.046), ('ip', 0.045), ('attempting', 0.043), ('rag', 0.043), ('geb', 0.043), ('ising', 0.043), ('el', 0.042), ('cycle', 0.041), ('ge', 0.041), ('intersection', 0.041), ('ba', 0.04), ('must', 0.039), ('maxproduct', 0.039), ('eecs', 0.039), ('implication', 0.039), ('share', 0.039), ('update', 0.039), ('lemma', 0.038), ('appearance', 0.037), ('let', 0.037), ('distribution', 0.037), ('compatibility', 0.036), ('reparameterization', 0.036), ('posteriori', 0.036), ('equation', 0.036), ('develop', 0.036), ('combinations', 0.036), ('combination', 0.035), ('panel', 0.035), ('satisfying', 0.034), ('cf', 0.034), ('convexity', 0.034), ('optimal', 0.034), ('graph', 0.034), ('prove', 0.034), ('maximum', 0.034), ('jaakkola', 0.034), ('illustrated', 0.033), ('cb', 0.033), ('problems', 0.032), ('algorithm', 0.032), ('fx', 0.031), ('zeros', 0.031), ('nd', 0.031), ('log', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
2 0.19836813 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
3 0.10694863 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
Author: Clayton Scott, Robert Nowak
Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
4 0.097952381 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
Author: Tong Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert function Spaces. We introduce a concept of scalesensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergent rates are optimal under various circumstances.
5 0.097713523 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
Author: Tom Heskes
Abstract: We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications. 1
6 0.09760195 85 nips-2002-Fast Kernels for String and Tree Matching
7 0.09497112 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
8 0.088919856 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
9 0.079170637 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
10 0.078284509 156 nips-2002-On the Complexity of Learning the Kernel Matrix
11 0.076571904 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
12 0.076205678 27 nips-2002-An Impossibility Theorem for Clustering
13 0.073465154 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
14 0.070450604 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
15 0.067265891 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
16 0.065933704 94 nips-2002-Fractional Belief Propagation
17 0.062383626 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
18 0.062076844 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
19 0.060047302 152 nips-2002-Nash Propagation for Loopy Graphical Games
20 0.05613514 174 nips-2002-Regularized Greedy Importance Sampling
topicId topicWeight
[(0, -0.178), (1, -0.067), (2, -0.055), (3, -0.001), (4, -0.014), (5, 0.064), (6, -0.043), (7, 0.039), (8, -0.15), (9, 0.002), (10, 0.086), (11, 0.007), (12, -0.008), (13, -0.076), (14, -0.031), (15, -0.079), (16, 0.131), (17, 0.022), (18, -0.05), (19, -0.186), (20, -0.126), (21, 0.317), (22, -0.116), (23, 0.017), (24, -0.055), (25, -0.05), (26, 0.075), (27, -0.042), (28, 0.129), (29, -0.107), (30, -0.073), (31, -0.055), (32, 0.033), (33, 0.04), (34, 0.021), (35, -0.177), (36, -0.176), (37, -0.051), (38, -0.067), (39, 0.028), (40, -0.016), (41, -0.006), (42, -0.01), (43, 0.171), (44, 0.014), (45, 0.002), (46, 0.099), (47, 0.003), (48, -0.076), (49, 0.088)]
simIndex simValue paperId paperTitle
same-paper 1 0.97240269 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
2 0.82530862 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
3 0.57355052 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
Author: Dan Klein, Christopher D. Manning
Abstract: We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.
4 0.53261352 85 nips-2002-Fast Kernels for String and Tree Matching
Author: Alex J. Smola, S.v.n. Vishwanathan
Abstract: In this paper we present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynamic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner. 1
5 0.47796392 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
Author: Nicholas P. Hughes, David Lowe
Abstract: We consider the problem of illusory or artefactual structure from the visualisation of high-dimensional structureless data. In particular we examine the role of the distance metric in the use of topographic mappings based on the statistical field of multidimensional scaling. We show that the use of a squared Euclidean metric (i.e. the SS TRESS measure) gives rise to an annular structure when the input data is drawn from a highdimensional isotropic distribution, and we provide a theoretical justification for this observation.
6 0.46700564 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
7 0.41309297 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
8 0.40405113 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
9 0.37858272 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
10 0.35225832 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
11 0.3379541 152 nips-2002-Nash Propagation for Loopy Graphical Games
12 0.32174194 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
13 0.31402522 4 nips-2002-A Differential Semantics for Jointree Algorithms
14 0.30326879 107 nips-2002-Identity Uncertainty and Citation Matching
15 0.29684889 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
16 0.27614987 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
17 0.27516782 142 nips-2002-Maximum Likelihood and the Information Bottleneck
18 0.26686394 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
19 0.26450461 174 nips-2002-Regularized Greedy Importance Sampling
20 0.26405755 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes
topicId topicWeight
[(11, 0.024), (23, 0.012), (42, 0.053), (54, 0.148), (55, 0.039), (57, 0.022), (67, 0.014), (68, 0.029), (74, 0.099), (92, 0.085), (97, 0.262), (98, 0.116)]
simIndex simValue paperId paperTitle
same-paper 1 0.80129814 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
2 0.66938275 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks. Statistical Models. Externally, AUTO BAYES has the look and feel of a compiler. Users specify their model of interest in a high-level specification language (as opposed to a programming language). The figure shows the specification of the mixture of Gaussians example used throughout this paper. Note the constraint that the sum of the class probabilities must equal one (line 8), along with others (lines 3 and 5) that make optimization of the model well-defined. Also note the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters phi, mu, and sigma. Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem.
1 model mog as 'Mixture of Gaussians';
2 const int n_points as 'nr. of data points'
3 with 0 < n_points;
4 const int n_classes := 3 as 'nr. classes'
5 with 0 < n_classes
6 with n_classes << n_points;
7 double phi(1..n_classes) as 'weights'
8 with 1 = sum(I := 1..n_classes, phi(I));
9 double mu(1..n_classes);
9 double sigma(1..n_classes);
10 int c(1..n_points) as 'class labels';
11 c ~ disc(vec(I := 1..n_classes, phi(I)));
12 data double x(1..n_points) as 'data';
13 x(I) ~ gauss(mu(c(I)), sigma(c(I)));
14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ;
Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nelder-Mead simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code's comments. Also, n_classes has been set to three (line 4), while n_points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task. Schema-based synthesis. The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian networks. A first-order logic representation of Bayesian networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances of i.i.d. vectors) are represented as functor arguments. Since unknown index values can be represented by implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen's framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node (Lemma 1).
3 0.65728956 27 nips-2002-An Impossibility Theorem for Clustering
Author: Jon M. Kleinberg
Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1
4 0.65647298 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
5 0.65371692 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
6 0.65143758 53 nips-2002-Clustering with the Fisher Score
7 0.64980793 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
8 0.64762408 10 nips-2002-A Model for Learning Variance Components of Natural Images
9 0.64705813 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
10 0.64448279 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
11 0.64446139 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
12 0.64211011 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
13 0.64195478 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
14 0.6410861 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
15 0.64009881 124 nips-2002-Learning Graphical Models with Mercer Kernels
16 0.63976109 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
17 0.63951629 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
18 0.63827717 3 nips-2002-A Convergent Form of Approximate Policy Iteration
19 0.63826305 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
20 0.63770211 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories