nips nips2002 nips2002-80 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. [sent-7, score-0.159]
2 The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. [sent-8, score-0.184]
3 A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. [sent-11, score-0.391]
4 We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. [sent-12, score-0.697]
5 Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. [sent-14, score-0.333]
6 An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions. [sent-16, score-0.151]
7 In previous work [2], we have shown how to use convex combinations of tree-structured distributions in order to upper bound the log partition function. [sent-21, score-0.37]
8 In this paper, we apply similar ideas to upper bound the log probability of the MAP configuration. [sent-22, score-0.219]
9 As we show, this upper bound is met with equality whenever there is a configuration that is optimal for all trees, in which case it must also be a MAP configuration for the original problem. [sent-23, score-0.423]
10 This analysis is related to the max-product algorithm [3, 4, 5], a well-known method for attempting to compute the MAP configuration, one which is exact for trees but approximate for graphs with cycles. [sent-26, score-0.265]
11 One contribution of this paper is to develop a tree-reweighted max-product algorithm that attempts to find a collection of tree-structured problems that share a common optimum. [sent-28, score-0.242]
12 This algorithm, though similar to both the standard and attenuated max-product updates [4], differs in key ways. [sent-29, score-0.133]
13 The next two subsections provide background on exponential families and convex combinations. [sent-31, score-0.189]
14 In Section 2, we introduce the basic form of the upper bounds on the log probability of the MAP assignment, and then develop necessary and sufficient conditions for it to be tight (i.e., met with equality). [sent-32, score-0.298]
15 In Section 3, we develop tree-reweighted max-product algorithms for attempting to find a convex combination of trees that yields a tight bound. [sent-35, score-0.448]
16 We prove that for positive compatibility functions, the algorithm always has at least one fixed point; moreover, if a key uniqueness condition is satisfied, the configuration specified by a fixed point must be MAP optimal. [sent-36, score-0.39]
17 We also illustrate how the algorithm, like the standard max-product algorithm [5], can fail if the uniqueness condition is not satisfied. [sent-37, score-0.281]
18 We make use of the following exponential representation of a graph-structured distribution p(x; θ). [sent-49, score-0.111]
19 For some index set I, we let φ = {φ_α, α ∈ I} denote a collection of potential functions defined on the cliques of G, and let θ be a vector of real-valued weights on these potential functions. [sent-50, score-0.177]
20 The exponential family determined by φ is the collection of distributions p(x; θ) ∝ exp{⟨θ, φ(x)⟩}. [sent-51, score-0.177]
21 Here the indicator function is equal to one if its argument holds; in this case, the index set consists of the union of node and edge indices. Of interest to us is the maximum a posteriori configuration x_MAP = arg max_x p(x; θ̄). [sent-60, score-0.202]
22 Equivalently, we can express this MAP configuration as the solution of the integer program max_x ⟨θ̄, φ(x)⟩, where Φ(θ) := max_x ⟨θ, φ(x)⟩ (2). Note that the function Φ is the maximum of a collection of linear functions, and hence is convex [6] as a function of θ, which is a key property for our subsequent development. [sent-61, score-0.252]
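The convexity property invoked above is easy to check numerically. The sketch below is our own illustration (the chain graph, the indicator feature map, and all variable names are assumptions, not taken from the paper): it evaluates the MAP value Φ(θ) = max_x ⟨θ, φ(x)⟩ by brute force and confirms that it behaves as a convex function of θ, being a pointwise maximum of linear functions.

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2)]                      # a small chain; any graph would do
n_nodes, n_states = 3, 2

def features(x):
    """Overcomplete indicator features phi(x): one per (node, state) and (edge, state pair)."""
    f_node = np.zeros((n_nodes, n_states))
    f_edge = np.zeros((len(edges), n_states, n_states))
    for s in range(n_nodes):
        f_node[s, x[s]] = 1.0
    for e, (s, t) in enumerate(edges):
        f_edge[e, x[s], x[t]] = 1.0
    return np.concatenate([f_node.ravel(), f_edge.ravel()])

def map_value(theta):
    """Phi(theta) = max_x <theta, phi(x)>, by brute force over all configurations."""
    return max(theta @ features(x)
               for x in itertools.product(range(n_states), repeat=n_nodes))

rng = np.random.default_rng(0)
dim = n_nodes * n_states + len(edges) * n_states ** 2
theta1, theta2 = rng.normal(size=dim), rng.normal(size=dim)
lam = 0.3

# Phi is a pointwise maximum of the linear functions theta -> <theta, phi(x)>,
# hence convex: Phi(lam*t1 + (1-lam)*t2) <= lam*Phi(t1) + (1-lam)*Phi(t2).
lhs = map_value(lam * theta1 + (1.0 - lam) * theta2)
rhs = lam * map_value(theta1) + (1.0 - lam) * map_value(theta2)
assert lhs <= rhs + 1e-12
```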
23 2 Convex combinations of trees. Let θ̄ be a particular parameter vector for which we are interested in computing the MAP value; in this section, we show how to derive upper bounds via the convexity of Φ. [sent-63, score-0.697]
24 Let T denote a particular spanning tree of G, and let 𝒯 denote the set of all spanning trees. [sent-65, score-0.651]
25 For each spanning tree T, let θ(T) be an exponential parameter vector of the same dimension as θ̄ that respects the structure of T. [sent-66, score-0.522]
26 To be explicit, if T is defined by an edge set E(T), then θ(T) must have zeros in all elements corresponding to edges not in E(T). [sent-67, score-0.202]
27 However, given an edge (s, t) belonging to two trees T1 and T2, the quantity θ_st(T1) can be different than θ_st(T2). [sent-68, score-0.297]
28 For compactness, let θ ≡ {θ(T), T ∈ 𝒯} denote the full collection, where the notation θ(T) specifies those subelements of θ corresponding to spanning tree T. [sent-69, score-0.448]
30 In order to define a convex combination, we require a probability distribution μ over the set of spanning trees — that is, a vector μ ≡ {μ(T), T ∈ 𝒯} such that μ(T) ≥ 0 and Σ_T μ(T) = 1. [sent-72, score-0.52]
31 For any distribution μ, we define its support, denoted by supp(μ), to be the set of trees to which it assigns strictly positive probability. [sent-73, score-0.202]
32 In the sequel, we will also be interested in the probability μ_e(e) that a given edge e appears in a spanning tree T chosen randomly under μ. [sent-74, score-0.543]
33 We let μ_e ≡ {μ_e(e), e ∈ E} represent a vector of edge appearance probabilities, which must belong to the spanning tree polytope [see 2]. [sent-75, score-0.656]
34 We say that a distribution μ (or the vector μ_e) is valid if μ_e(e) > 0 for every edge e. [sent-76, score-0.169]
35 A convex combination of exponential parameter vectors is defined via the weighted sum Σ_T μ(T) θ(T), which we denote compactly as E_μ[θ(T)]. [sent-77, score-0.224]
36 Of particular importance are collections of exponential parameters for which there exists a convex combination that is equal to θ̄. [sent-78, score-0.224]
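To make the preceding definitions concrete, here is a minimal sketch (ours, not the authors'; the edge weights and the particular admissible split of θ̄ are assumed example values) for a single-cycle graph: it computes the edge appearance probabilities μ_e under a uniform distribution over the four spanning trees and checks the admissibility condition Σ_T μ(T) θ(T) = θ̄.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]      # single cycle on four nodes
theta_bar = np.array([1.0, 1.0, 1.0, 1.0])     # target edge parameters (example values)

# Each of the four spanning trees omits exactly one edge of the cycle.
trees = [[e for e in range(4) if e != omitted] for omitted in range(4)]
mu = np.full(4, 0.25)                          # uniform distribution over the four trees

# Edge appearance probabilities: mu_e(e) = Pr{edge e belongs to a tree drawn from mu}.
mu_e = np.array([sum(mu[i] for i, T in enumerate(trees) if e in T) for e in range(4)])
print(mu_e)                                    # -> [0.75 0.75 0.75 0.75]

# One admissible choice: share theta_bar only among the trees that contain each edge,
# theta(T)_e = theta_bar_e / mu_e(e) if e is in T, and 0 otherwise.
theta_T = np.zeros((4, 4))
for i, T in enumerate(trees):
    for e in T:
        theta_T[i, e] = theta_bar[e] / mu_e[e]

# Admissibility: the mu-weighted convex combination recovers the target parameter.
assert np.allclose(mu @ theta_T, theta_bar)
```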
37 Consider a target distribution in the minimal Ising form; otherwise stated, the target distribution is specified by a minimal parameter θ̄, where the zeros represent the fact that θ̄_s = 0 for all nodes s. [sent-83, score-0.105]
38 Figure 1. [sent-84, score-0.384]
39 A convex combination of four distributions, each defined by a spanning tree, is used to approximate the target distribution on the single-cycle graph. [sent-85, score-0.478]
40 The four possible spanning trees on a single cycle on four nodes are illustrated in Figure 1. [sent-86, score-0.442]
41 We define a set of associated exponential parameters θ(T) as follows. [sent-87, score-0.359]
42 Finally, we choose μ(T) = 1/4 for all four trees, so that we have the same edge appearance probability μ_e(e) = 3/4 for each edge. 2 Optimal upper bounds. With the set-up of the previous section, the basic form of the upper bounds follows by applying Jensen's inequality [6]. [sent-90, score-0.354]
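In our notation (the symbols Φ, φ, θ̄ and μ are assumptions on our part, since the extraction dropped the original formula), the bound referred to as equation (3) follows from convexity and admissibility as:

```latex
% Our reading of the bound (3); Phi, phi, and theta-bar are our notation.
\Phi(\bar{\theta})
  \;=\; \Phi\Bigl(\textstyle\sum_{T} \mu(T)\,\theta(T)\Bigr)
  \;\le\; \sum_{T} \mu(T)\,\Phi\bigl(\theta(T)\bigr),
\qquad\text{where}\quad
\Phi(\theta) \;:=\; \max_{x}\,\langle \theta, \phi(x)\rangle .
```

Equality requires a configuration that simultaneously maximizes every tree problem, which is exactly the agreement condition analyzed next.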
43 In this case, it is clear that the bound (3) is met with equality. [sent-95, score-0.168]
44 An important implication is that such a configuration also attains the maximum defining Φ(θ̄), so that it is an optimal solution to the original problem. [sent-96, score-0.158]
45 More formally, for any exponential parameter vector θ, let OPT(θ) be the collection of configurations that attain the maximum defining Φ(θ), defined as follows: OPT(θ) := {x | ⟨θ, φ(x)⟩ = Φ(θ)} (4). With this notation, the critical property is that the intersection ∩_T OPT(θ(T)) of configurations optimal for all tree-structured problems is non-empty. [sent-98, score-0.355]
46 The bound of equation (3) is tight if and only if there exists a configuration x* that, for each tree T, achieves the maximum defining Φ(θ(T)). [sent-100, score-0.194]
47 Therefore, the bound is met with equality if and only if some configuration achieves the maximum defining Φ(θ(T)) for all trees T in the support of μ, i.e., belongs to OPT(θ(T)) for every such tree. [sent-106, score-0.431]
48 Proposition 1 motivates the following strategy: given a spanning tree distribution μ, find a collection of exponential parameters θ ≡ {θ(T)} such that the following holds: (a) Admissibility: The pair (μ, θ) satisfies Σ_T μ(T) θ(T) = θ̄. [sent-107, score-0.625]
49 If (for a fixed μ) we are able to find a collection θ satisfying these two properties, then Proposition 1 guarantees that all configurations in the (non-empty) intersection ∩_T OPT(θ(T)) achieve the maximum defining Φ(θ̄). [sent-109, score-0.212]
50 As discussed above, assuming that μ assigns strictly positive probability to every edge in the graph, satisfying the admissibility condition is not difficult. [sent-110, score-0.38]
51 It is the second condition of mutual optimality on all trees that poses the challenge. [sent-111, score-0.368]
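As a hedged illustration of the two conditions (the couplings, the admissible choice of θ(T), and the spin-style features are all assumed example choices, and brute-force enumeration is only viable for toy problems), the sketch below verifies mutual agreement by intersecting the optimum sets OPT(θ(T)) of the four tree problems from the single-cycle example:

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # single cycle on four nodes
trees = [[e for e in range(4) if e != omit] for omit in range(4)]
mu_e = 0.75                                   # each edge appears in 3 of the 4 trees
theta_bar = np.ones(4)                        # attractive couplings (example values)
theta_T = np.array([[theta_bar[e] / mu_e if e in T else 0.0 for e in range(4)]
                    for T in trees])          # admissible split: sum_T (1/4) theta(T) = theta_bar

def value(theta_edges, x):
    """<theta, phi(x)> with edge features equal to +1 if x_s = x_t and -1 otherwise."""
    return sum(theta_edges[e] * (1.0 if x[s] == x[t] else -1.0)
               for e, (s, t) in enumerate(edges))

def opt_set(theta_edges):
    """OPT(theta): all configurations attaining the maximum defining Phi(theta)."""
    configs = list(itertools.product((0, 1), repeat=4))
    vals = [value(theta_edges, x) for x in configs]
    best = max(vals)
    return {x for x, v in zip(configs, vals) if np.isclose(v, best)}

# Mutual agreement: the four tree optima share a configuration, so the bound is
# tight and any shared configuration is MAP for the original problem on the cycle.
shared = set.intersection(*(opt_set(theta_T[i]) for i in range(4)))
print(shared)    # {(0, 0, 0, 0), (1, 1, 1, 1)}: the two all-equal configurations
```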
52 3 Mutual agreement via equal max-marginals. We now develop an algorithm that attempts to find, for a given spanning tree distribution μ, a collection θ satisfying both of these properties. [sent-112, score-0.725]
53 Interestingly, this algorithm is related to the ordinary max-product algorithm [3, 5], but differs in several key ways. [sent-113, score-0.178]
54 For each node s, the single-node max-marginal is ν_s(x_s) ∝ max_{x' : x'_s = x_s} p(x'; θ); for each edge (s, t), the pairwise max-marginal is defined analogously as ν_st(x_s, x_t) ∝ max_{x' : (x'_s, x'_t) = (x_s, x_t)} p(x'; θ). [sent-121, score-0.132]
55 With these definitions, the max-marginal tree factorization [1] is given by p(x; θ) ∝ Π_s ν_s(x_s) Π_{(s,t)} [ν_st(x_s, x_t) / (ν_s(x_s) ν_t(x_t))] (6). One interpretation of the ordinary max-product algorithm for trees, as shown in our related work [5], is as computing this alternative representation. [sent-122, score-0.354]
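The tree factorization just stated can be checked directly on a small example. The following sketch is our own construction with assumed potentials (not the paper's code): it computes max-marginals by brute force on a three-node chain and confirms that the product form reproduces the distribution up to a single constant.

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2)]
rng = np.random.default_rng(1)
psi_node = rng.uniform(0.5, 2.0, size=(3, 2))              # positive node potentials
psi_edge = rng.uniform(0.5, 2.0, size=(len(edges), 2, 2))  # positive edge potentials

def p(x):
    """Unnormalised tree-structured distribution on the chain."""
    val = np.prod([psi_node[s, x[s]] for s in range(3)])
    for e, (s, t) in enumerate(edges):
        val *= psi_edge[e, x[s], x[t]]
    return val

configs = list(itertools.product((0, 1), repeat=3))
nu_node = np.array([[max(p(x) for x in configs if x[s] == j) for j in (0, 1)]
                    for s in range(3)])
nu_edge = np.array([[[max(p(x) for x in configs if (x[s], x[t]) == (j, k))
                      for k in (0, 1)] for j in (0, 1)]
                    for (s, t) in edges])

def factorized(x):
    """prod_s nu_s(x_s) * prod_(s,t) nu_st(x_s,x_t) / (nu_s(x_s) nu_t(x_t))."""
    val = np.prod([nu_node[s, x[s]] for s in range(3)])
    for e, (s, t) in enumerate(edges):
        val *= nu_edge[e, x[s], x[t]] / (nu_node[s, x[s]] * nu_node[t, x[t]])
    return val

# On a tree, p(x) and the factorized form agree up to one global constant.
ratios = [factorized(x) / p(x) for x in configs]
assert np.allclose(ratios, ratios[0])
```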
56 Suppose moreover that for each node s, the following uniqueness condition holds: the max-marginal ν_s has a unique optimum x*_s. [sent-124, score-0.321]
57 2 Tree-reweighted max-product The tree-reweighted max-product method is a message-passing algorithm, with fixed points that specify a collection of tree exponential parameters satisfying the admissibility condition. [sent-127, score-0.547]
58 The defining feature of this collection is that the associated tree distributions all share a common set of max-marginals. [sent-128, score-0.247]
59 In particular, for a given tree T with edge set E(T), the distribution p(x; θ(T)) is specified compactly by the subcollection of max-marginals on the nodes and edges of T, as in equation (7). [sent-129, score-0.451]
60 This mutual agreement on trees, in conjunction with the admissibility of the collection, implies that the shared configuration is also the MAP configuration for p(x; θ̄). [sent-133, score-0.262]
61 For each edge (s, t), let M_ts be the message passed from node t to node s. [sent-135, score-0.457]
62 We use the messages M as shorthand to specify a set of functions: for each tree, the relevant subcollection of messages can be used to define a tree-structured distribution, in a manner analogous to equation (7). [sent-145, score-0.2]
63 Given any collection of tree parameters defined from the messages as in equations (8a) and (8b), the convex combination Σ_T μ(T) θ(T) is equal to θ̄ up to an additive constant. [sent-147, score-0.253]
64 It is sufficient [1, 5] to impose, for each edge (s, t), the edgewise consistency condition max_{x'_t} ν_st(x_s, x'_t) ∝ ν_s(x_s) on each tree distribution. [sent-149, score-0.531]
65 In order to enforce this condition, we update the messages in the following manner. Algorithm 1 (Tree-reweighted max-product). [sent-150, score-0.129]
66 For each iteration and each edge, update the messages according to equation (9). [sent-153, score-0.129]
67 The message update equation (9) is similar to the standard max-product algorithm [3, 5]. [sent-159, score-0.251]
68 Indeed, if G is actually a tree, then we must have μ_e(e) = 1 for every edge e, in which case equation (9) is precisely equivalent to the ordinary max-product update. [sent-160, score-0.321]
69 However, if G has cycles, then it is impossible to have μ_e(e) = 1 for every edge, so that the updates in equation (9) differ from ordinary max-product in some key ways. [sent-161, score-0.342]
70 First of all, the weight on the potential function is scaled by the (inverse of the) edge appearance probability μ_e(s, t). [sent-162, score-0.169]
71 Secondly, for each neighbor v, the incoming message M_vt is scaled by the corresponding edge appearance probability μ_e(v, t). [sent-163, score-0.313]
72 Third of all, in sharp contrast to standard [3] and attenuated [4] max-product updates, the update of the message M_ts — that is, from t to s along edge (s, t) — depends on the message M_st in the reverse direction, from s to t along the same edge. [sent-164, score-0.532]
73 Despite these differences, the messages can be updated synchronously as in ordinary max-product. [sent-165, score-0.204]
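The three differences just listed can be made concrete in a toy implementation. The sketch below is our reading of the reweighted update (the symbols rho, theta and M, the example graph, and the exact form of the update are assumptions on our part, since equation (9) is not legible in this extraction): the pairwise potential is rescaled by the inverse edge appearance probability, incoming messages are weighted by their edge appearance probabilities, and the reverse-direction message enters each update.

```python
import numpy as np

n_states = 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                  # single cycle on four nodes
rho = {e: 0.75 for e in edges}                             # edge appearance probabilities
rng = np.random.default_rng(2)
theta_node = rng.normal(size=(4, n_states))                # node parameters (example values)
theta_edge = {e: rng.normal(size=(n_states, n_states)) for e in edges}

def neighbors(t):
    return [a if b == t else b for (a, b) in edges if t in (a, b)]

def edge_params(s, t):
    """Edge parameter table indexed as [x_s, x_t], plus its edge appearance probability."""
    if (s, t) in theta_edge:
        return theta_edge[(s, t)], rho[(s, t)]
    return theta_edge[(t, s)].T, rho[(t, s)]

# M[(t, s)][x_s] is the message passed from node t to node s; start from all-ones.
M = {}
for (s, t) in edges:
    M[(s, t)] = np.ones(n_states)
    M[(t, s)] = np.ones(n_states)

def trw_update(M):
    """One synchronous round of the reweighted update (our reading, not the authors' code)."""
    M_new = {}
    for (t, s) in M:
        th, r = edge_params(s, t)                          # th indexed as [x_s, x_t]
        out = np.empty(n_states)
        for xs in range(n_states):
            best = -np.inf
            for xt in range(n_states):
                val = th[xs, xt] / r + theta_node[t, xt]   # reweighted potential + node term
                for v in neighbors(t):
                    if v != s:
                        _, r_vt = edge_params(v, t)
                        val += r_vt * np.log(M[(v, t)][xt])    # incoming messages, scaled
                val -= (1.0 - r) * np.log(M[(s, t)][xt])        # reverse-direction message
                best = max(best, val)
            out[xs] = best
        M_new[(t, s)] = np.exp(out - out.max())            # normalise for numerical stability
    return M_new

for _ in range(50):
    M = trw_update(M)
```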
74 It is also possible to perform reparameterization updates over spanning trees, analogous to but distinct from those for ordinary max-product [5]. [sent-166, score-0.413]
75 Such tree-based updates can be terminated once the trees agree on a common configuration, which may happen prior to message convergence [7]. [sent-167, score-0.369]
76 3 Analysis of fixed points In related work [5], we established the existence of fixed points for the ordinary max-product algorithm for positive compatibility functions on an arbitrary graph. [sent-170, score-0.221]
77 For each spanning tree T, the fixed point defines a tree-structured distribution p(x; θ(T)) via equation (7). [sent-175, score-0.484]
78 By Lemma 2, the elements of this collection are edgewise consistent. [sent-176, score-0.246]
79 By the equivalence of edgewise and global consistency for trees [1], the subcollection of pseudo-max-marginals are exact max-marginals for the tree-structured distribution p(x; θ(T)). [sent-177, score-0.646]
80 As a consequence, the shared configuration must belong to OPT(θ(T)) for each tree T, so that mutual agreement is satisfied. [sent-178, score-0.381]
81 By Lemma 1, the convex combination Σ_T μ(T) θ(T) is equal to θ̄, so that admissibility is satisfied. [sent-179, score-0.278]
82 4 Failures of tree-reweighted max-product In all of our experiments so far, the message updates of equation (9), if suitably relaxed, have always converged. [sent-182, score-0.24]
83 If this assumption is not satisfied, we are no longer guaranteed that the mutual agreement condition holds (the intersection ∩_T OPT(θ(T)) may be empty). [sent-184, score-0.22]
84 Observe that the symmetry of this construction ensures that it satisfies the edgewise consistency condition (Lemma 2) for any choice of the parameters. [sent-194, score-0.399]
85 For each of the three spanning trees of this graph, the collection defines a tree-structured distribution as in equation (7). [sent-195, score-0.544]
86 On the other hand, in the second regime, as illustrated in panel (c), no single configuration can be edgewise optimal for all three edges simultaneously. [sent-199, score-0.387]
87 It should also be noted that, as shown in our related work [5], the standard max-... Footnote 2: In a relaxed message update, we take an α-step towards the new (log) message, where α is the step size parameter. [sent-202, score-0.191]
88 To date, we have not been able to prove that relaxed updates will always converge. [sent-203, score-0.141]
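A minimal sketch of what we take the relaxed update to mean (the names old, new and alpha are ours): step only a fraction alpha towards the newly computed messages in the log domain, with alpha = 1 recovering the plain update.

```python
import numpy as np

def relax(old, new, alpha=0.5):
    """Damped update: log M <- (1 - alpha) * log M_old + alpha * log M_new."""
    return {key: np.exp((1.0 - alpha) * np.log(old[key]) + alpha * np.log(new[key]))
            for key in old}

# Usage with the message tables from the previous sketch:
#   M = relax(M, trw_update(M), alpha=0.5)
```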
89 (b) For one range of the coupling, two configurations are both node and edgewise optimal. [sent-209, score-0.209]
90 (c) For the other range, no configurations are node and edgewise optimal on the full graph. [sent-210, score-0.352]
91 4 Discussion This paper demonstrated the utility of convex combinations of tree-structured distributions in upper bounding the log probability of the MAP configuration. [sent-212, score-0.3]
92 We developed a family of tree-reweighted max-product algorithms for computing optimal upper bounds. [sent-213, score-0.152]
93 In certain cases, the optimal upper bound is met with equality, and hence yields an exact MAP configuration for the original problem on the graph with cycles. [sent-214, score-0.411]
94 An important open question is to characterize the range of problems for which the upper bound is tight. [sent-215, score-0.22]
95 For problems involving a binary-valued random vector, we have isolated a class of problems for which the upper bound is guaranteed to be tight. [sent-216, score-0.252]
96 We have also investigated the Lagrangian dual associated with the upper bound (3). [sent-217, score-0.188]
97 Finally, the analysis and upper bounds of this paper can be extended in a straightforward manner to hypertrees of higher width. [sent-219, score-0.177]
98 In this context, hypertree-reweighted forms of generalized max-product updates [see 5] can again be used to find optimal upper bounds, which (when they are tight) again yield exact MAP configurations. [sent-220, score-0.269]
99 A new class of upper bounds on the log partition function. [sent-240, score-0.208]
100 Tree consistency and bounds on the max-product algorithm and its generalizations. [sent-265, score-0.197]
wordName wordTfidf (topN-words)
[('guration', 0.324), ('cbf', 0.295), ('ye', 0.285), ('edgewise', 0.246), ('tree', 0.208), ('spanning', 0.203), ('con', 0.172), ('hv', 0.171), ('trees', 0.165), ('uniqueness', 0.163), ('wv', 0.153), ('message', 0.144), ('edge', 0.132), ('admissibility', 0.128), ('upper', 0.118), ('convex', 0.115), ('ordinary', 0.114), ('map', 0.111), ('collection', 0.103), ('vv', 0.103), ('gurations', 0.102), ('fge', 0.098), ('sye', 0.098), ('met', 0.098), ('satis', 0.092), ('messages', 0.09), ('condition', 0.086), ('ning', 0.083), ('exponential', 0.074), ('subcollection', 0.074), ('wainwright', 0.073), ('attenuated', 0.073), ('agreement', 0.072), ('node', 0.072), ('bound', 0.07), ('ff', 0.069), ('consistency', 0.067), ('tn', 0.065), ('equality', 0.064), ('mutual', 0.062), ('sn', 0.06), ('updates', 0.06), ('bounds', 0.059), ('exact', 0.057), ('optimality', 0.055), ('tight', 0.054), ('attains', 0.051), ('proposition', 0.05), ('duts', 0.049), ('ihf', 0.049), ('martinw', 0.049), ('nitions', 0.048), ('xed', 0.048), ('de', 0.047), ('relaxed', 0.047), ('ne', 0.046), ('ip', 0.045), ('attempting', 0.043), ('rag', 0.043), ('geb', 0.043), ('ising', 0.043), ('el', 0.042), ('cycle', 0.041), ('ge', 0.041), ('intersection', 0.041), ('ba', 0.04), ('must', 0.039), ('maxproduct', 0.039), ('eecs', 0.039), ('implication', 0.039), ('share', 0.039), ('update', 0.039), ('lemma', 0.038), ('appearance', 0.037), ('let', 0.037), ('distribution', 0.037), ('compatibility', 0.036), ('reparameterization', 0.036), ('posteriori', 0.036), ('equation', 0.036), ('develop', 0.036), ('combinations', 0.036), ('combination', 0.035), ('panel', 0.035), ('satisfying', 0.034), ('cf', 0.034), ('convexity', 0.034), ('optimal', 0.034), ('graph', 0.034), ('prove', 0.034), ('maximum', 0.034), ('jaakkola', 0.034), ('illustrated', 0.033), ('cb', 0.033), ('problems', 0.032), ('algorithm', 0.032), ('fx', 0.031), ('zeros', 0.031), ('nd', 0.031), ('log', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
2 0.19836813 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
3 0.10694863 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
Author: Clayton Scott, Robert Nowak
Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
4 0.097952381 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
Author: Tong Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert function Spaces. We introduce a concept of scalesensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergent rates are optimal under various circumstances.
5 0.097713523 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
Author: Tom Heskes
Abstract: We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications. 1
6 0.09760195 85 nips-2002-Fast Kernels for String and Tree Matching
7 0.09497112 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
8 0.088919856 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
9 0.079170637 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
10 0.078284509 156 nips-2002-On the Complexity of Learning the Kernel Matrix
11 0.076571904 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
12 0.076205678 27 nips-2002-An Impossibility Theorem for Clustering
13 0.073465154 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
14 0.070450604 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
15 0.067265891 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
16 0.065933704 94 nips-2002-Fractional Belief Propagation
17 0.062383626 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
18 0.062076844 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
19 0.060047302 152 nips-2002-Nash Propagation for Loopy Graphical Games
20 0.05613514 174 nips-2002-Regularized Greedy Importance Sampling
topicId topicWeight
[(0, -0.178), (1, -0.067), (2, -0.055), (3, -0.001), (4, -0.014), (5, 0.064), (6, -0.043), (7, 0.039), (8, -0.15), (9, 0.002), (10, 0.086), (11, 0.007), (12, -0.008), (13, -0.076), (14, -0.031), (15, -0.079), (16, 0.131), (17, 0.022), (18, -0.05), (19, -0.186), (20, -0.126), (21, 0.317), (22, -0.116), (23, 0.017), (24, -0.055), (25, -0.05), (26, 0.075), (27, -0.042), (28, 0.129), (29, -0.107), (30, -0.073), (31, -0.055), (32, 0.033), (33, 0.04), (34, 0.021), (35, -0.177), (36, -0.176), (37, -0.051), (38, -0.067), (39, 0.028), (40, -0.016), (41, -0.006), (42, -0.01), (43, 0.171), (44, 0.014), (45, 0.002), (46, 0.099), (47, 0.003), (48, -0.076), (49, 0.088)]
simIndex simValue paperId paperTitle
same-paper 1 0.97240269 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
2 0.82530862 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
3 0.57355052 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
Author: Dan Klein, Christopher D. Manning
Abstract: We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.
4 0.53261352 85 nips-2002-Fast Kernels for String and Tree Matching
Author: Alex J. Smola, S.v.n. Vishwanathan
Abstract: In this paper we present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynamic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner. 1
5 0.47796392 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
Author: Nicholas P. Hughes, David Lowe
Abstract: We consider the problem of illusory or artefactual structure from the visualisation of high-dimensional structureless data. In particular we examine the role of the distance metric in the use of topographic mappings based on the statistical field of multidimensional scaling. We show that the use of a squared Euclidean metric (i.e. the SS TRESS measure) gives rise to an annular structure when the input data is drawn from a highdimensional isotropic distribution, and we provide a theoretical justification for this observation.
6 0.46700564 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
7 0.41309297 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
8 0.40405113 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
9 0.37858272 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
10 0.35225832 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
11 0.3379541 152 nips-2002-Nash Propagation for Loopy Graphical Games
12 0.32174194 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
13 0.31402522 4 nips-2002-A Differential Semantics for Jointree Algorithms
14 0.30326879 107 nips-2002-Identity Uncertainty and Citation Matching
15 0.29684889 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
16 0.27614987 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
17 0.27516782 142 nips-2002-Maximum Likelihood and the Information Bottleneck
18 0.26686394 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
19 0.26450461 174 nips-2002-Regularized Greedy Importance Sampling
20 0.26405755 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes
topicId topicWeight
[(11, 0.024), (23, 0.012), (42, 0.053), (54, 0.148), (55, 0.039), (57, 0.022), (67, 0.014), (68, 0.029), (74, 0.099), (92, 0.085), (97, 0.262), (98, 0.116)]
simIndex simValue paperId paperTitle
same-paper 1 0.80129814 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
2 0.66938275 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks. Statistical Models. Externally, AUTO BAYES has the look and feel of a compiler. Users specify their model of interest in a high-level specification language (as opposed to a programming language). The figure shows the specification of the mixture of Gaussians example used throughout this paper. Note the constraint that the sum of the class probabilities must equal one (line 8), along with others (lines 3 and 5) that make optimization of the model well-defined. Also note the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters phi, mu, and sigma. Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem.
1 model mog as 'Mixture of Gaussians';
2 const int n_points as 'nr. of data points'
3 with 0 < n_points;
4 const int n_classes := 3 as 'nr. classes'
5 with 0 < n_classes
6 with n_classes << n_points;
7 double phi(1..n_classes) as 'weights'
8 with 1 = sum(I := 1..n_classes, phi(I));
9 double mu(1..n_classes);
9 double sigma(1..n_classes);
10 int c(1..n_points) as 'class labels';
11 c ~ disc(vec(I := 1..n_classes, phi(I)));
12 data double x(1..n_points) as 'data';
13 x(I) ~ gauss(mu(c(I)), sigma(c(I)));
14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ;
Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nelder-Mead simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code's comments. Also, n_classes has been set to three (line 4), while n_points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task. Schema-based synthesis. The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian networks. A first-order logic representation of Bayesian networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances of i.i.d. vectors) are represented as functor arguments. Since unknown index values can be represented by implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen's framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node (Lemma 1).
3 0.65728956 27 nips-2002-An Impossibility Theorem for Clustering
Author: Jon M. Kleinberg
Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1
4 0.65647298 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
5 0.65371692 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
6 0.65143758 53 nips-2002-Clustering with the Fisher Score
7 0.64980793 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
8 0.64762408 10 nips-2002-A Model for Learning Variance Components of Natural Images
9 0.64705813 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
10 0.64448279 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
11 0.64446139 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
12 0.64211011 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
13 0.64195478 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
14 0.6410861 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
15 0.64009881 124 nips-2002-Learning Graphical Models with Mercer Kernels
16 0.63976109 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
17 0.63951629 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
18 0.63827717 3 nips-2002-A Convergent Form of Approximate Policy Iteration
19 0.63826305 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
20 0.63770211 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories