nips nips2003 nips2003-189 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuan Qi, Tom Minka
Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Tree-structured approximations by expectation propagation Thomas Minka Department of Statistics Carnegie Mellon University Pittsburgh, PA 15213 USA minka@stat. [sent-1, score-0.249]
2 edu Abstract Approximation structure plays an important role in inference on loopy graphs. [sent-5, score-0.145]
3 As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. [sent-6, score-0.478]
4 However, belief propagation represents each factor of the graph with a product of single-node messages. [sent-8, score-0.369]
5 In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. [sent-9, score-0.886]
6 That is, each factor sends a “message” to all pairs of nodes in a tree structure. [sent-10, score-0.425]
7 The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. [sent-11, score-0.234]
8 1 Introduction An important problem in approximate inference is improving the performance of belief propagation on loopy graphs. [sent-12, score-0.391]
9 Empirical studies have shown that belief propagation (BP) tends not to converge on graphs with strong positive and negative correlations (Welling & Teh, 2001). [sent-13, score-0.327]
10 The expectation propagation (EP) framework (Minka, 2001a) gives another interpretation of BP, as an algorithm which approximates multi-variable factors by single-variable factors ($f(x_1, x_2) \rightarrow \tilde{f}_1(x_1)\tilde{f}_2(x_2)$). [sent-16, score-0.461]
11 The resulting algorithm resembles BP on a graph of node clusters, where again multi-variable factors are decomposed into independent parts ($f(x_1, x_2, x_3) \rightarrow \tilde{f}_1(x_1)\tilde{f}_{23}(x_2, x_3)$). [sent-22, score-0.222]
12 In this paper, the target approximation of BP is enriched by exploiting the connection to expectation propagation. [sent-23, score-0.101]
13 Instead of approximating each factor by disconnected nodes or clusters, it is approximated by a tree distribution. [sent-24, score-0.468]
14 The algorithm is a strict generalization of belief propagation, because if the tree has no edges, then the results are identical to (loopy) belief propagation. [sent-25, score-0.511]
15 For example, Ghahramani & Jordan (1997) showed that tree structured approximations could improve the accuracy of variational bounds. [sent-27, score-0.444]
16 Their method, which sequentially projects graph potentials onto a tree structure, is closely related to expectation propagation and the method in this paper. [sent-31, score-0.68]
17 However, their method is not iterative and therefore sensitive to the order in which the potentials are sequenced. [sent-32, score-0.114]
18 In the first paper, a “message-free” version of BP was derived, which used multiple tree structures to propagate evidence. [sent-36, score-0.411]
19 In the second paper, tree structures were used to obtain an upper bound on the normalizing constant of a Markov network. [sent-38, score-0.346]
20 The following section describes the EP algorithm for updating the potentials of a tree approximation with known structure. [sent-40, score-0.531]
21 Section 3 then describes the method we use to choose the tree structure. [sent-41, score-0.339]
22 2 Updating the tree potentials This section describes an expectation-propagation algorithm to approximate a given distribution (of arbitrary structure) by a tree with known structure. [sent-43, score-0.802]
23 Denote the original distribution by p(x), written as a product of factors: $p(x) = \prod_i f_i(x)$ (1). For example, if p(x) is a Bayesian network or Markov network, the factors are conditional probability distributions or potentials which each depend on a small subset of the variables in x. [sent-47, score-0.555]
24 In this paper, the variables are assumed to be discrete, so that the factors fi (x) are simply multidimensional tables. [sent-48, score-0.412]
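As a concrete illustration of this setup (our own sketch, not code from the paper; the `Factor` class and the random tables are hypothetical), each factor can be stored as a small multidimensional array indexed by the states of the variables it touches:

```python
import numpy as np

class Factor:
    """A non-negative table over a small subset of discrete variables."""
    def __init__(self, variables, table):
        self.variables = tuple(variables)      # e.g. (0, 3) for a pairwise factor
        self.table = np.asarray(table, float)  # one axis per variable

# p(x) = prod_i f_i(x) on four binary nodes, as in the complete graph of figure 1:
# one single-node potential per node and one pairwise potential per edge.
rng = np.random.default_rng(0)
factors = [Factor((j,), rng.random(2)) for j in range(4)]
factors += [Factor((j, k), rng.random((2, 2)))
            for j in range(4) for k in range(j + 1, 4)]
```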
25 A useful way to organize these divisions is to construct a junction tree connecting the cliques (j, k) ∈ T (Jensen et al. [sent-53, score-0.607]
26 This tree has a different structure than T —the nodes in the junction tree represent cliques in T , and the edges in the junction tree represent variables which are shared between cliques. [sent-55, score-1.609]
27 These separator variables S in the junction tree are the variables shared between adjacent cliques. (Figure 1: Approximating a complete graph p by a tree q; the panels show p, q, and the junction tree of q.) [sent-56, score-1.58]
28 The junction tree of q is used to organize computations. [sent-57, score-0.537]
29 We want to approximate the distribution p(x), which has a complete graph, by q(x), whose graph is a spanning tree. [sent-61, score-0.148]
30 The marginal representation of q can be directly read off of the junction tree: $q(x) = \prod_{(j,k) \in T} q(x_j, x_k) \,/\, \prod_{s \in S} q(x_s)$ (2). [sent-62, score-0.26]
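Judging from the cliques named later ((1,4), (2,4), (3,4)), the tree of figure 1 is a star on node 4. A minimal sketch of evaluating (2) for that case, with uniform placeholder marginals and data structures of our own:

```python
import numpy as np

# Tree of figure 1: T = {(1,4), (2,4), (3,4)}.  Its junction tree has cliques
# {1,4}, {2,4}, {3,4} joined by two separators, both over x4, giving
#   q(x) = q(x1,x4) q(x2,x4) q(x3,x4) / q(x4)^2.
clique_marg = {(1, 4): np.full((2, 2), 0.25),
               (2, 4): np.full((2, 2), 0.25),
               (3, 4): np.full((2, 2), 0.25)}
sep_marg = [(4, np.full(2, 0.5)), (4, np.full(2, 0.5))]

def q_of(x):
    """Evaluate q(x) from (2): clique marginals divided by separator marginals."""
    num = np.prod([t[x[j], x[k]] for (j, k), t in clique_marg.items()])
    den = np.prod([t[x[s]] for s, t in sep_marg])
    return num / den

print(q_of({1: 0, 2: 1, 3: 0, 4: 1}))   # 0.25**3 / 0.5**2 = 0.0625
```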
31 Specifically, q tries to preserve the marginals and pairwise marginals of p: $q(x_j) \approx p(x_j)$ (4) and $q(x_j, x_k) \approx p(x_j, x_k)$ for $(j, k) \in T$ (5). Expectation propagation is a general framework for approximating distributions of the form (1) by approximating the factors one by one. [sent-64, score-0.924]
32 The final approximation q is then the product of the approximate factors. [sent-65, score-0.117]
33 The functional form of the approximate factors is determined by considering the ratio of two different q’s. [sent-66, score-0.169]
34 In our case, this leads to approximations of the form $f_i(x) \approx \tilde{f}_i(x) = \prod_{(j,k) \in T} \tilde{f}_i(x_j, x_k) \,/\, \prod_{s \in S} \tilde{f}_i(x_s)$ (6). A product of such factors gives a distribution of the desired form (2). [sent-67, score-1.343]
35 Note that $\tilde{f}_i(x_j, x_k)$ is not a proper marginal distribution, but just a non-negative function of two variables. [sent-68, score-0.455]
36 The algorithm starts by initializing the clique and separator potentials on the junction tree to 1. [sent-69, score-0.841]
37 If a factor in p only depends on one variable, or variables which are adjacent in T , then its approximation is trivial. [sent-70, score-0.162]
38 It can be multiplied into the corresponding clique potential right away and removed from further consideration. [sent-71, score-0.126]
39 The remaining factors in p, the off-tree factors, have their approximations $\tilde{f}_i$ initialized to 1. [sent-72, score-0.423]
40 Suppose all the potentials in p are pairwise, one for each edge. [sent-74, score-0.114]
41 The algorithm then iteratively passes through the off-tree factors in p, performing the following three steps until all $\tilde{f}_i$ converge: (a) Deletion. [sent-77, score-0.369]
42 Remove $\tilde{f}_i$ from q to get an 'old' approximation $q^{\setminus i}$: $q^{\setminus i}(x_j, x_k) = q(x_j, x_k) / \tilde{f}_i(x_j, x_k)$ for $(j, k) \in T$ (7), and $q^{\setminus i}(x_s) = q(x_s) / \tilde{f}_i(x_s)$ for $s \in S$ (8). (b) Incorporate evidence. [sent-78, score-1.309]
43 Form the product $f_i(x)\, q^{\setminus i}(x)$, by considering $f_i(x)$ as 'evidence' for the junction tree. [sent-79, score-0.489]
44 Propagate the evidence to obtain new clique marginals $q(x_j, x_k)$ and separators $q(x_s)$ (details below). [sent-80, score-0.561]
45 (c) Update. Re-estimate $\tilde{f}_i$ by division: $\tilde{f}_i(x_j, x_k) = q(x_j, x_k) / q^{\setminus i}(x_j, x_k)$ for $(j, k) \in T$ (9), and $\tilde{f}_i(x_s) = q(x_s) / q^{\setminus i}(x_s)$ for $s \in S$ (10).
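The loop can be summarized in a few lines. The sketch below is schematic and ours alone: `divide`, `ones_like` and `propagate_with_evidence` (the "incorporate evidence" step described next) are assumed helpers supplied by the caller, and a fixed number of sweeps stands in for a proper convergence test on the $\tilde{f}_i$.

```python
def tree_ep_sweeps(off_tree_factors, q_init, divide, ones_like,
                   propagate_with_evidence, n_sweeps=20):
    """Schematic EP sweep over the off-tree factors (hypothetical helpers).

    q_init: clique and separator marginals of the tree approximation.
    divide(q, f): pointwise division of q's tables by f's, as in (7)-(8).
    ones_like(q): a factor approximation with every table set to 1.
    propagate_with_evidence(q_old, f_i): the 'incorporate evidence' step.
    """
    q = q_init
    approx = {i: ones_like(q) for i in off_tree_factors}
    for _ in range(n_sweeps):
        for i, f_i in off_tree_factors.items():
            q_old = divide(q, approx[i])             # (a) deletion, eqs (7)-(8)
            q = propagate_with_evidence(q_old, f_i)  # (b) incorporate evidence
            approx[i] = divide(q, q_old)             # (c) update, eqs (9)-(10)
    return q, approx
```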
46 2.3 Incorporating evidence by cutset conditioning. The purpose of the "incorporate evidence" step is to find a distribution q minimizing $\mathrm{KL}(f_i(x)\, q^{\setminus i}(x) \,\|\, q(x))$. [sent-83, score-0.578]
47 This is equivalent to matching the marginal distributions corresponding to each clique in q. [sent-84, score-0.162]
48 By definition, $f_i$ depends on a set of variables which are not adjacent in T, so the graph structure corresponding to $f_i(x)\, q^{\setminus i}(x)$ is not a tree, but has one or more loops. [sent-85, score-0.638]
49 One approach is to apply a generic exact inference algorithm to $f_i(x)\, q^{\setminus i}(x)$ to obtain the desired marginals, e.g. [sent-86, score-0.268]
50 construct a new junction tree in which $f_i(x)$ is a clique and propagate evidence in this tree. [sent-88, score-1.1]
51 But this does not exploit the fact that we already have a junction tree for $q^{\setminus i}$ on which we can perform efficient inference. [sent-89, score-0.537]
52 The domain of $f_i(x)$ is the set of all possible assignments to V, the set of variables that $f_i$ depends on. [sent-92, score-0.236]
53 Find the clique (j, k) ∈ T which has the largest overlap with this domain—call this the root clique. [sent-93, score-0.175]
54 Then enumerate the rest of the domain $V \setminus \{x_j, x_k\}$. [sent-94, score-0.183]
55 For each possible assignment to these variables, enter it as evidence in q’s junction tree and propagate to get marginals and an overall scale factor (which is the probability of that assignment). [sent-95, score-0.843]
56 When the variables $V \setminus \{x_j, x_k\}$ are fixed, entering evidence simply reduces to zeroing out conflicting entries in the junction tree, and multiplying the root clique (j, k) by $f_i(x)$. [sent-96, score-0.997]
57 After propagating evidence multiple times, average the results together according to their scale factors, to get the final marginals and separators of q. [sent-97, score-0.252]
58 Continuing the example of figure 1, suppose we want to process edge (1, 2), whose factor is f1 (x1 , x2 ). [sent-98, score-0.093]
59 In one case, the evidence is (x1 = 0, f1 (0, x2 )) and in the other it is (x1 = 1, f1 (1, x2 )). [sent-103, score-0.136]
60 Propagating evidence for both cases and averaging the results gives the new junction tree potentials. [sent-104, score-0.673]
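A schematic of this conditioning loop (ours, with hypothetical helper names): `propagate` is assumed to run exact inference in q's junction tree with the given variables clamped and the root clique multiplied by $f_i$, returning the resulting marginals and the scale factor of that case.

```python
from itertools import product

def incorporate_evidence(f_i, domain_vars, root_clique, q_old, propagate,
                         n_states=2):
    """Cutset-conditioning sketch: enumerate the off-root variables, propagate
    each case, then average the marginals weighted by their scale factors."""
    extra = [v for v in domain_vars if v not in root_clique]
    cases = []
    for values in product(range(n_states), repeat=len(extra)):
        evidence = dict(zip(extra, values))
        marginals, scale = propagate(q_old, evidence, f_i, root_clique)
        cases.append((scale, marginals))
    total = sum(scale for scale, _ in cases)
    return {key: sum(scale * m[key] for scale, m in cases) / total
            for key in cases[0][1]}
```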
61 2.4 Within-loop propagation. A further optimization is also used: evidence does not need to be propagated to the whole junction tree. [sent-109, score-0.577]
62 In particular, it only needs to be propagated within the subtree that connects the nodes in V. [sent-110, score-0.146]
63 Evidence propagated to the rest of the tree will be exactly canceled by the separators, so even though the potentials may change, the ratios in (2) will not. [sent-111, score-0.498]
64 For example, when we process edge (1, 2) in figure 1, there is no need to propagate evidence to clique (3, 4), because when q(x3 , x4 ) is divided by the separator q(x4 ), we have q(x3 |x4 ) which is the same before and after the evidence. [sent-112, score-0.422]
65 Thus evidence is propagated as follows: first collect evidence from V to the root, then distribute evidence from the root back to V, bypassing the rest of the tree (these operations are defined formally by Jensen et al. [sent-113, score-0.941]
66 In the example, this means we collect evidence from clique (1, 4) to the root (2, 4), then distribute back from (2, 4) to (1, 4), ignoring (3, 4). [sent-115, score-0.377]
67 This simplification also means that we don't need to store $\tilde{f}_i$ for the cliques that are never updated by factor i. [sent-116, score-0.309]
68 When moving to the next factor, once we’ve designated the root for that factor, we collect evidence from the previous root. [sent-117, score-0.221]
69 In this way, the results are the same as if we always propagated evidence to the whole junction tree. [sent-118, score-0.431]
70 3 Choosing the tree structure This section describes a simple method to choose the tree structure. [sent-119, score-0.687]
71 The approach is based on Chow & Liu (1968): estimate the mutual information between adjacent nodes in p’s graph, call this the ‘weight’ of the edge between them, and then find the spanning tree with maximal total weight. [sent-122, score-0.475]
72 In our implementation, this is obtained from the product of factors involving only these two nodes, i.e. the single-node potentials at j and k and the potential on the edge between them. [sent-124, score-0.162]
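A sketch of this structure choice, in the spirit of Chow & Liu (our own code, with hypothetical inputs): each candidate edge gets the mutual information of the normalized product of the factors involving only its two nodes, and a maximum-weight spanning tree is then built greedily.

```python
import numpy as np

def edge_mutual_information(pair_table):
    """MI of the pairwise distribution obtained by normalizing the product of
    the factors that involve only these two nodes."""
    p = np.asarray(pair_table, float)
    p = p / p.sum()
    pj = p.sum(axis=1, keepdims=True)
    pk = p.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(np.nansum(p * np.log(p / (pj * pk))))

def max_spanning_tree(n_nodes, weights):
    """Maximum-weight spanning tree via Kruskal's algorithm with union-find."""
    parent = list(range(n_nodes))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    tree = []
    for (j, k), _ in sorted(weights.items(), key=lambda e: -e[1]):
        rj, rk = find(j), find(k)
        if rj != rk:
            parent[rj] = rk
            tree.append((j, k))
    return tree
```

Edge weights would be assembled as `weights = {(j, k): edge_mutual_information(table_jk) for ...}` and passed to `max_spanning_tree`.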
73 The network and approximation will be the ones pictured in figure 1, with all nodes binary. [sent-130, score-0.127]
74 The potentials were chosen randomly and can be obtained from the authors’ website. [sent-131, score-0.114]
75 The proposed method (TreeEP) used the tree structure specified in figure 1. [sent-133, score-0.348]
76 Mean-field (MF) fit a variational bound with independent variables, and TreeVB fit a tree-structured variational bound, with the same structure as TreeEP. [sent-134, score-0.189]
77 TreeVB was implemented using the general method described by Wiegerinck (2000), with the same junction tree optimizations as in TreeEP. [sent-135, score-0.537]
78 Generalized belief propagation (GBP) was implemented using the parent-child algorithm of Yedidia et al. [sent-136, score-0.279]
79 We also used GBP to perform ordinary loopy belief propagation (BP). [sent-138, score-0.355]
80 Table 1: Node means estimated by various methods (TreeEP = the proposed method, BP = loopy belief propagation, GBP = generalized belief propagation on triangles, MF = mean-field, TreeVB = variational tree). [sent-172, score-0.499]
81 At each iteration, we then get an error bound, which is the maximum error from that iteration onwards. [sent-178, score-0.084]
82 The first iteration whose error bound is within 5% of the final error is chosen for the official FLOP count. [sent-179, score-0.084]
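For concreteness, the selection of the official iteration can be written as below (our reading of the rule; the 5% tolerance is interpreted multiplicatively, which is an assumption):

```python
def official_iteration(errors, tol=0.05):
    """First iteration whose error bound (the maximum error from that
    iteration onwards) is within `tol` of the final error."""
    final = errors[-1]
    for i in range(len(errors)):
        if max(errors[i:]) <= (1.0 + tol) * final:
            return i
    return len(errors) - 1
```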
83 4.2 Complete graphs. The next experiment tests the algorithms on complete graphs of varying size. [sent-186, score-0.136]
84 The graphs have random single-node and pairwise potentials, of the form $f_i(x_j) = [\exp(\theta_j)\ \ \exp(-\theta_j)]$ and $f_i(x_j, x_k) = \begin{bmatrix} \exp(w_{jk}) & \exp(-w_{jk}) \\ \exp(-w_{jk}) & \exp(w_{jk}) \end{bmatrix}$. [sent-187, score-0.74]
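The setup can be sketched as follows (our own code; the Gaussian draws for $\theta_j$ and $w_{jk}$ and their scales are assumptions, since the distributions are only referred to indirectly here):

```python
import numpy as np

def random_complete_graph(n, theta_std=1.0, w_std=1.0, seed=0):
    """Random single-node and pairwise potentials of the form above, on a
    complete graph with n binary nodes (draw distributions assumed Gaussian)."""
    rng = np.random.default_rng(seed)
    single = {j: np.array([np.exp(t), np.exp(-t)])
              for j, t in enumerate(rng.normal(0.0, theta_std, n))}
    pair = {}
    for j in range(n):
        for k in range(j + 1, n):
            w = rng.normal(0.0, w_std)
            pair[(j, k)] = np.array([[np.exp(w), np.exp(-w)],
                                     [np.exp(-w), np.exp(w)]])
    return single, pair
```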
85 Figure 2(a) shows the approximation error as n increases. [sent-191, score-0.081]
86 For each n, 10 different potentials were drawn, giving 110 networks in all. [sent-192, score-0.114]
87 These errors are averaged over potentials and shown separately for each graph size. [sent-194, score-0.172]
88 The external fields θj were drawn as before, and the couplings wjk had standard deviation 1. [sent-218, score-0.126]
89 The GBP clusters were overlapping squares, as in Yedidia et al. [sent-219, score-0.103]
90 TreeVB performs consistently worse than BP, even though it is using the same tree structures as TreeEP. [sent-222, score-0.346]
91 We have hand-crafted tree structures that perform just as well on grids, but for simplicity we do not include these results. [sent-224, score-0.346]
92 5 Conclusions Tree approximation allows a smooth tradeoff between cost and accuracy in approximate inference. [sent-227, score-0.088]
93 TreeEP performs better than the corresponding variational bounds, because it minimizes the inclusive KL-divergence. [sent-230, score-0.152]
94 We hope that these results encourage more investigation into approximation structure for inference algorithms, such as finding the ‘optimal’ structure for a given problem. [sent-232, score-0.154]
95 There are many other opportunities for special approximation structure to be exploited, especially in hybrid networks, where not only do the independence assumptions matter but also the distributional forms. [sent-233, score-0.087]
96 Sequentially fitting inclusive trees for inference in noisy-OR networks. [sent-248, score-0.133]
97 Expectation propagation for approximate inference in dynamic Bayesian networks. [sent-259, score-0.214]
98 Belief optimization for binary networks: A stable alternative to loopy belief propagation. [sent-318, score-0.177]
99 Variational approximations between mean field theory and the junction tree algorithm. [sent-322, score-0.591]
100 Constructing free energy approximations and generalized belief propagation algorithms (Technical Report). [sent-338, score-0.299]
wordName wordTfidf (topN-words)
[('treeep', 0.388), ('tree', 0.313), ('treevb', 0.28), ('gbp', 0.27), ('bp', 0.266), ('fi', 0.236), ('junction', 0.224), ('xk', 0.183), ('flops', 0.15), ('propagation', 0.146), ('xs', 0.138), ('evidence', 0.136), ('factors', 0.133), ('clique', 0.126), ('potentials', 0.114), ('xj', 0.111), ('belief', 0.099), ('wjk', 0.094), ('minka', 0.091), ('welling', 0.078), ('loopy', 0.078), ('variational', 0.077), ('teh', 0.076), ('inclusive', 0.075), ('nodes', 0.075), ('propagated', 0.071), ('star', 0.071), ('clusters', 0.069), ('marginals', 0.068), ('wiegerinck', 0.068), ('propagate', 0.065), ('triangles', 0.065), ('separator', 0.064), ('graph', 0.058), ('yedidia', 0.057), ('approximations', 0.054), ('ep', 0.054), ('graphs', 0.054), ('approximation', 0.052), ('expectation', 0.049), ('wainwright', 0.049), ('root', 0.049), ('separators', 0.048), ('fastest', 0.045), ('grids', 0.044), ('variables', 0.043), ('cutset', 0.043), ('approximating', 0.043), ('kikuchi', 0.043), ('doesn', 0.043), ('mf', 0.043), ('murphy', 0.039), ('heskes', 0.037), ('zoeter', 0.037), ('yuille', 0.037), ('factor', 0.037), ('cliques', 0.036), ('collect', 0.036), ('marginal', 0.036), ('approximate', 0.036), ('jensen', 0.035), ('structure', 0.035), ('et', 0.034), ('proc', 0.034), ('chow', 0.034), ('triples', 0.034), ('stepsize', 0.034), ('willsky', 0.034), ('kappen', 0.034), ('conditioning', 0.033), ('nips', 0.033), ('structures', 0.033), ('edges', 0.033), ('frey', 0.032), ('damping', 0.032), ('couplings', 0.032), ('inference', 0.032), ('ordinary', 0.032), ('node', 0.031), ('pairwise', 0.031), ('edge', 0.031), ('kl', 0.031), ('distribute', 0.03), ('toolbox', 0.03), ('adjacent', 0.03), ('jaakkola', 0.03), ('gure', 0.029), ('error', 0.029), ('product', 0.029), ('converge', 0.028), ('complete', 0.028), ('tries', 0.026), ('squares', 0.026), ('trees', 0.026), ('spanning', 0.026), ('iteration', 0.026), ('updating', 0.026), ('ghahramani', 0.026), ('describes', 0.026), ('suppose', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 189 nips-2003-Tree-structured Approximations by Expectation Propagation
Author: Yuan Qi, Tom Minka
Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1
2 0.26664162 152 nips-2003-Pairwise Clustering and Graphical Models
Author: Noam Shental, Assaf Zomet, Tomer Hertz, Yair Weiss
Abstract: Significant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation which makes it difficult to automatically set parameters using training data. In this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data. 1
3 0.24693219 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
4 0.19757882 169 nips-2003-Sample Propagation
Author: Mark A. Paskin
Abstract: Rao–Blackwellization is an approximation technique for probabilistic inference that flexibly combines exact inference with sampling. It is useful in models where conditioning on some of the variables leaves a simpler inference problem that can be solved tractably. This paper presents Sample Propagation, an efficient implementation of Rao–Blackwellized approximate inference for a large class of models. Sample Propagation tightly integrates sampling with message passing in a junction tree, and is named for its simple, appealing structure: it walks the clusters of a junction tree, sampling some of the current cluster’s variables and then passing a message to one of its neighbors. We discuss the application of Sample Propagation to conditional Gaussian inference problems such as switching linear dynamical systems. 1
5 0.19347258 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation
Author: Chen Yanover, Yair Weiss
Abstract: Loopy belief propagation (BP) has been successfully used in a number of difficult graphical models to find the most probable configuration of the hidden variables. In applications ranging from protein folding to image analysis one would like to find not just the best configuration but rather the top M . While this problem has been solved using the junction tree formalism, in many real world problems the clique size in the junction tree is prohibitively large. In this work we address the problem of finding the M best configurations when exact inference is impossible. We start by developing a new exact inference algorithm for calculating the best configurations that uses only max-marginals. For approximate inference, we replace the max-marginals with the beliefs calculated using max-product BP and generalized BP. We show empirically that the algorithm can accurately and rapidly approximate the M best configurations in graphs with hundreds of variables. 1
6 0.17917728 32 nips-2003-Approximate Expectation Maximization
7 0.15309238 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
8 0.14737134 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
9 0.14322023 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
10 0.12380782 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
11 0.094631761 30 nips-2003-Approximability of Probability Distributions
12 0.093682118 172 nips-2003-Semi-Supervised Learning with Trees
13 0.093589701 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
14 0.084762558 193 nips-2003-Variational Linear Response
15 0.084630258 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
16 0.079122335 100 nips-2003-Laplace Propagation
17 0.078463167 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
18 0.075562835 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
19 0.072139621 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
20 0.068330571 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
topicId topicWeight
[(0, -0.216), (1, -0.08), (2, -0.149), (3, 0.352), (4, 0.098), (5, -0.293), (6, 0.048), (7, 0.093), (8, 0.018), (9, 0.081), (10, 0.061), (11, 0.111), (12, -0.001), (13, -0.138), (14, -0.061), (15, -0.029), (16, 0.01), (17, 0.085), (18, -0.066), (19, 0.031), (20, 0.063), (21, 0.036), (22, 0.023), (23, -0.059), (24, 0.076), (25, 0.039), (26, 0.018), (27, -0.028), (28, 0.054), (29, -0.033), (30, 0.043), (31, -0.062), (32, 0.008), (33, 0.064), (34, 0.005), (35, 0.048), (36, -0.194), (37, 0.193), (38, -0.053), (39, -0.091), (40, -0.012), (41, -0.002), (42, 0.056), (43, -0.061), (44, -0.063), (45, 0.015), (46, -0.074), (47, -0.087), (48, -0.012), (49, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.97453463 189 nips-2003-Tree-structured Approximations by Expectation Propagation
Author: Yuan Qi, Tom Minka
Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1
2 0.85160863 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation
Author: Chen Yanover, Yair Weiss
Abstract: Loopy belief propagation (BP) has been successfully used in a number of difficult graphical models to find the most probable configuration of the hidden variables. In applications ranging from protein folding to image analysis one would like to find not just the best configuration but rather the top M . While this problem has been solved using the junction tree formalism, in many real world problems the clique size in the junction tree is prohibitively large. In this work we address the problem of finding the M best configurations when exact inference is impossible. We start by developing a new exact inference algorithm for calculating the best configurations that uses only max-marginals. For approximate inference, we replace the max-marginals with the beliefs calculated using max-product BP and generalized BP. We show empirically that the algorithm can accurately and rapidly approximate the M best configurations in graphs with hundreds of variables. 1
3 0.68819392 152 nips-2003-Pairwise Clustering and Graphical Models
Author: Noam Shental, Assaf Zomet, Tomer Hertz, Yair Weiss
Abstract: Significant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation which makes it difficult to automatically set parameters using training data. In this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data. 1
4 0.63898277 169 nips-2003-Sample Propagation
Author: Mark A. Paskin
Abstract: Rao–Blackwellization is an approximation technique for probabilistic inference that flexibly combines exact inference with sampling. It is useful in models where conditioning on some of the variables leaves a simpler inference problem that can be solved tractably. This paper presents Sample Propagation, an efficient implementation of Rao–Blackwellized approximate inference for a large class of models. Sample Propagation tightly integrates sampling with message passing in a junction tree, and is named for its simple, appealing structure: it walks the clusters of a junction tree, sampling some of the current cluster’s variables and then passing a message to one of its neighbors. We discuss the application of Sample Propagation to conditional Gaussian inference problems such as switching linear dynamical systems. 1
5 0.59412718 32 nips-2003-Approximate Expectation Maximization
Author: Tom Heskes, Onno Zoeter, Wim Wiegerinck
Abstract: We discuss the integration of the expectation-maximization (EM) algorithm for maximum likelihood learning of Bayesian networks with belief propagation algorithms for approximate inference. Specifically we propose to combine the outer-loop step of convergent belief propagation algorithms with the M-step of the EM algorithm. This then yields an approximate EM algorithm that is essentially still double loop, with the important advantage of an inner loop that is guaranteed to converge. Simulations illustrate the merits of such an approach. 1
6 0.5751099 117 nips-2003-Linear Response for Approximate Inference
7 0.47112459 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
8 0.4648582 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
9 0.45706689 30 nips-2003-Approximability of Probability Distributions
10 0.40121102 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
11 0.3805232 172 nips-2003-Semi-Supervised Learning with Trees
12 0.34987897 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
13 0.33851385 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
14 0.33119777 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
15 0.32386678 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
16 0.32185197 100 nips-2003-Laplace Propagation
17 0.31738847 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
18 0.30110782 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
19 0.27835333 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
20 0.26756641 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
topicId topicWeight
[(0, 0.067), (11, 0.022), (29, 0.018), (30, 0.017), (35, 0.094), (41, 0.208), (53, 0.087), (71, 0.107), (76, 0.094), (85, 0.083), (91, 0.073), (99, 0.05)]
simIndex simValue paperId paperTitle
1 0.88513571 140 nips-2003-Nonlinear Processing in LGN Neurons
Author: Vincent Bonin, Valerio Mante, Matteo Carandini
Abstract: According to a widely held view, neurons in lateral geniculate nucleus (LGN) operate on visual stimuli in a linear fashion. There is ample evidence, however, that LGN responses are not entirely linear. To account for nonlinearities we propose a model that synthesizes more than 30 years of research in the field. Model neurons have a linear receptive field, and a nonlinear, divisive suppressive field. The suppressive field computes local root-mean-square contrast. To test this model we recorded responses from LGN of anesthetized paralyzed cats. We estimate model parameters from a basic set of measurements and show that the model can accurately predict responses to novel stimuli. The model might serve as the new standard model of LGN responses. It specifies how visual processing in LGN involves both linear filtering and divisive gain control.
same-paper 2 0.86681998 189 nips-2003-Tree-structured Approximations by Expectation Propagation
Author: Yuan Qi, Tom Minka
Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1
3 0.71315885 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
4 0.69762212 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
5 0.69069219 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
Author: Tong Zhang
Abstract: In this paper we obtain convergence bounds for the concentration of Bayesian posterior distributions (around the true distribution) using a novel method that simplifies and enhances previous results. Based on the analysis, we also introduce a generalized family of Bayesian posteriors, and show that the convergence behavior of these generalized posteriors is completely determined by the local prior structure around the true distribution. This important and surprising robustness property does not hold for the standard Bayesian posterior in that it may not concentrate when there exist “bad” prior structures even at places far away from the true distribution. 1
6 0.68851894 100 nips-2003-Laplace Propagation
7 0.6876955 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
8 0.68617451 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
9 0.68582636 117 nips-2003-Linear Response for Approximate Inference
10 0.68544561 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
11 0.68493265 78 nips-2003-Gaussian Processes in Reinforcement Learning
12 0.68155503 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
13 0.67987078 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
14 0.67742127 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
15 0.67697138 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
16 0.67416537 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
17 0.67410928 145 nips-2003-Online Classification on a Budget
18 0.67343217 41 nips-2003-Boosting versus Covering
19 0.67234379 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
20 0.67172354 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting