nips nips2007 nips2007-193 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gertjan Burghouts, Arnold Smeulders, Jan-mark Geusebroek
Abstract: Assessing similarity between features is a key step in object recognition and scene categorization tasks. We argue that knowledge on the distribution of distances generated by similarity functions is crucial in deciding whether features are similar or not. Intuitively one would expect that similarities between features could arise from any distribution. In this paper, we will derive the contrary, and report the theoretical result that Lp -norms –a class of commonly applied distance metrics– from one feature vector to other vectors are Weibull-distributed if the feature values are correlated and non-identically distributed. Besides these assumptions being realistic for images, we experimentally show them to hold for various popular feature extraction algorithms, for a diverse range of images. This fundamental insight opens new directions in the assessment of feature similarity, with projected improvements in object and scene recognition algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 A. Smeulders and Jan-Mark Geusebroek, Intelligent Systems Lab Amsterdam, Informatics Institute, University of Amsterdam. Abstract: Assessing similarity between features is a key step in object recognition and scene categorization tasks. [sent-4, score-0.494]
2 We argue that knowledge on the distribution of distances generated by similarity functions is crucial in deciding whether features are similar or not. [sent-5, score-0.664]
3 In this paper, we will derive the contrary, and report the theoretical result that Lp -norms –a class of commonly applied distance metrics– from one feature vector to other vectors are Weibull-distributed if the feature values are correlated and non-identically distributed. [sent-7, score-0.789]
4 Besides these assumptions being realistic for images, we experimentally show them to hold for various popular feature extraction algorithms, for a diverse range of images. [sent-8, score-0.338]
5 This fundamental insight opens new directions in the assessment of feature similarity, with projected improvements in object and scene recognition algorithms. [sent-9, score-0.455]
6 In all three major streams of current research - the recognition of known objects [13], assigning an object to a class [8, 24], or assigning a scene to a type [6, 25] - the problem is transposed into the equality of features derived from similarity functions. [sent-11, score-0.489]
7 Hence, besides the issue of feature distinctiveness, comparing two images heavily relies on such similarity functions. [sent-12, score-0.348]
8 We argue that knowledge on the distribution of distances generated by such similarity functions is even more important, as it is that knowledge which is crucial in deciding whether features are similar or not. [sent-13, score-0.691]
9 For example, Nowak and Jurie [21] establish whether one can draw conclusions on two never seen objects based on the similarity distances from known objects. [sent-14, score-0.553]
10 Where they build and traverse a randomized tree to establish region correspondence, one could alternatively use the distribution of similarity distances to establish whether features come from the mode or the tail of the distribution. [sent-15, score-0.762]
11 Better clustering algorithms significantly boost performance for object and scene categorization [12]. [sent-18, score-0.249]
12 Knowledge on the distribution of distances aids in the construction of good clustering algorithms. [sent-19, score-0.39]
13 Usually this is done by measuring invariant feature sets [9, 13, 24] at a predefined raster of regions in the image or at selected key points in the image [11, 13] as extensively evaluated [17]. [sent-23, score-0.459]
14 Then, the most expensive computational step is to compare these feature sets to the feature sets of the reference objects, object classes or scene types. [sent-33, score-0.733]
15 Usually this is done by comparing all entries in the image to all entries in the reference set and selecting the best matching pair. [sent-34, score-0.24]
16 Knowledge of the distribution of similarity distances and having established its parameters enables a remarkable speed-up in the search for matching reference points and hence for matching images. [sent-35, score-0.715]
17 When verifying that a given reference key-point or region is statistically unlikely to occur in this image, one can move on to search in the next image. [sent-36, score-0.212]
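To make this verification idea concrete, the sketch below shows one possible decision rule built on a fitted Weibull distance model; it is an illustration under assumed parameters, not the paper's procedure, and the function name, threshold, and numbers are hypothetical.

```python
import numpy as np
from scipy import stats

def match_is_unlikely(d_best, shape, scale, loc=0.0, alpha=0.01):
    """Given a Weibull model of distances from a reference key-point to arbitrary
    features, flag the best-match distance when it is NOT in the extreme lower
    tail, i.e. when it is no closer than chance matches would be."""
    lower_tail = stats.weibull_min.cdf(d_best, shape, loc=loc, scale=scale)
    return lower_tail > alpha  # True: unlikely to be a true match, move on

# Hypothetical parameters, e.g. obtained from a previous Weibull fit to distances.
print(match_is_unlikely(d_best=430.0, shape=3.2, scale=210.0, loc=250.0))
```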
18 Hence, apart from obtaining theoretical insights in the general distribution of similarities, the results derived in this paper are directly applicable in object and scene recognition. [sent-40, score-0.224]
19 Intuitively one would expect that the set of all similarity values to a key-point or region in an image would assume any distribution. [sent-41, score-0.341]
20 We will prove that under broad conditions the similarity values to a given reference point or region adhere to a class of distributions known as the Weibull distribution family. [sent-44, score-0.598]
21 Our work on density distributions of similarity values compares to the work by Pekalska and Duin [23] assuming a Gaussian distribution for similarities. [sent-48, score-0.3]
22 The second fact is a new result, which links these extreme value statistics to sums of correlated variables [2, 3]. [sent-52, score-0.272]
23 We exploit these two facts in order to derive the distribution family of similarity measures. [sent-53, score-0.218]
24 In Section 2, we overview literature on similarity distances and distance distributions. [sent-55, score-0.564]
25 In Section 3, we discuss the theory of distributions of similarity distances from one to other feature vectors. [sent-56, score-0.728]
26 In Section 4, we validate the resulting distribution experimentally for image feature vectors. [sent-57, score-0.409]
27 1 Related work Similarity distance measures To measure the similarity between two feature vectors, many distance measures have been proposed [15]. [sent-60, score-0.526]
28 The distance from one reference feature vector s to one other feature vector t can be formalized as: $d(s, t) = \left( \sum_{i=1}^{n} |s_i - t_i|^p \right)^{1/p}$, (1) where n and i are the dimensionality and indices of the vectors. [sent-62, score-0.832]
29 Let the random variable Dp represent distances d(s, t) where t is drawn from the random variable T representing feature vectors. [sent-63, score-0.581]
30 Independent of the reference feature vector s, the probability density function of Lp -distances will be denoted by f (Dp = d). [sent-64, score-0.4]
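As a minimal, self-contained illustration of Eq. (1), assuming toy random data rather than the paper's descriptors, the Lp-distances from one reference vector to a set of other vectors can be computed as follows.

```python
import numpy as np

def lp_distances(s, T, p=2):
    """L_p distances of Eq. (1) from a reference feature vector s to each row of T."""
    return (np.abs(T - s) ** p).sum(axis=1) ** (1.0 / p)

# Toy usage: one 128-dimensional reference vector versus 100 other vectors.
rng = np.random.default_rng(0)
s = rng.random(128)
T = rng.random((100, 128))
print(lp_distances(s, T, p=2)[:5])   # samples of D_2
print(lp_distances(s, T, p=1)[:5])   # samples of D_1
```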
31 [7] have considered the Gamma distribution to model the L2 -distances from image regions to one reference region: $f(D_2 = d) = \frac{1}{\beta^{\gamma}\,\Gamma(\gamma)}\, d^{\gamma-1} e^{-d/\beta}$, where γ is the shape parameter, and β the scale parameter; Γ(·) denotes the Gamma function. [sent-67, score-0.397]
32 Although the distribution fits were shown to represent the region distances to some extent, the method lacks a theoretical motivation. [sent-69, score-0.427]
33 Based on the central limit theorem, Pekalska and Duin [23] assumed that Lp -distances between feature vectors are normally distributed: $f(D_p = d) = \frac{1}{\sqrt{2\pi}\,\beta}\, e^{-(d^2/\beta^2)/2}$. [sent-70, score-0.344]
34 As the authors argue, the use of the central limit theorem is theoretically justified if the feature values are independent, identically distributed, and have limited variance. [sent-71, score-0.289]
35 Although feature values generally have limited variance, unfortunately, they cannot be assumed to be independent and/or identically distributed as we will show below. [sent-72, score-0.345]
36 3 Contribution of this paper Our contribution is the theoretical derivation of a parameterized distribution for Lp -norm distances between feature vectors. [sent-75, score-0.561]
37 In the experiments, we establish whether distances to image features indeed adhere to this distribution. [sent-76, score-0.704]
38 3 Statistics of distances between features In this section, we derive the distribution function family of Lp -distances from a reference feature vector to other feature vectors. [sent-78, score-1.053]
39 We consider the notation as used in the previous section, with t a feature vector drawn from the random variable T . [sent-79, score-0.254]
40 For each vector t, we consider the value at index i, ti , resulting in a random variable Ti . [sent-80, score-0.199]
41 The value of the reference vector at index i, si , can be interpreted as a sample of the random variable Ti . [sent-81, score-0.255]
42 The computation of distances from one to other vectors involves manipulations of the random variable Ti resulting in a new random variable: Xi = |si −Ti |p . [sent-82, score-0.445]
43 Furthermore, the computation of the distances D requires the summation of I random variables, and a reparameterization: $D = \left( \sum_{i=1}^{I} X_i \right)^{1/p}$. [sent-83, score-0.326]
44 1 Statistics of sums As a starting point to derive the Lp -distance distribution function, we consider a lemma from statistics about the sum of random variables. [sent-86, score-0.189]
45 Lemma 1 For non-identical and correlated random variables Xi, the sum $\sum_{i=1}^{N} X_i$, with finite N, is distributed according to the generalized extreme value distribution, i. [sent-87, score-0.272]
46 This lemma is important for our purposes, as later the feature values will turn out to be non-identical and correlated indeed. [sent-94, score-0.39]
47 This lemma is equally important to our purpose, as later the feature values will turn out to be upper-bounded indeed. [sent-101, score-0.273]
48 The combination of Lemmas 1 and 2 yields the distribution of sums of non-identical, correlated and upper-bounded random variables, summarized in the following theorem. [sent-102, score-0.198]
49 Theorem 1 is the starting point to derive the distribution of Lp -norms from one reference vector to other feature vectors. [sent-110, score-0.442]
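A simple numerical probe of Theorem 1 is to simulate correlated, non-identically distributed, upper-bounded variables, sum them, and fit a three-parameter Weibull. The sketch below is an illustration only; the particular generating mechanism (correlated Gaussians pushed through per-index sigmoids) is an assumption of the sketch, not the paper's construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_samples = 32, 5000

# Correlated latent Gaussians, pushed through per-index sigmoids so the X_i
# are non-identically distributed, mutually correlated and upper-bounded.
rho = 0.4
cov = rho * np.ones((n, n)) + (1.0 - rho) * np.eye(n)
Z = rng.multivariate_normal(np.zeros(n), cov, size=n_samples)
slopes = np.linspace(0.5, 2.0, n)             # non-identical marginals
X = 1.0 / (1.0 + np.exp(-slopes * Z))         # bounded in (0, 1)

Y = X.sum(axis=1)                             # the sum considered in Theorem 1
shape, loc, scale = stats.weibull_min.fit(Y)  # three-parameter Weibull fit
print(f"shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}")
```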
50 2 Lp -distances from one to other feature vectors Theorem 1 states that Y is Weibull-distributed, given that {Xi = |si − Ti |p }i∈[1,. [sent-112, score-0.29]
51 We argue that the random variables Xi = |si − Ti |p and Xj (i ≠ j) are indeed non-identical, correlated and upper-bounded random variables when considering a set of values extracted from feature vectors at indices i and j: • Xi and Xj are upper-bounded. [sent-120, score-0.535]
52 Hence, for general feature vectors, the values at index i, Ti , are finite. [sent-122, score-0.233]
53 Corollary 1 For finite-length feature vectors with non-identical, correlated and upper-bounded values, Lp distances, for limited p, from one reference feature vector to other feature vectors adhere to the Weibull distribution. [sent-131, score-1.212]
54 3 Extending the class of features We extend the class of features for which the distances are Weibull-distributed. [sent-133, score-0.498]
55 We denote the PCA transform g(·) applied to a single feature vector as s′ = g(s). [sent-135, score-0.226]
56 The experimental verification of the assumption that PCA-transformed feature values Ti′ and Tj′, i ≠ j, are non-identically distributed is postponed to Section 4. [sent-138, score-0.348]
57 Our point here is that we originally assumed correlating feature values, but after the decorrelating PCA transform we are no longer dealing with correlated feature values Ti′ and Tj′. [sent-140, score-0.648]
58 Corollary 2 For finite-length feature vectors with non-identical, correlated and upper-bounded values, and for PCA-transformations thereof, Lp distances, for limited p, from one to other features adhere to the Weibull distribution. [sent-149, score-0.635]
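A sketch of the setting of Corollary 2, with synthetic correlated vectors standing in for real descriptors (an assumption of the sketch): decorrelate the vectors by PCA and fit a Weibull to the L2 distances in the transformed space.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic correlated "feature vectors" standing in for real descriptors.
T = rng.random((2000, 64)) @ rng.random((64, 64))

# PCA (decorrelating rotation) via SVD of the centred data.
Tc = T - T.mean(axis=0)
_, _, Vt = np.linalg.svd(Tc, full_matrices=False)
T_pca = Tc @ Vt.T

s = T_pca[0]                                      # reference vector
d = np.sqrt(((T_pca[1:] - s) ** 2).sum(axis=1))   # L2 distances in PCA space
c, loc, scale = stats.weibull_min.fit(d)
print(f"Weibull fit in PCA space: shape={c:.2f}, scale={scale:.2f}")
```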
59 4 Heterogeneous feature vector data We extend the corollary to hold also for composite datasets of feature vectors. [sent-151, score-0.584]
60 Consider the composite dataset modelled by random variables {Tt }, where each random variable Tt represents nonidentical and correlated feature values. [sent-152, score-0.474]
61 Hence, from Corollary 2 it follows that distances to feature vectors from each of the Tt can be fitted by a Weibull function f_{β,γ}(d). [sent-153, score-0.29]
62 However, the distances to each of the Tt may have a different range and mode, as we will verify by experimentation in Section 4. [sent-154, score-0.392]
63 For heterogeneous distance data {Tt }, we obtain a mixture of Weibull functions [14]. [sent-156, score-0.209]
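A minimal sketch of such a Weibull mixture density, with illustrative weights, shapes and scales rather than parameters fitted in the paper:

```python
import numpy as np
from scipy import stats

def weibull_mixture_pdf(d, weights, shapes, scales):
    """Density of a Weibull mixture, f(d) = sum_k w_k f_{beta_k, gamma_k}(d)."""
    pdf = np.zeros_like(d, dtype=float)
    for w, c, s in zip(weights, shapes, scales):
        pdf += w * stats.weibull_min.pdf(d, c, scale=s)
    return pdf

# Two components with well-separated scales give a bimodal shape as in Fig. 2c.
d = np.linspace(0.0, 12.0, 500)
f = weibull_mixture_pdf(d, weights=[0.6, 0.4], shapes=[2.5, 4.0], scales=[2.0, 7.0])
```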
64 We include these features for two reasons: a) they perform well for realistic computer vision tasks, and b) they provide different mechanisms to describe an image region [17]. [sent-159, score-0.331]
65 The region features are computed from regions detected by the Harris-affine and Hessian-affine detectors, maximally stable regions (MSER), and intensity extrema-based regions (IBR) [18]. [sent-160, score-0.373]
66 We consider distances from 1 randomly drawn reference vector to 100 other randomly drawn feature vectors, which we repeat 1,000 times for generalization. [sent-163, score-0.699]
67 2 to show typical distributions of distances between features taken from single images. [sent-166, score-0.466]
68 1 Validation of the corollary assumptions for image features Intrinsic feature assumptions Corollary 2 rests on a few explicit assumptions. [sent-170, score-0.561]
69 We consider a set of feature vectors Tj and the differences at index i to a reference vector s: Xi = |si − Tji |p . [sent-173, score-0.505]
70 We establish the percentage of significantly correlating differences at a confidence level of 0. [sent-175, score-0.195]
71 We report for each feature the average percentage of difference values that correlate significantly with difference values at another feature vector index. [sent-177, score-0.525]
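One way to compute such a percentage, sketched here with an assumed two-sided Pearson test at alpha = 0.05 and synthetic data (the paper's exact test and confidence level may differ):

```python
import numpy as np
from scipy import stats

def percent_correlating_differences(X, alpha=0.05):
    """Percentage of index pairs (i, j), i != j, for which the difference values
    X_i = |s_i - T_ji|^p correlate significantly (two-sided Pearson test)."""
    n = X.shape[1]
    hits, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            _, p_value = stats.pearsonr(X[:, i], X[:, j])
            hits += p_value < alpha
            total += 1
    return 100.0 * hits / total

# Toy usage with correlated synthetic vectors (not the paper's descriptors).
rng = np.random.default_rng(3)
T = rng.random((500, 20)) @ rng.random((20, 20))
s = rng.random(20)
X = np.abs(s - T) ** 2
print(f"{percent_correlating_differences(X):.1f}% of index pairs correlate significantly")
```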
72 We have illustrated that feature value differences are significantly correlated and significantly non-identically distributed. [sent-194, score-0.357]
73 We conclude that the assumptions of Corollary 2 about properties of feature vectors are realistic in practice, and that Weibull functions are expected to fit distance distributions well. [sent-195, score-0.522]
74 2 Inter-feature assumptions In Corollary 3, we have assumed that distances from one to other feature vectors are described well by a mixture of Weibulls, if the features are taken from different clusters in the data. [sent-198, score-0.832]
75 Here, we illustrate that clusters of feature vectors, and clusters of distances, occur in practice. [sent-199, score-0.199]
76 The distances are described well by a single Weibull distribution. [sent-201, score-0.326]
77 The same holds for distances from one region to other regions computed from a man-made object; see Figure 2b. [sent-202, score-0.4]
78 In Figure 2c, we illustrate the distances from one region to other regions computed from a composite image containing two types of regions. [sent-203, score-0.551]
79 This results in two modalities of feature vectors and hence of similarity distances. [sent-204, score-0.466]
80 The distance distribution is therefore bimodal, illustrating the general case of multimodality to be expected in realistic, heterogeneous image data. [sent-205, score-0.276]
81 We conclude that the assumptions of Corollary 3 are realistic in practice, and that the Weibull function, or a mixture, fits distance distributions well. [sent-206, score-0.232]
82 2 Validation of Weibull-shaped distance distributions In this experiment, we validate the fitting of Weibull distributions of distances from one reference feature vector to other vectors. [sent-208, score-0.927]
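A minimal sketch of such a validation loop, with synthetic vectors in place of the real descriptors and a Kolmogorov-Smirnov test against the fitted Weibull standing in for the paper's goodness-of-fit criterion; note that testing against parameters estimated on the same sample is only an approximation, so the acceptance rate is indicative rather than exact.

```python
import numpy as np
from scipy import stats

def weibull_fit_rate(T, n_other=100, n_repeats=200, p=2, alpha=0.05, seed=0):
    """Share of repetitions in which a fitted Weibull is not rejected for the
    distances from one random reference vector to n_other random vectors."""
    rng = np.random.default_rng(seed)
    accepted = 0
    for _ in range(n_repeats):
        idx = rng.choice(len(T), size=n_other + 1, replace=False)
        s, others = T[idx[0]], T[idx[1:]]
        d = (np.abs(others - s) ** p).sum(axis=1) ** (1.0 / p)
        c, loc, scale = stats.weibull_min.fit(d)
        _, p_value = stats.kstest(d, 'weibull_min', args=(c, loc, scale))
        accepted += p_value > alpha
    return 100.0 * accepted / n_repeats

rng = np.random.default_rng(4)
T = rng.random((5000, 64)) @ rng.random((64, 64))   # assumed synthetic descriptors
print(f"{weibull_fit_rate(T):.1f}% of distance samples accepted as single-Weibull")
```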
83 Table 1 indicates that for most of the feature types computed from various regions, more than 90% of the distance distributions are fit by a single Weibull function. [sent-215, score-0.342]
84 As expected, distances between each of the SPIN, SIFT, PCA-SIFT and GLOH features are fitted well by Weibull distributions. [sent-216, score-0.326]
85 The distributions of distances between these two region/feature combinations tend to have multiple modes. [sent-218, score-0.38]
87 Figure 2: Distance distributions from one randomly selected image region to other regions, each described by the SIFT feature (x-axis: distances; panels a, b, c). [sent-240, score-0.864]
88 The distance distribution is described by a single Weibull function for a natural scene (a) and a man-made object (b). [sent-241, score-0.313]
89 For a composite image, the distance distribution is bimodal (c). [sent-242, score-0.222]
90 5 Conclusion In this paper, we have derived that similarity distances between one and other image features in databases are Weibull distributed. [sent-249, score-0.654]
91 For the SPIN, SIFT, GLOH and PCA-SIFT features, and for a large variety of images from the COREL image collection, we have demonstrated that the similarity distances from one to other features, computed from Lp norms, are Weibull-distributed. [sent-252, score-0.568]
92 Also, between PCA-transformed feature vectors, the distances are Weibull-distributed. [sent-254, score-0.525]
93 The Mahalanobis distance is very similar to the L2 -norm computed in the PCA-transformed feature space. [sent-255, score-0.288]
94 Hence, we expect Mahalanobis distances to be Weibull distributed as well. [sent-256, score-0.382]
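A quick check of this expectation on assumed synthetic data (not the paper's features): compute Mahalanobis distances from one reference vector and fit a Weibull.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(5)
T = rng.random((2000, 32)) @ rng.random((32, 32))    # correlated synthetic vectors
VI = np.linalg.inv(np.cov(T, rowvar=False))          # inverse covariance matrix
s = T[0]
d = np.array([mahalanobis(s, t, VI) for t in T[1:]])
c, loc, scale = stats.weibull_min.fit(d)
print(f"Mahalanobis distances: Weibull shape={c:.2f}, scale={scale:.2f}")
```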
95 The resulting Weibull distributions are distinctly different from the distributions suggested in the literature, as they can be positively or negatively skewed, while the Gamma [7] and normal [23] distributions are positively skewed and non-skewed, respectively. [sent-258, score-0.22]
96 We have experimentally shown them to hold for various popular feature extraction algorithms, and for a diverse range of images. [sent-261, score-0.249]
97 This fundamental insight opens new directions in the assessment of feature similarity, with projected improvements and speed-ups in object/scene recognition algorithms. [sent-262, score-0.267]
98 Generalised extreme value statistics and sum of correlated variables. [sent-276, score-0.2]
99 Using weibull mixture distributions to model heterogeneous survival data. [sent-344, score-0.743]
100 Robust scene categorization by learning image statistics in context. [sent-426, score-0.276]
wordName wordTfidf (topN-words)
[('weibull', 0.569), ('distances', 0.326), ('gloh', 0.203), ('feature', 0.199), ('spin', 0.184), ('sift', 0.166), ('lp', 0.16), ('similarity', 0.149), ('reference', 0.147), ('ti', 0.144), ('correlated', 0.117), ('scene', 0.115), ('adhere', 0.113), ('corollary', 0.101), ('image', 0.093), ('vectors', 0.091), ('gumbel', 0.09), ('distance', 0.089), ('features', 0.086), ('xi', 0.076), ('tt', 0.075), ('regions', 0.074), ('object', 0.073), ('pca', 0.073), ('correlating', 0.072), ('geusebroek', 0.068), ('ibr', 0.068), ('pekalska', 0.068), ('corel', 0.067), ('region', 0.065), ('mixture', 0.062), ('mikolajczyk', 0.059), ('frechet', 0.059), ('postponed', 0.059), ('composite', 0.058), ('heterogeneous', 0.058), ('distributed', 0.056), ('distributions', 0.054), ('si', 0.053), ('xj', 0.052), ('percentages', 0.05), ('experimentally', 0.05), ('establish', 0.05), ('realistic', 0.048), ('extreme', 0.048), ('shape', 0.047), ('arnold', 0.045), ('burghouts', 0.045), ('duin', 0.045), ('ferencz', 0.045), ('jurie', 0.045), ('mser', 0.045), ('nonidentical', 0.045), ('quadrant', 0.045), ('amsterdam', 0.045), ('sums', 0.045), ('tted', 0.044), ('cvpr', 0.044), ('assumptions', 0.041), ('differences', 0.041), ('argue', 0.04), ('lemma', 0.04), ('bimodal', 0.039), ('nowak', 0.039), ('smeulders', 0.039), ('vision', 0.039), ('recognition', 0.038), ('verify', 0.036), ('thereof', 0.036), ('cantly', 0.036), ('distribution', 0.036), ('tj', 0.035), ('statistics', 0.035), ('values', 0.034), ('schmid', 0.033), ('pattern', 0.033), ('derive', 0.033), ('categorization', 0.033), ('ts', 0.032), ('hints', 0.032), ('percentage', 0.032), ('dp', 0.031), ('validate', 0.031), ('gamma', 0.031), ('similarities', 0.031), ('experimentation', 0.03), ('opens', 0.03), ('established', 0.03), ('limited', 0.029), ('positively', 0.029), ('objects', 0.028), ('clustering', 0.028), ('variable', 0.028), ('variables', 0.027), ('central', 0.027), ('hence', 0.027), ('vector', 0.027), ('knowledge', 0.027), ('assumed', 0.027), ('density', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 193 nips-2007-The Distribution Family of Similarity Distances
Author: Gertjan Burghouts, Arnold Smeulders, Jan-mark Geusebroek
Abstract: Assessing similarity between features is a key step in object recognition and scene categorization tasks. We argue that knowledge on the distribution of distances generated by similarity functions is crucial in deciding whether features are similar or not. Intuitively one would expect that similarities between features could arise from any distribution. In this paper, we will derive the contrary, and report the theoretical result that Lp -norms –a class of commonly applied distance metrics– from one feature vector to other vectors are Weibull-distributed if the feature values are correlated and non-identically distributed. Besides these assumptions being realistic for images, we experimentally show them to hold for various popular feature extraction algorithms, for a diverse range of images. This fundamental insight opens new directions in the assessment of feature similarity, with projected improvements in object and scene recognition algorithms. 1
2 0.12472422 143 nips-2007-Object Recognition by Scene Alignment
Author: Bryan Russell, Antonio Torralba, Ce Liu, Rob Fergus, William T. Freeman
Abstract: Current object recognition systems can only recognize a limited number of object categories; scaling up to many categories is the next challenge. We seek to build a system to recognize and localize many different object categories in complex scenes. We achieve this through a simple approach: by matching the input image, in an appropriate representation, to images in a large training set of labeled images. Due to regularities in object identities across similar scenes, the retrieved matches provide hypotheses for object identities and locations. We build a probabilistic model to transfer the labels from the retrieval set to the input image. We demonstrate the effectiveness of this approach and study algorithm component contributions using held-out test sets from the LabelMe database. 1
3 0.093921751 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
Author: Amir Globerson, Tommi S. Jaakkola
Abstract: We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches. Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum aposteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph. Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bi-partite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efficiency. The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: For tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bi-partite graphs max-product and LP also yield the exact MAP [1]. Finally, Tree-Reweighted max-product (TRMP) algorithms [5, 10] were shown to converge to the LP solution for binary xi variables, as shown in [6]. In this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge, and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP 1 solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. 
We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP. 1 The Max-Product and MPLP Algorithms The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, or in fact converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product, but have several key advantages: • After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a fixed point of the update rule. • No additional parameters (e.g., tree weights as in [6]) are required. • If the fixed point beliefs have a unique maximizer then they correspond to the exact MAP. • For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will find the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable). Pseudo code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization of a set of dual variables. For EMPLP minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties. 2 The MAP Problem and its LP Relaxation We consider functions over n variables x = {x1 , . . . , xn } defined as follows. Given a graph G = (V, E) with n vertices, and potentials θij (xi , xj ) for all edges ij ∈ E, define the function1 f (x; θ) = θij (xi , xj ) . (1) ij∈E The MAP problem is defined as finding an assignment xM that maximizes the function f (x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij (xi , xj )}ij∈E distributions over variables corresponding to edges ij ∈ E and {µi (xi )}i∈V distributions corresponding to nodes i ∈ V . We will use µ to denote a given set of distributions over all edges and nodes. The set ML (G) is defined as the set of µ where pairwise and singleton distributions are consistent x ˆ xi µij (ˆi , xj ) = µj (xj ) , ˆ xj µij (xi , xj ) = µi (xi ) ∀ij ∈ E, xi , xj ˆ ML (G) = µ ≥ 0 ∀i ∈ V xi µi (xi ) = 1 Now consider the following linear program: MAPLPR : µL∗ = arg max µ∈ML (G) µ·θ. (2) where µ·θ is shorthand for µ·θ = ij∈E xi ,xj θij (xi , xj )µij (xi , xj ). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. µL∗ ·θ ≥ f (xM ). Furthermore, when the optimal µi (xi ) have only integral values, the assignment that maximizes µi (xi ) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR. 
1 P We note that some authors also add a term i∈V θi (xi ) to f (x; θ). However, these terms can be included in the pairwise functions θij (xi , xj ), so we ignore them for simplicity. 2 3 The LP Relaxation Dual Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML (G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition. Proposition 1 The following optimization problem is a convex dual of MAPLPR DMAPLPR : min max xi i s.t. max βki (xk , xi ) (3) k∈N (i) xk βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ) , where the dual variables are βij (xi , xj ) for all ij, ji ∈ E and values of xi and xj . The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star shaped graph Gi consisting of node i and all its neighbors N (i). Assume the potential on edge ki (for k ∈ N (i)) is βki (xk , xi ). The value of the MAP assignment for this model is max max βki (xk , xi ). This is exactly the term in the objective of DMAPLPR. Thus the dual xi k∈N (i) xk corresponds to individually decoding star graphs around all nodes i ∈ V where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR. 4 Block Coordinate Descent in the Dual To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, fix all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider fixing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji ), and minimizing DMAPLPR over the non-fixed variables. Only two terms in the DMAPLPR objective depend on βij and βji . We can write those as f (βij , βji ) = max λ−j (xi ) + max βji (xj , xi ) + max λ−i (xj ) + max βij (xi , xj ) i j xi where we defined λ−j (xi ) = i xj k∈N (i)\j xi xi (4) λki (xi ) and λki (xi ) = maxxk βki (xk , xi ) as in App. A. Note that the function f (βij , βji ) depends on the other β values only through λ−i (xj ) and λ−j (xi ). j i This implies that the optimization can be done solely in terms of λij (xj ) and there is no need to store the β values explicitly. The optimal βij , βji are obtained by minimizing f (βij , βji ) subject to the re-parameterization constraint βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ). The following proposition characterizes the minimum of f (βij , βji ). In fact, as mentioned above, we do not need to characterize the optimal βij (xi , xj ) itself, but only the new λ values. Proposition 2 Maximizing the function f (βij , βji ) yields the following λji (xi ) (and the equivalent expression for λij (xj )) 1 −j 1 λji (xi ) = − λi (xi ) + max λ−i (xj ) + θij (xi , xj ) j 2 2 xj The proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji (xi ) and λij (xj ), both these messages need to be updated simultaneously. We proceed to derive the NMPLP algorithm. 
For a given node i ∈ V , we consider all its neighbors j ∈ N (i), and wish to optimize over the variables βji (xj , xi ) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are fixed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of 3 Inputs: A graph G = (V, E), potential functions θij (xi , xj ) for each edge ij ∈ E. Initialization: Initialize messages to any value. Algorithm: • Iterate until a stopping criterion is satisfied: – Max-product: Iterate over messages and update (cji shifts the max to zero) h i mji (xi )← max m−i (xj ) + θij (xi , xj ) − cji j xj – EMPLP: For each ij ∈ E, update λji (xi ) and λij (xj ) simultaneously (the update for λij (xj ) is the same with i and j exchanged) h i 1 1 λji (xi )← − λ−j (xi ) + max λ−i (xj ) + θij (xi , xj ) j i 2 2 xj – NMPLP: Iterate over nodes i ∈ V and update all γij (xj ) where j ∈ N (i) 2 3 X 2 γij (xj )← max 4θij (xi , xj ) − γji (xi ) + γki (xi )5 xi |N (i)| + 1 k∈N(i) • Calculate node “beliefs”: Set biP i ) to be the sum of incoming messages into node i ∈ V (x (e.g., for NMPLP set bi (xi ) = k∈N(i) γki (xi )). Output: Return assignment x defined as xi = arg maxxi b(ˆi ). x ˆ Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use mesP sages mij , λij and γij respectively. We use the notation m−i (xj ) = k∈N(j)\i mkj (xj ). j this optimization can be found in closed form. The assumption about β being fixed outside the star implies that λ−i (xj ) is fixed. Define: γji (xi ) = maxxj θij (xi , xj ) + λ−i (xj ) . Simple algebra j j yields the following relation between λ−j (xi ) and γki (xi ) for k ∈ N (i) i 2 λ−j (xi ) = −γji (xi ) + γki (xi ) (5) i |N (i)| + 1 k∈N (i) Plugging this into the definition of γji (xi ) we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value since it can be shown that after one iteration they will correspond to valid β values. 5 Convergence Properties The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a fixed point of its updates. This in itself does not guarantee convergence to the dual optimum since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difficulty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP fixed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained. Proposition 3 If the fixed point of MPLP has bi (xi ) such that for all i the function bi (xi ) has a unique maximizer x∗ , then x∗ is the solution to the MAP problem and the LP relaxation is exact. i Since the dual objective is always greater than or equal to the MAP value, it suffices to show that there exists a dual feasible point whose objective value is f (x∗ ). Denote by β ∗ , λ∗ the value of the corresponding dual parameters at the fixed point of MPLP. 
Then the dual objective satisfies λ∗ (xi ) = ki max i xi k∈N (i) ∗ max βki (xk , x∗ ) = i i k∈N (i) xk ∗ βki (x∗ , x∗ ) = f (x∗ ) k i i 4 k∈N (i) To see why the second equality holds, note that bi (x∗ ) = maxxi ,xj λ−j (xi ) + βji (xj , xi ) and i i bj (x∗ ) = maxxi ,xj λ−i (xj ) + βij (xi , xj ). By the equalization property in Eq. 9 the arguments of j j the two max operations are equal. From the unique maximum assumption it follows that x∗ , x∗ are i j the unique maximizers of the above. It follows that βji , βij are also maximized by x∗ , x∗ . i j In the general case, the MPLP fixed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, fixed points do correspond to primal solutions, as the following proposition states. Proposition 4 When xi are binary, the MPLP fixed point can be used to obtain the primal optimum. The claim can be shown by constructing a primal optimal solution µ∗ . For tied bi , set µ∗ (xi ) to 0.5 i and for untied bi , set µ∗ (x∗ ) to 1. If bi , bj are not tied we set µ∗ (x∗ , x∗ ) = 1. If bi is not tied but bj i i ij i j is, we set µ∗ (x∗ , xj ) = 0.5. If bi , bj are tied then βji , βij can be shown to be maximized at either ij i x∗ , x∗ = (0, 0), (1, 1) or x∗ , x∗ = (0, 1), (1, 0). We then set µ∗ to be 0.5 at one of these assignment i j i j ij ∗ pairs. The resulting µ∗ is clearly primal feasible. Setting δi = b∗ we obtain that the dual variables i (δ ∗ , λ∗ , β ∗ ) and primal µ∗ satisfy complementary slackness for the LP in Eq. 7 and therefore µ∗ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables. 6 Beyond pairwise potentials: Generalized MPLP In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1 , . . . , ck ⊂ {1, . . . , n} (the set of clusters is denoted by C), and a function f (x; θ) = c θc (xc ) defined via potentials over clusters θc (xc ). The MAP problem in this case also has an LP relaxation (see e.g. [11]). To define the LP we introduce the following definitions: S = {c ∩ c : c, c ∈ C, c ∩ c = ∅} is the set of intersection between clusters ˆ ˆ ˆ and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c.We now consider marginals over the variables in c ∈ C and s ∈ S and require that cluster marginals agree on their overlap. Denote this set by ML (C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML (C). As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc (xc ) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation we assume all β are fixed except those corresponding to some cluster c. The resulting messages are λc→s (xs ) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be: 1 1 λ−c (xs ) + max λ−c (xs ) + θc (xc ) λc→s (xs ) = − 1 − ˆ s s ˆ |S(c)| |S(c)| xc\s s∈S(c)\s ˆ where for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ−c (xs ) is s defined as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). 
It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here. 7 Related Work Weiss et al. [11] recently studied the fixed points of a class of max-product like algorithms. Their analysis focused on properties of fixed points rather than convergence guarantees. Specifically, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its fixed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP fixed points, and in fact we can show that MPLP satisfies the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress however, that [11] does not address convergence of algorithms, but rather properties of their fixed points, if they converge. MPLP is similar in some aspects to Kolmogorov’s TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation and its fixed points also have similar 5 guarantees to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6). Recently, several new dual LP algorithms have been introduced, which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is specifically tailored for this case, and uses updates that are similar to our edge based updates. Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual, but importantly does not find the minimum for each coordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective. 8 Experiments We compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product (TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is defined as follows: At every iteration the beliefs can be used to obtain an assignment x with value f (x). 
We define the hit-time as the first iteration at which the maximum value of f (x) is achieved.4 We first experimented with a 10 × 10 grid graph, with 5 values per state. The function f (x) was 5 a Potts model: f (x) = The values for θij and θi (xi ) ij∈E θij I(xi = xj ) + i∈V θi (xi ). were randomly drawn from [−cI , cI ] and [−cF , cF ] respectively, and we used values of cI and cF in the range range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the the normalized difference between NMPLP and LP objective was at most 10−3 (median 10−7 ), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar f (x) values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f (x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f (x) value was less than 10−5 , suggesting that the exact MAP solution was found. We next applied the algorithms to the real world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of finding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GNMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2 − 158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f (x) value. The performance of MP was good on the runs where it converged. 2 As expected, NMPLP was faster than EMPLP so only NMPLP results are given. The edge weights for TRMP corresponded to a uniform distribution over all spanning trees. 4 This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence. 5 The potential θi (xi ) may be folded into the pairwise potential to yield a model as in Eq. 1. 6 Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz 3 6 (a) (b) (c) 100 (d) 0.6 2000 0.04 0.4 0.02 −50 0 −0.02 −0.04 ∆(Value) 0 1000 ∆(Hit Time) ∆(Value) ∆(Hit Time) 50 0 MP TRMP GMPLP 0 −0.2 −1000 −0.4 −0.06 −100 0.2 MP TRMP GMPLP MP TRMP MP TRMP Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f (x) for NMPLP and the other algorithms. All figures are such that better MPLP performance yields positive Y axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. 
Only convergent max-product runs are shown. 9 Conclusion We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent of the MAP-LP relaxation dual. The algorithm can also be extended to cluster based functions, which result empirically in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can result in significant performance improvements. However generalized max-product algorithms [14] are not guaranteed to converge whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersection between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications. A Derivation of the dual Before deriving the dual, we first express the constraint set ML (G) in a slightly different way. The definition of ML (G) in Sec. 2 uses a single distribution µij (xi , xj ) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote µij (xi , xj ) ¯ and µji (xj , xi ), and we add the constraint that these two copies both equal the original µij (xi , xj ). ¯ For this extended set of pairwise marginals, we consider the following set of constraints which is clearly equivalent to ML (G). On the rightmost column we give the dual variables that will correspond to each constraint (we omit non-negativity constraints). µij (xi , xj ) = µij (xi , xj ) ¯ µji (xj , xi ) = µij (xi , xj ) ¯ x xi µij (ˆi , xj ) = µj (xj ) ˆ ¯ µji (ˆj , xi ) = µi (xi ) ¯ x xj ˆ xi µi (xi ) = 1 ∀ij ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀ij ∈ E, xj ∀ji ∈ E, xi ∀i ∈ V βij (xi , xj ) βji (xj , xi ) λij (xj ) λji (xi ) δi (6) ¯ We denote the set of (µ, µ) satisfying these constraints by ML (G). We can now state an LP that ¯ is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent ¯ problem is to maximize µ · θ subject to (µ, µ) ∈ ML (G) (note that the objective uses the original ¯ µ copy). LP duality transformation of the extended problem yields the following LP min i δi s.t. λij (xj ) − βij (xi , xj ) ≥ 0 βij (xi , xj ) + βji (xj , xi ) = θij (xi , xj ) − k∈N (i) λki (xi ) + δi ≥ 0 ∀ij, ji ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀i ∈ V, xi (7) We next simplify the above LP by eliminating some of its constraints and variables. Since each variable δi appears in only one constraint, and the objective minimizes δi it follows that δi = maxxi k∈N (i) λki (xi ) and the constraints with δi can be discarded. Similarly, since λij (xj ) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, xi , xj λij (xj ) = maxxi βij (xi , xj ) and the constraints with λij (xj ), λji (xi ) can also be discarded. Using the eliminated δi and λji (xi ) 7 variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex since it is a sum of point-wise maxima of convex functions. B Proof of Proposition 2 We wish to minimize f in Eq. 4 subject to the constraint that βij + βji = θij . Rewrite f as f (βij , βji ) = max λ−j (xi ) + βji (xj , xi ) + max λ−i (xj ) + βij (xi , xj ) j i xi ,xj xi ,xj (8) The sum of the two arguments in the max is λ−j (xi ) + λ−i (xj ) + θij (xi , xj ) i j (because of the constraints on β). 
Thus the minimum must be greater than −j −i 1 2 maxxi ,xj λi (xi ) + λj (xj ) + θij (xi , xj ) . One assignment to β that achieves this minimum is obtained by requiring an equalization condition:7 λ−i (xj ) + βij (xi , xj ) = λ−j (xi ) + βji (xj , xi ) = j i 1 θij (xi , xj ) + λ−j (xi ) + λ−i (xj ) i j 2 (9) which implies βij (xi , xj ) = 1 θij (xi , xj ) + λ−j (xi ) − λ−i (xj ) and a similar expression for βji . i j 2 The resulting λij (xj ) = maxxi βij (xi , xj ) are then the ones in Prop. 2. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship. References [1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI. 2007. [4] J.K. Johnson, D.M. Malioutov, and A.S. Willsky. Lagrangian relaxation for map estimation in graphical models. In Allerton Conf. Communication, Control and Computing, 2007. [5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006. [6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artificial Intelligence (UAI). 2005. [7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. [8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and bregman projections. Journal of Machine Learning Research, pages 1627–1653, 2006. [9] P.O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] Y. Weiss, C. Yanover, and T. Meltzer. Map estimation, linear programming and belief propagation with convex free energies. In UAI. 2007. [12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007. [13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Jourmal of Machine Learning Research, 7:1887–1907, 2006. [14] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005. 7 Other solutions are possible but may not yield some of the properties of MPLP. 8
4 0.088808641 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
Author: Sujay Sanghavi, Dmitry Malioutov, Alan S. Willsky
Abstract: Loopy belief propagation has been employed in a wide variety of applications with great empirical success, but it comes with few theoretical guarantees. In this paper we investigate the use of the max-product form of belief propagation for weighted matching problems on general graphs. We show that max-product converges to the correct answer if the linear programming (LP) relaxation of the weighted matching problem is tight and does not converge if the LP relaxation is loose. This provides an exact characterization of max-product performance and reveals connections to the widely used optimization technique of LP relaxation. In addition, we demonstrate that max-product is effective in solving practical weighted matching problems in a distributed fashion by applying it to the problem of self-organization in sensor networks. 1
5 0.07248725 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan
Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1
7 0.070814505 61 nips-2007-Convex Clustering with Exemplar-Based Models
8 0.070105642 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
9 0.06699191 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
10 0.066264473 115 nips-2007-Learning the 2-D Topology of Images
11 0.065442823 128 nips-2007-Message Passing for Max-weight Independent Set
12 0.065440163 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index
13 0.064960614 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
14 0.060894318 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
15 0.060596488 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation
16 0.055562921 183 nips-2007-Spatial Latent Dirichlet Allocation
17 0.054609463 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
18 0.054455593 109 nips-2007-Kernels on Attributed Pointsets with Applications
19 0.05338503 145 nips-2007-On Sparsity and Overcompleteness in Image Models
20 0.053095561 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
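The entry for paper 120 above ties max-product's behaviour to the tightness of the LP relaxation of weighted matching. For concreteness, here is a minimal Python sketch (not the paper's max-product algorithm): it solves that LP relaxation for a small made-up graph with SciPy and checks whether the optimum is integral, which is the tightness condition the abstract refers to. The graph, weights, and tolerance are illustrative assumptions only.

import numpy as np
from scipy.optimize import linprog

# Hypothetical graph: maximize sum of w_e * x_e subject to each vertex
# touching at most one unit of matched edge mass, with 0 <= x_e <= 1.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
weights = np.array([2.0, 1.0, 1.5, 3.0])
n_vertices = 4

A = np.zeros((n_vertices, len(edges)))      # vertex-edge incidence constraints
for e, (u, v) in enumerate(edges):
    A[u, e] = 1.0
    A[v, e] = 1.0

# linprog minimizes, so negate the weights to maximize the matching weight.
res = linprog(c=-weights, A_ub=A, b_ub=np.ones(n_vertices),
              bounds=[(0.0, 1.0)] * len(edges), method="highs")

x = res.x
tight = bool(np.all(np.minimum(x, 1.0 - x) < 1e-6))   # integral optimum => LP tight
print("LP value:", -res.fun, "solution:", np.round(x, 3), "tight:", tight)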
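The entry for paper 10 above describes solving SVMs by repeatedly working on small subsets of the data. The following is a rough sketch in that spirit, not the authors' algorithm: train on a small random subset, add a few points the current model misclassifies, and stop when no new violators appear. The synthetic dataset, subset size, and per-round cap are made-up choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

# Start from a small random subset; its size here is an arbitrary multiple of log n.
subset = set(rng.choice(len(X), size=int(np.log2(len(X))) * 10, replace=False))
for _ in range(20):                                   # cap on refinement rounds
    idx = np.fromiter(subset, dtype=int)
    clf = LinearSVC(max_iter=5000).fit(X[idx], y[idx])
    violators = np.where(clf.predict(X) != y)[0]      # points the subset model gets wrong
    new = [i for i in violators if i not in subset][:50]
    if not new:
        break
    subset.update(new)

print(f"trained on {len(subset)} of {len(X)} points, "
      f"training accuracy {clf.score(X, y):.3f}")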
topicId topicWeight
[(0, -0.189), (1, 0.053), (2, -0.065), (3, -0.02), (4, -0.029), (5, -0.005), (6, 0.02), (7, 0.147), (8, -0.002), (9, 0.03), (10, -0.026), (11, -0.009), (12, -0.007), (13, 0.057), (14, -0.042), (15, 0.061), (16, 0.032), (17, 0.039), (18, -0.004), (19, -0.036), (20, 0.01), (21, 0.081), (22, 0.017), (23, -0.083), (24, -0.128), (25, -0.077), (26, -0.036), (27, -0.16), (28, -0.13), (29, -0.016), (30, 0.008), (31, 0.059), (32, 0.012), (33, -0.037), (34, 0.146), (35, 0.064), (36, -0.118), (37, -0.051), (38, -0.038), (39, 0.007), (40, -0.039), (41, 0.003), (42, 0.093), (43, 0.13), (44, 0.004), (45, 0.043), (46, -0.101), (47, 0.112), (48, 0.039), (49, -0.138)]
simIndex simValue paperId paperTitle
same-paper 1 0.95833737 193 nips-2007-The Distribution Family of Similarity Distances
Author: Gertjan Burghouts, Arnold Smeulders, Jan-mark Geusebroek
Abstract: Assessing similarity between features is a key step in object recognition and scene categorization tasks. We argue that knowledge on the distribution of distances generated by similarity functions is crucial in deciding whether features are similar or not. Intuitively one would expect that similarities between features could arise from any distribution. In this paper, we will derive the contrary, and report the theoretical result that Lp -norms –a class of commonly applied distance metrics– from one feature vector to other vectors are Weibull-distributed if the feature values are correlated and non-identically distributed. Besides these assumptions being realistic for images, we experimentally show them to hold for various popular feature extraction algorithms, for a diverse range of images. This fundamental insight opens new directions in the assessment of feature similarity, with projected improvements in object and scene recognition algorithms. 1 (A small synthetic illustration of the Weibull claim follows this list.)
2 0.57574588 143 nips-2007-Object Recognition by Scene Alignment
Author: Bryan Russell, Antonio Torralba, Ce Liu, Rob Fergus, William T. Freeman
Abstract: Current object recognition systems can only recognize a limited number of object categories; scaling up to many categories is the next challenge. We seek to build a system to recognize and localize many different object categories in complex scenes. We achieve this through a simple approach: by matching the input image, in an appropriate representation, to images in a large training set of labeled images. Due to regularities in object identities across similar scenes, the retrieved matches provide hypotheses for object identities and locations. We build a probabilistic model to transfer the labels from the retrieval set to the input image. We demonstrate the effectiveness of this approach and study algorithm component contributions using held-out test sets from the LabelMe database. 1 (A schematic retrieval-and-vote sketch follows this list.)
3 0.54308736 196 nips-2007-The Infinite Gamma-Poisson Feature Model
Author: Michalis K. Titsias
Abstract: We present a probability distribution over non-negative integer valued matrices with possibly an infinite number of columns. We also derive a stochastic process that reproduces this distribution over equivalence classes. This model can play the role of the prior in nonparametric Bayesian learning scenarios where multiple latent features are associated with the observed data and each feature can have multiple appearances or occurrences within each data point. Such data arise naturally when learning visual object recognition systems from unlabelled images. Together with the nonparametric prior we consider a likelihood model that explains the visual appearance and location of local image patches. Inference with this model is carried out using a Markov chain Monte Carlo algorithm. 1
4 0.52049303 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index
Author: Harald Steck, Balaji Krishnapuram, Cary Dehing-oberije, Philippe Lambin, Vikas C. Raykar
Abstract: In this paper, we show that classical survival analysis involving censored data can naturally be cast as a ranking problem. The concordance index (CI), which quantifies the quality of rankings, is the standard performance measure for model assessment in survival analysis. In contrast, the standard approach to learning the popular proportional hazard (PH) model is based on Cox’s partial likelihood. We devise two bounds on CI–one of which emerges directly from the properties of PH models–and optimize them directly. Our experimental results suggest that all three methods perform about equally well, with our new approach giving slightly better results. We also explain why a method designed to maximize Cox’s partial likelihood also ends up (approximately) maximizing the CI. 1 (A plain implementation of the concordance index follows this list.)
5 0.51734668 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection
Author: Jingrui He, Jaime G. Carbonell
Abstract: Rare category detection is an open challenge for active learning, especially in the de-novo case (no labeled examples), but of significant practical importance for data mining - e.g. detecting new financial transaction fraud patterns, where normal legitimate transactions dominate. This paper develops a new method for detecting an instance of each minority class via an unsupervised local-density-differential sampling strategy. Essentially a variable-scale nearest neighbor process is used to optimize the probability of sampling tightly-grouped minority classes, subject to a local smoothness assumption of the majority class. Results on both synthetic and real data sets are very positive, detecting each minority class with only a fraction of the actively sampled points required by random sampling and by Pelleg’s Interleave method, the prior best technique in the sparse literature on this topic. 1
6 0.49409384 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
7 0.45909482 113 nips-2007-Learning Visual Attributes
8 0.44903633 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
9 0.44726396 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
10 0.44500941 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
11 0.43287671 56 nips-2007-Configuration Estimates Improve Pedestrian Finding
12 0.42061868 128 nips-2007-Message Passing for Max-weight Independent Set
13 0.4091146 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
15 0.37446097 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
16 0.37012374 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
17 0.36921382 115 nips-2007-Learning the 2-D Topology of Images
18 0.36793393 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
19 0.36734247 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
20 0.36678427 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
topicId topicWeight
[(5, 0.021), (13, 0.025), (16, 0.024), (19, 0.011), (21, 0.557), (31, 0.015), (35, 0.012), (47, 0.061), (83, 0.105), (87, 0.036), (90, 0.037)]
simIndex simValue paperId paperTitle
simIndex simValue paperId paperTitle
1 0.96245217 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes
Author: Misha Ahrens, Maneesh Sahani
Abstract: Many perceptual processes and neural computations, such as speech recognition, motor control and learning, depend on the ability to measure and mark the passage of time. However, the processes that make such temporal judgements possible are unknown. A number of different hypothetical mechanisms have been advanced, all of which depend on the known, temporally predictable evolution of a neural or psychological state, possibly through oscillations or the gradual decay of a memory trace. Alternatively, judgements of elapsed time might be based on observations of temporally structured but stochastic processes. Such processes need not be specific to the sense of time; typical neural and sensory processes contain at least some statistical structure across a range of time scales. Here, we investigate the statistical properties of an estimator of elapsed time which is based on a simple family of stochastic processes. 1
2 0.95613807 32 nips-2007-Bayesian Co-Training
Author: Shipeng Yu, Balaji Krishnapuram, Harald Steck, R. B. Rao, Rómer Rosales
Abstract: We propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, unlike some previous multi-view learning methods. Furthermore, it can automatically estimate how much each view should be trusted, and thus accommodate noisy or unreliable views. Experiments on toy data and real world data sets illustrate the benefits of this approach. 1
same-paper 3 0.95470482 193 nips-2007-The Distribution Family of Similarity Distances
Author: Gertjan Burghouts, Arnold Smeulders, Jan-mark Geusebroek
Abstract: Assessing similarity between features is a key step in object recognition and scene categorization tasks. We argue that knowledge on the distribution of distances generated by similarity functions is crucial in deciding whether features are similar or not. Intuitively one would expect that similarities between features could arise from any distribution. In this paper, we will derive the contrary, and report the theoretical result that Lp -norms –a class of commonly applied distance metrics– from one feature vector to other vectors are Weibull-distributed if the feature values are correlated and non-identically distributed. Besides these assumptions being realistic for images, we experimentally show them to hold for various popular feature extraction algorithms, for a diverse range of images. This fundamental insight opens new directions in the assessment of feature similarity, with projected improvements in object and scene recognition algorithms. 1
4 0.91196597 19 nips-2007-Active Preference Learning with Discrete Choice Data
Author: Brochu Eric, Nando D. Freitas, Abhijeet Ghosh
Abstract: We propose an active learning algorithm that learns a continuous valuation model from discrete preferences. The algorithm automatically decides what items are best presented to an individual in order to find the item that they value highly in as few trials as possible, and exploits quirks of human psychology to minimize time and cognitive burden. To do this, our algorithm maximizes the expected improvement at each query without accurately modelling the entire valuation surface, which would be needlessly expensive. The problem is particularly difficult because the space of choices is infinite. We demonstrate the effectiveness of the new algorithm compared to related active learning methods. We also embed the algorithm within a decision-making tool for assisting digital artists in rendering materials. The tool finds the best parameters while minimizing the number of queries. 1 (A short expected-improvement sketch follows this list.)
5 0.630786 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
Author: Bill Triggs, Jakob J. Verbeek
Abstract: Conditional Random Fields (CRFs) are an effective tool for a variety of different data segmentation and labeling tasks including visual scene interpretation, which seeks to partition images into their constituent semantic-level regions and assign appropriate class labels to each region. For accurate labeling it is important to capture the global context of the image as well as local information. We introduce a CRF based scene labeling model that incorporates both local features and features aggregated over the whole image or large sections of it. Secondly, traditional CRF learning requires fully labeled datasets which can be costly and troublesome to produce. We introduce a method for learning CRFs from datasets with many unlabeled nodes by marginalizing out the unknown labels so that the log-likelihood of the known ones can be maximized by gradient ascent. Loopy Belief Propagation is used to approximate the marginals needed for the gradient and log-likelihood calculations and the Bethe free-energy approximation to the log-likelihood is monitored to control the step size. Our experimental results show that effective models can be learned from fragmentary labelings and that incorporating top-down aggregate features significantly improves the segmentations. The resulting segmentations are compared to the state-of-the-art on three different image datasets. 1
6 0.61280483 69 nips-2007-Discriminative Batch Mode Active Learning
7 0.60456771 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
8 0.60195792 56 nips-2007-Configuration Estimates Improve Pedestrian Finding
9 0.59502774 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition
10 0.59379834 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
11 0.58269477 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
12 0.58044547 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
13 0.57951719 136 nips-2007-Multiple-Instance Active Learning
14 0.57686269 175 nips-2007-Semi-Supervised Multitask Learning
15 0.57141119 97 nips-2007-Hidden Common Cause Relations in Relational Learning
16 0.57072574 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
17 0.5605244 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
18 0.55978036 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
19 0.55587488 83 nips-2007-Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks
20 0.55166405 100 nips-2007-Hippocampal Contributions to Control: The Third Way
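The entry for paper 19 above selects queries by maximizing expected improvement. A short sketch of that acquisition value for a maximization problem follows; the posterior means, standard deviations, incumbent value, and exploration offset xi are illustrative numbers, and in the paper's setting they would come from the learned valuation model.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """Expected improvement of candidates with posterior mean mu and std sigma."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    gain = mu - best_so_far - xi
    z = gain / np.maximum(sigma, 1e-12)              # guard against zero std
    ei = gain * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, np.maximum(gain, 0.0))

# Pick the candidate item with the largest expected improvement.
mu, sigma = [0.20, 0.50, 0.45], [0.30, 0.05, 0.20]
print(np.argmax(expected_improvement(mu, sigma, best_so_far=0.4)))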