nips nips2009 nips2009-32 knowledge-graph by maker-knowledge-mining

32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning


Source: pdf

Author: Gal Chechik, Uri Shalit, Varun Sharma, Samy Bengio

Abstract: Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. [sent-8, score-0.132]

2 It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. [sent-9, score-0.237]

3 Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. [sent-11, score-0.253]

4 We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. [sent-12, score-0.156]

5 Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. [sent-13, score-0.155]

6 On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. [sent-16, score-0.213]

7 The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. [sent-17, score-0.184]

8 This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before. [sent-18, score-0.085]

9 1 Introduction Learning a pairwise similarity measure from data is a fundamental task in machine learning. [sent-19, score-0.116]

10 Pair distances underlie classification methods like nearest neighbors and kernel machines, and similarity learning has important applications for “query-by-example” in information retrieval. [sent-20, score-0.175]

11 For instance, a user may wish to find images that are similar to (but not identical copies of) an image she has; a user watching an online video may wish to find additional videos about the same subject. [sent-21, score-0.238]

12 A large number of previous studies of learning similarities have focused on metric learning, like in the case of a positive semidefinite matrix that defines a Mahalanobis distance [19]. [sent-24, score-0.175]

13 However, similarity learning algorithms are often evaluated in a context of ranking [16, 5]. [sent-25, score-0.144]

14 When the amount of training data available is very small, adding positivity constraints for enforcing metric properties is useful for reducing overfitting and improving generalization. [sent-26, score-0.156]

15 With this view, we take here an approach that avoids imposing positivity or symmetry constraints on the learned similarity measure. [sent-28, score-0.275]

16 Some similarity learning algorithms assume that the available training data contains real-valued pairwise similarities or distances. [sent-29, score-0.179]

17 Here we focus on a weaker supervision signal: the relative similarity of different pairs [4]. [sent-30, score-0.155]

18 This signal is also easier to obtain; here we extract similarity information from pairs of images that share a common label or are retrieved in response to a common text query in an image search engine. [sent-31, score-0.415]

19 The current paper presents an approach for learning semantic similarity that scales up to two orders of magnitude larger than current published approaches. [sent-32, score-0.207]

20 Given two images p1 and p2, we measure similarity through the bilinear form p1T W p2, where the matrix W is not required to be positive or even symmetric. [sent-34, score-0.27]

21 It minimizes a large margin target function based on the hinge loss, and converges to high quality similarity measures after being presented with a small fraction of the training pairs. [sent-37, score-0.188]

22 For web-scale datasets, OASIS can be trained on more than two million images within three days on a single CPU. [sent-39, score-0.191]

23 On this large-scale dataset, human evaluations of the OASIS-learned similarity show that 35% of the ten nearest neighbors of a given image are semantically relevant to that image. [sent-40, score-0.36]

24 2 Learning Relative Similarity We consider the problem of learning a pairwise similarity function S, given supervision on the relative similarity between two pairs of images. [sent-41, score-0.271]

25 The algorithm is designed to scale well with the number of samples and the number of features, by using fast online updates and a sparse representation. [sent-42, score-0.084]

26 Formally, we are given a set of images P, where each image is represented as a vector p ∈ Rd . [sent-43, score-0.192]

27 We assume that we have access to an oracle that, given a query image pi ∈ P, can locate two other images, pi+ ∈ P and pi− ∈ P, such that pi+ is more relevant to pi than pi−. [sent-44, score-0.54]

28 However, unlike methods that assume that a numerical value of the similarity is available, relevance(pi, pj) ∈ R, we use this weaker form of supervision, and only assume that some pairs of images can be ranked by their relevance to a query image pi. [sent-46, score-0.729]

29 The relevance measure could reflect that the relevant image pi+ belongs to the same class of images as the query image, or reflect any other semantic property of the images. [sent-47, score-0.397]
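
To make this form of supervision concrete, the sketch below samples a triplet (query, relevant, irrelevant) using class labels as the relevance signal, mirroring the same-class criterion above; the function name, the use of NumPy, and the rejection-sampling loop are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sample_triplet(labels, rng=np.random):
    """Sample indices (i, pos, neg) with labels[pos] == labels[i] != labels[neg].

    `labels` holds one class id per image; images sharing the query's class
    (or text query) are treated as more relevant. Assumes at least two classes
    and at least two images in the query's class.
    """
    labels = np.asarray(labels)
    while True:
        i = rng.randint(len(labels))
        same = np.flatnonzero(labels == labels[i])
        diff = np.flatnonzero(labels != labels[i])
        if len(same) > 1 and len(diff) > 0:
            pos = rng.choice(same[same != i])
            neg = rng.choice(diff)
            return i, pos, neg
```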

30 Our goal is to learn a similarity function SW(pi, pj) parameterized by W that assigns higher similarity scores to the pairs of more relevant images (with a safety margin): S(pi, pi+) > S(pi, pi−) + 1, ∀ pi, pi+, pi− ∈ P. (1) [sent-48, score-0.453]

31 In this paper, we consider a parametric similarity function that has a bilinear form, SW(pi, pj) ≡ piT W pj (2), with W ∈ Rd×d. [sent-49, score-0.238]

32 Importantly, if the image vectors pi ∈ Rd are sparse, namely, the number of non-zero entries ki ≡ ‖pi‖0 is small, ki ≪ d, then the value of the score defined in Eq. (2) can be computed efficiently, at a cost that depends only on the numbers of non-zero entries. [sent-50, score-0.478]
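
As an illustration of why sparsity matters here, the sketch below evaluates the bilinear score of Eq. (2) while touching only the non-zero entries of both images; representing images as {index: value} dictionaries is an assumption made for clarity, not the paper's data structure.

```python
import numpy as np

def bilinear_score(W, p_i, p_j):
    """S_W(p_i, p_j) = p_i^T W p_j for sparse vectors given as dicts {index: value}.

    Cost is O(k_i * k_j), where k_i and k_j are the numbers of non-zero entries,
    independent of the full dimension d of W.
    """
    return sum(v_i * W[a, b] * v_j
               for a, v_i in p_i.items()
               for b, v_j in p_j.items())

# Toy example with d = 5 and two sparse "images".
W = np.eye(5)
print(bilinear_score(W, {0: 2.0, 3: 1.0}, {0: 1.0, 4: 3.0}))  # 2.0
```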

33 To satisfy Eq. (1), we define a global loss LW that accumulates hinge losses over all possible triplets in the training set: LW ≡ Σ_{(pi, pi+, pi−) ∈ P3} lW(pi, pi+, pi−), with the loss for a single triplet being lW(pi, pi+, pi−) ≡ max(0, 1 − SW(pi, pi+) + SW(pi, pi−)). [sent-54, score-0.48]
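
A minimal sketch of the single-triplet hinge loss lW and of the accumulated loss LW follows, reusing bilinear_score from the sketch above; it illustrates the objective only and does not reproduce the OASIS update rule.

```python
def triplet_hinge_loss(W, p, p_pos, p_neg, score=bilinear_score):
    """l_W(p, p+, p-) = max(0, 1 - S_W(p, p+) + S_W(p, p-))."""
    return max(0.0, 1.0 - score(W, p, p_pos) + score(W, p, p_neg))

def global_loss(W, triplets, score=bilinear_score):
    """L_W: sum of hinge losses over a collection of (p, p+, p-) triplets."""
    return sum(triplet_hinge_loss(W, p, pp, pn, score) for p, pp, pn in triplets)
```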

34 We study variants of OASIS that enforce symmetry or positivity in Sec. [sent-79, score-0.109]

35 3 Related Work Learning similarity using relative relevance has been intensively studied, and a few recent approaches aim to address learning at large scale. [sent-83, score-0.174]

36 For small-scale data, there are two main groups of similarity learning approaches. [sent-84, score-0.116]

37 Such approaches include Fisher's Linear Discriminant Analysis (LDA), relevant component analysis (RCA) [1], supervised global metric learning [18], large margin nearest neighbor (LMNN) [16], and metric learning by collapsing classes (MCML) [5]. [sent-86, score-0.25]

38 Other constraints like sparseness are sometimes induced over the learned metric [14]. [sent-87, score-0.109]

39 The work in [4] further learns a weighting over local distance functions for every image in the training set. [sent-93, score-0.152]

40 Non-linear image similarity learning was also studied in the context of dimensionality reduction, as in [8]. [sent-94, score-0.192]

41 This work is one of the closest to OASIS: it learns online a linear model of a [dis-]similarity function. [Table 1: OASIS — successful cases from the web dataset; each row shows a query image and the top 5 relevant images retrieved by OASIS.] [sent-96, score-0.298]

42 The relevant text queries for each image are shown beneath the image (not used in training). [sent-97, score-0.243]

43 Learning a semantic similarity function between images was also studied in [13]. [sent-100, score-0.279]

44 There, semantic similarity is learned by representing each image by the posterior probability distribution over a predefined set of semantic tags, and then computing the distance between two images as the distance between the two underlying posterior distributions. [sent-101, score-0.495]

45 The representation size of each image therefore grows with the number of semantic classes. [sent-102, score-0.123]

46 Then, to quantitatively compare the precision of OASIS with other, small-scale metric-learning methods, we tested OASIS using Caltech-256, a standard machine vision benchmark. [sent-106, score-0.117]

47 Broadly speaking, features are extracted by dividing each image into overlapping square blocks, representing each block by edge and color histograms, and finding the nearest block in a predefined set (dictionary) of d = 10,000 vectors of such features. [sent-110, score-0.103]

48 An image is thus represented as the number of times each dictionary visual word was present in it, yielding vectors in Rd with an average of 70 non-zero values. [sent-111, score-0.108]
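
The sketch below illustrates this kind of sparse visual-word count vector: each block descriptor is assigned to its nearest dictionary word and the counts form the image representation. The brute-force nearest-neighbor search and the descriptor format are simplifying assumptions, not the exact feature pipeline used by the authors.

```python
import numpy as np

def visual_word_histogram(block_descriptors, dictionary):
    """Count how often each dictionary word is the nearest neighbor of a block.

    block_descriptors: (n_blocks, feat_dim) array of per-block edge/color features.
    dictionary:        (d, feat_dim) array of visual words (d = 10,000 in the paper).
    Returns a length-d count vector, typically with few non-zero entries.
    """
    d = dictionary.shape[0]
    counts = np.zeros(d)
    for desc in block_descriptors:
        nearest = np.argmin(np.linalg.norm(dictionary - desc, axis=1))
        counts[nearest] += 1
    return counts
```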

49 We evaluated the performance of all algorithms using precision-at-top-k, a standard ranking precision measure based on nearest neighbors. [sent-113, score-0.138]

50 For each query image in the test set, all other test images were ranked according to their similarity to the query image; the number of same-class images among the top k images (the k nearest neighbors) was computed and then averaged across test images. [sent-114, score-0.743]

51 We also calculated the mean average precision (mAP), a measure that is widely used in the information retrieval community. [sent-115, score-0.083]
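
A minimal sketch of precision-at-top-k under the same-class relevance criterion is shown below; averaging across query images, and the mAP computation, follow the same pattern and are omitted.

```python
import numpy as np

def precision_at_k(scores, query_label, labels, k=10):
    """Precision at top k for a single query image.

    scores: similarity of the query to every other test image (higher = more similar).
    labels: class label of every other test image; an image counts as correct
            if it shares the query's class.
    """
    top_k = np.argsort(-np.asarray(scores))[:k]
    return float(np.mean(np.asarray(labels)[top_k] == query_label))
```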

52 2.7 million images were scraped from the Google image search engine. [sent-118, score-0.233]

53 We collected a set of ∼150K anonymized text queries, and for each of these queries, we had access to a set of relevant images. [sent-119, score-0.094]

54 To compute an image-image relevance measure, we first obtained measures of relevance between images and text queries. [sent-120, score-0.251]

55 This was achieved by collecting anonymized clicks over images collected from the set of text queries. [sent-121, score-0.182]

56 We used these query-image click counts C(query, image) to compute the (unnormalized) probability that two images are co-queried as Relevance(image, image) = CT C. [sent-122, score-0.116]
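
As a sketch, given a (queries x images) click-count matrix C, the image-image relevance described above is simply CT C; the toy matrix and the use of SciPy sparse matrices are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# C[q, m] = click count on image m for text query q (toy example: 3 queries, 4 images).
C = csr_matrix(np.array([[2, 0, 1, 0],
                         [0, 3, 0, 1],
                         [1, 0, 0, 2]]))

# Unnormalized co-query relevance between images: Relevance = C^T C.
relevance = (C.T @ C).toarray()
print(relevance[0, 2])  # images 0 and 2 were clicked for the same query -> 2
```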

57 Table 1 shows the top five images as ranked by OASIS on two examples of query-images in the test set. [sent-129, score-0.148]

58 In these examples, OASIS captures similarity that goes beyond visual appearance: most top-ranked images are about the same concept as the query image, even though that concept was never provided in textual form and is inferred in the viewer's mind (“dog”, “snow”). [sent-130, score-0.352]

59 This shows that learning similarity across co-queried images can indeed capture the semantics of queries even if the queries are not explicitly used during training. [sent-131, score-0.32]

60 To obtain a quantitative evaluation of the ranking obtained by OASIS we created an evaluation benchmark, by asking human evaluators to mark if a set of candidate images were semantically relevant to a set of 25 popular image queries. [sent-132, score-0.431]

61 For each query image, evaluators were presented with the top-10 images ranked by OASIS, mixed with 10 random images. [sent-133, score-0.345]

62 Given the relevance ranking from 30 evaluators, we computed the precision of each OASIS rank as the fraction of people that marked each image as relevant to the query image. [sent-134, score-0.345]

63 On average across all queries and evaluators, OASIS rankings yielded precision of ∼ 40% at the top 10 ranked images. [sent-135, score-0.181]

64 As an estimate of an “upper bound” on the difficulty of the task, we also computed the precision obtained by human evaluators: For every evaluator, we used the rankings of all other evaluators as ground truth, to compute his precision. [sent-136, score-0.255]

65 As with the ranks of OASIS, we computed the fraction of evaluators that marked an image as relevant, and repeated this separately for every query and human evaluator, providing a measure of “coherence” per query. [sent-137, score-0.298]

66 Figure 1(a) shows the mean precision obtained by OASIS and human evaluators for every query in our data. [sent-139, score-0.305]

67 For some queries OASIS achieves precision that is very close to that of the mean human evaluator. [sent-140, score-0.152]

68 [Figure 1 legend: panel (a) plots human precision and OASIS precision per query; panel (b) plots runtime (min) for fast LMNN (MNIST, 10 categories; ~190 days by projected extrapolation, 2nd-order polynomial) and for OASIS on the web data.] [sent-142, score-0.363]

69 Figure 1: (a) Precision of OASIS and human evaluators, per query (query IDs sorted by precision), using rankings of all (remaining) human evaluators as a ground truth. (b) Runtime (min) as a function of the number of images, on a log scale. [sent-148, score-0.441]

70 We further studied how the runtime of OASIS scales with the size of the training set. [sent-153, score-0.094]

71 Each curve shows the precision at top k as a function of k neighbors. [sent-169, score-0.104]

72 The results are averaged across 5 train/test partitions (40 training images, 25 test images per class); error bars are the standard error of the means (s.e.m.). [sent-170, score-0.142]

73 2 Caltech256 Dataset To compare OASIS with small-scale methods we used the Caltech256 dataset [7], containing images collected from Google image search and from PicSearch. [sent-175, score-0.214]

74 Images were assigned to 257 categories and evaluated by humans in order to ensure image quality and relevance. [sent-177, score-0.092]

75 After pre-processing the images and filtering out those that were too small, we were left with 29,461 images in 256 categories. [sent-178, score-0.232]

76 As a preprocessing phase, images were projected to a basis of the principal components (PCA) of the data, with no dimensionality reduction. [sent-184, score-0.134]

77 For OASIS, images from the same class were treated as similar. [sent-188, score-0.116]

78 Results reported below were obtained by selecting the best value of the hyperparameter and then training again on the full training set (40 images per class). [sent-198, score-0.184]

79 3 Symmetry and positivity The similarity matrix W learned by OASIS is not guaranteed to be positive or even symmetric. [sent-238, score-0.222]

80 Some applications, like ranking images by semantic relevance to a given image query, are known to be non-symmetric when based on human judgement [15]. [sent-239, score-0.422]

81 However, in some applications symmetry or positivity constraints reflect prior knowledge that may help in avoiding overfitting. [sent-240, score-0.108]

82 1 Symmetric similarities A simple approach to enforce symmetry is to project the OASIS model W onto the set of symmetric matrices, W′ = sym(W) = ½ (WT + W). [sent-244, score-0.107]

83 Alternatively, the asymmetric score function SW(pi, pj) in lW can be replaced with a symmetric score SW′(pi, pj) ≡ −(pi − pj)T W (pi − pj). [sent-246, score-0.293]
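
Both symmetrization routes can be sketched in a few lines, assuming the learned W is a dense d x d NumPy array; this is an illustrative reading of the two definitions above, not the authors' implementation.

```python
import numpy as np

def sym_project(W):
    """Project W onto the set of symmetric matrices: sym(W) = (W^T + W) / 2."""
    return 0.5 * (W.T + W)

def symmetric_score(W, p_i, p_j):
    """Symmetric score S'_W(p_i, p_j) = -(p_i - p_j)^T W (p_i - p_j)."""
    diff = np.asarray(p_i) - np.asarray(p_j)
    return -diff @ W @ diff
```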

84 This can be seen by casting the batch objective of LMNN into an online setup, which has the form err(W) = −ω · SW′(pi, pi+) + (1 − ω) · lW′(pi, pi+, pi−). [sent-253, score-0.41]

85 Figure 3(a) compares the precision of the different symmetric variants with the original OASIS. [sent-255, score-0.134]

86 The precision of Proj-Oasis was equivalent to that of OASIS, most likely since asymmetric OASIS actually converged to an almost-symmetric model (as measured by a symmetry index ρ(W) = ‖sym(W)‖² / ‖W‖²). [sent-257, score-0.146]

87 2 Positive similarity Most similarity learning approaches focus on learning metrics. [sent-261, score-0.232]

88 The matrix square root A of W, with ATA = W, can then be used to project the data into a new space in which the Euclidean distance is equivalent to the W-distance in the original space. [sent-263, score-0.087]

89 We experimented with positive variants of OASIS, where we repeatedly projected the learned model onto the set of PSD matrices, once every t iterations. [sent-264, score-0.086]
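
A sketch of the PSD projection step (clipping negative eigenvalues after symmetrizing) and of one way to recover an embedding A with ATA = W is given below; applying it once every t iterations is left to the training loop, and the symmetric square root used here is one valid choice rather than necessarily the authors'.

```python
import numpy as np

def project_psd(W):
    """Project the symmetrized W onto the PSD cone by zeroing negative eigenvalues."""
    S = 0.5 * (W.T + W)
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * eigvals) @ eigvecs.T

def embedding(W_psd):
    """Return A with A^T A = W_psd, so ||A p_i - A p_j||^2 equals the W-distance."""
    eigvals, eigvecs = np.linalg.eigh(W_psd)
    return (eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T
```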

90 5 Discussion We have presented OASIS, a scalable algorithm for learning image similarity that captures both semantic and visual aspects of image similarity. [sent-277, score-0.331]

91 First, using a large margin online approach allows training to converge even after seeing a small fraction of potential pairs. [sent-279, score-0.1]

92 Second, the objective function of OASIS does not require the similarity measure to be a metric during training, although it appears to converge to a near-symmetric solution, whose positive projection is a good metric. [sent-280, score-0.228]

93 As such, it is more limited in its descriptive power and it is likely that class-dependent similarity models could improve precision. [sent-283, score-0.116]

94 Large scale similarity learning, applied to images from a large variety of classes, could therefore be a useful tool to address real-world problems with a large number of classes. [sent-285, score-0.232]

95 This paper focused on the training part of metric learning. [sent-286, score-0.09]

96 To use the learned metric for ranking, an efficient procedure for scoring a large set of images is needed. [sent-287, score-0.203]

97 Learning globally-consistent local distance functions for shape-based image retrieval and classification. [sent-320, score-0.111]

98 A discriminative kernel-based model to rank images from text queries. [sent-330, score-0.135]

99 Distance metric learning for large margin nearest neighbor classification. [sent-394, score-0.119]

100 Fast solvers and efficient implementations for distance metric learning. [sent-401, score-0.099]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('oasis', 0.863), ('pi', 0.182), ('lmnn', 0.138), ('lw', 0.138), ('evaluators', 0.125), ('images', 0.116), ('similarity', 0.116), ('lego', 0.109), ('precision', 0.083), ('image', 0.076), ('query', 0.072), ('psd', 0.066), ('sw', 0.065), ('metric', 0.064), ('pj', 0.061), ('mcml', 0.06), ('relevance', 0.058), ('wi', 0.049), ('semantic', 0.047), ('online', 0.046), ('runtime', 0.045), ('positivity', 0.044), ('queries', 0.044), ('mahalanobis', 0.043), ('symmetry', 0.042), ('million', 0.041), ('similarities', 0.037), ('matlab', 0.037), ('distance', 0.035), ('tested', 0.034), ('days', 0.034), ('euclidean', 0.033), ('semantically', 0.033), ('ranked', 0.032), ('neighbors', 0.032), ('triplet', 0.03), ('relevant', 0.028), ('ranking', 0.028), ('symmetric', 0.028), ('margin', 0.028), ('imposing', 0.028), ('vi', 0.028), ('nearest', 0.027), ('jain', 0.027), ('google', 0.027), ('projection', 0.026), ('training', 0.026), ('lagrangian', 0.026), ('ro', 0.025), ('anonymized', 0.025), ('evaluator', 0.025), ('logdet', 0.025), ('proj', 0.025), ('human', 0.025), ('quadratically', 0.023), ('kulis', 0.023), ('supervision', 0.023), ('learned', 0.023), ('variants', 0.023), ('classes', 0.023), ('mnist', 0.023), ('scales', 0.023), ('positive', 0.022), ('constraints', 0.022), ('rankings', 0.022), ('collected', 0.022), ('hrs', 0.022), ('orders', 0.021), ('asymmetric', 0.021), ('bilinear', 0.021), ('loss', 0.021), ('sparse', 0.021), ('curve', 0.021), ('sym', 0.02), ('gal', 0.02), ('pt', 0.019), ('text', 0.019), ('ki', 0.019), ('id', 0.019), ('hinge', 0.018), ('stopping', 0.018), ('projected', 0.018), ('rca', 0.018), ('scalability', 0.018), ('web', 0.017), ('fast', 0.017), ('core', 0.017), ('cpu', 0.017), ('implemented', 0.017), ('semide', 0.017), ('matrix', 0.017), ('visual', 0.016), ('hyper', 0.016), ('collapsing', 0.016), ('minutes', 0.016), ('pairs', 0.016), ('categories', 0.016), ('early', 0.016), ('dictionary', 0.016), ('learns', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

Author: Gal Chechik, Uri Shalit, Varun Sharma, Samy Bengio

Abstract: Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before. 1

2 0.12101271 166 nips-2009-Noisy Generalized Binary Search

Author: Robert Nowak

Abstract: This paper addresses the problem of noisy Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued hypothesis through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. GBS is used in many applications, including fault testing, machine diagnostics, disease diagnosis, job scheduling, image processing, computer vision, and active learning. In most of these cases, the responses to queries can be noisy. Past work has provided a partial characterization of GBS, but existing noise-tolerant versions of GBS are suboptimal in terms of query complexity. This paper presents an optimal algorithm for noisy GBS and demonstrates its application to learning multidimensional threshold functions. 1

3 0.11573362 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

Author: Rong Jin, Shijun Wang, Yang Zhou

Abstract: In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.

4 0.086246505 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu

Abstract: Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efficient inference algorithms for a conditional random field using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective. 1

5 0.085198179 190 nips-2009-Polynomial Semantic Indexing

Author: Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri

Abstract: We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or documentdocument pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods. 1

6 0.084254213 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

7 0.083275691 211 nips-2009-Segmenting Scenes by Matching Image Composites

8 0.079433478 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

9 0.072977982 223 nips-2009-Sparse Metric Learning via Smooth Optimization

10 0.067628488 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

11 0.065866016 260 nips-2009-Zero-shot Learning with Semantic Output Codes

12 0.064248443 96 nips-2009-Filtering Abstract Senses From Image Search Results

13 0.062370203 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

14 0.062085908 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

15 0.055365879 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

16 0.052964881 104 nips-2009-Group Sparse Coding

17 0.052242965 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

18 0.051301427 251 nips-2009-Unsupervised Detection of Regions of Interest Using Iterative Link Analysis

19 0.050807398 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

20 0.049833823 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.152), (1, 0.003), (2, -0.093), (3, 0.011), (4, -0.005), (5, 0.073), (6, -0.116), (7, 0.018), (8, 0.036), (9, 0.03), (10, 0.008), (11, 0.034), (12, 0.083), (13, 0.091), (14, -0.013), (15, -0.097), (16, 0.12), (17, 0.063), (18, -0.082), (19, -0.071), (20, 0.063), (21, 0.021), (22, -0.102), (23, 0.039), (24, 0.025), (25, -0.003), (26, -0.086), (27, -0.084), (28, -0.08), (29, 0.042), (30, -0.017), (31, 0.103), (32, 0.006), (33, -0.105), (34, -0.129), (35, 0.067), (36, 0.019), (37, -0.038), (38, -0.001), (39, -0.036), (40, -0.064), (41, -0.052), (42, -0.021), (43, 0.012), (44, -0.057), (45, 0.084), (46, 0.039), (47, -0.035), (48, 0.063), (49, 0.093)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92084807 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

Author: Gal Chechik, Uri Shalit, Varun Sharma, Samy Bengio

Abstract: Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before. 1

2 0.60950929 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton Hengel

Abstract: The learning of appropriate distance metrics is a critical problem in image classification and retrieval. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a Mahalanobis distance metric. One of the primary difficulties in learning such a metric is to ensure that the Mahalanobis matrix remains positive semidefinite. Semidefinite programming is sometimes used to enforce this constraint, but does not scale well. B OOST M ETRIC is instead based on a key observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting method is easy to implement, does not require tuning, and can accommodate various types of constraints. Experiments on various datasets show that the proposed algorithm compares favorably to those state-of-the-art methods in terms of classification accuracy and running time. 1

3 0.55106348 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

Author: Rong Jin, Shijun Wang, Yang Zhou

Abstract: In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.

4 0.54371125 166 nips-2009-Noisy Generalized Binary Search

Author: Robert Nowak

Abstract: This paper addresses the problem of noisy Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued hypothesis through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. GBS is used in many applications, including fault testing, machine diagnostics, disease diagnosis, job scheduling, image processing, computer vision, and active learning. In most of these cases, the responses to queries can be noisy. Past work has provided a partial characterization of GBS, but existing noise-tolerant versions of GBS are suboptimal in terms of query complexity. This paper presents an optimal algorithm for noisy GBS and demonstrates its application to learning multidimensional threshold functions. 1

5 0.50527507 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge

Abstract: The inter-subject alignment of functional MRI (fMRI) data is important for improving the statistical power of fMRI group analyses. In contrast to existing anatomically-based methods, we propose a novel multi-subject algorithm that derives a functional correspondence by aligning spatial patterns of functional connectivity across a set of subjects. We test our method on fMRI data collected during a movie viewing experiment. By cross-validating the results of our algorithm, we show that the correspondence successfully generalizes to a secondary movie dataset not used to derive the alignment. 1

6 0.49653676 223 nips-2009-Sparse Metric Learning via Smooth Optimization

7 0.47121823 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

8 0.44285429 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

9 0.44102696 190 nips-2009-Polynomial Semantic Indexing

10 0.43418607 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions

11 0.42485324 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

12 0.40496892 211 nips-2009-Segmenting Scenes by Matching Image Composites

13 0.39255419 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

14 0.38839918 96 nips-2009-Filtering Abstract Senses From Image Search Results

15 0.36331114 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

16 0.36289936 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

17 0.36075222 260 nips-2009-Zero-shot Learning with Semantic Output Codes

18 0.35786343 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

19 0.34837052 90 nips-2009-Factor Modeling for Advertisement Targeting

20 0.34571514 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.018), (21, 0.215), (24, 0.039), (25, 0.056), (35, 0.042), (36, 0.098), (39, 0.057), (44, 0.016), (48, 0.01), (58, 0.1), (61, 0.028), (71, 0.057), (86, 0.14), (91, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93569028 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

Author: Daniel Zoran, Yair Weiss

Abstract: We propose a new model for natural image statistics. Instead of minimizing dependency between components of natural images, we maximize a simple form of dependency in the form of tree-dependencies. By learning filters and tree structures which are best suited for natural images we observe that the resulting filters are edge filters, similar to the famous ICA on natural images results. Calculating the likelihood of an image patch using our model requires estimating the squared output of pairs of filters connected in the tree. We observe that after learning, these pairs of filters are predominantly of similar orientations but different phases, so their joint energy resembles models of complex cells. 1 Introduction and related work Many models of natural image statistics have been proposed in recent years [1, 2, 3, 4]. A common goal of many of these models is finding a representation in which components or sub-components of the image are made as independent or as sparse as possible [5, 6, 2]. This has been found to be a difficult goal, as natural images have a highly intricate structure and removing dependencies between components is hard [7]. In this work we take a different approach, instead of minimizing dependence between components we try to maximize a simple form of dependence - tree dependence. It would be useful to place this model in context of previous works about natural image statistics. Many earlier models are described by the marginal statistics solely, obtaining a factorial form of the likelihood: p(x) = pi (xi ) (1) i The most notable model of this approach is Independent Component Analysis (ICA), where one seeks to find a linear transformation which maximizes independence between components (thus fitting well with the aforementioned factorization). This model has been applied to many scenarios, and proved to be one of the great successes of natural image statistics modeling with the emergence of edge-filters [5]. This approach has two problems. The first is that dependencies between components are still very strong, even with those learned transformation seeking to remove them. Second, it has been shown that ICA achieves, after the learned transformation, only marginal gains when measured quantitatively against simpler method like PCA [7] in terms of redundancy reduction. A different approach was taken recently in the form of radial Gaussianization [8], in which components which are distributed in a radially symmetric manner are made independent by transforming them non-linearly into a radial Gaussian, and thus, independent from one another. A more elaborate approach, related to ICA, is Independent Subspace Component Analysis or ISA. In this model, one looks for independent subspaces of the data, while allowing the sub-components 1 Figure 1: Our model with respect to marginal models such as ICA (a), and ISA like models (b). Our model, being a tree based model (c), allows components to belong to more than one subspace, and the subspaces are not required to be independent. of each subspace to be dependent: p(x) = pk (xi∈K ) (2) k This model has been applied to natural images as well and has been shown to produce the emergence of phase invariant edge detectors, akin to complex cells in V1 [2]. Independent models have several shortcoming, but by far the most notable one is the fact that the resulting components are, in fact, highly dependent. First, dependency between the responses of ICA filters has been reported many times [2, 7]. 
Also, dependencies between ISA components has also been observed [9]. Given these robust dependencies between filter outputs, it is somewhat peculiar that in order to get simple cell properties one needs to assume independence. In this work we ask whether it is possible to obtain V1 like filters in a model that assumes dependence. In our model we assume the filter distribution can be described by a tree graphical model [10] (see Figure 1). Degenerate cases of tree graphical models include ICA (in which no edges are present) and ISA (in which edges are only present within a subspace). But in its non-degenerate form, our model assumes any two filter outputs may be dependent. We allow components to belong to more than one subspace, and as a result, do not require independence between them. 2 Model and learning Our model is comprised of three main components. Given a set of patches, we look for the parameters which maximize the likelihood of a whitened natural image patch z: N p(yi |ypai ; β) p(z; W, β, T ) = p(y1 ) (3) i=1 Where y = Wz, T is the tree structure, pai denotes the parent of node i and β is a parameter of the density model (see below for the details). The three components we are trying to learn are: 1. The filter matrix W, where every row defines one of the filters. The response of these filters is assumed to be tree-dependent. We assume that W is orthogonal (and is a rotation of a whitening transform). 2. The tree structure T which specifies which components are dependent on each other. 3. The probability density function for connected nodes in the tree, which specify the exact form of dependency between nodes. All three together describe a complete model for whitened natural image patches, allowing likelihood estimation and exact inference [11]. We perform the learning in an iterative manner: we start by learning the tree structure and density model from the entire data set, then, keeping the structure and density constant, we learn the filters via gradient ascent in mini-batches. Going back to the tree structure we repeat the process many times iteratively. It is important to note that both the filter set and tree structure are learned from the data, and are continuously updated during learning. In the following sections we will provide details on the specifics of each part of the model. 2 β=0.0 β=0.5 β=1.0 β=0.0 β=0.5 β=1.0 2 1 1 1 1 1 1 0 −1 0 −1 −2 −1 −2 −3 0 x1 2 0 x1 2 0 x1 2 −2 −3 −2 0 x1 2 0 −1 −2 −3 −2 0 −1 −2 −3 −2 0 −1 −2 −3 −2 0 x2 3 2 x2 3 2 x2 3 2 x2 3 2 x2 3 2 x2 3 −3 −2 0 x1 2 −2 0 x1 2 Figure 2: Shape of the conditional (Left three plots) and joint (Right three plots) density model in log scale for several values of β, from dependence to independence. 2.1 Learning tree structure In their seminal paper, Chow and Liu showed how to learn the optimal tree structure approximation for a multidimensional probability density function [12]. This algorithm is easy to apply to this scenario, and requires just a few simple steps. First, given the current estimate for the filter matrix W, we calculate the response of each of the filters with all the patches in the data set. Using these responses, we calculate the mutual information between each pair of filters (nodes) to obtain a fully connected weighted graph. The final step is to find a maximal spanning tree over this graph. The resulting unrooted tree is the optimal tree approximation of the joint distribution function over all nodes. 
We will note that the tree is unrooted, and the root can be chosen arbitrarily - this means that there is no node, or filter, that is more important than the others - the direction in the tree graph is arbitrary as long as it is chosen in a consistent way. 2.2 Joint probability density functions Gabor filter responses on natural images exhibit highly kurtotic marginal distributions, with heavy tails and sharp peaks [13, 3, 14]. Joint pair wise distributions also exhibit this same shape with varying degrees of dependency between the components [13, 2]. The density model we use allows us to capture both the highly kurtotic nature of the distributions, while still allowing varying degrees of dependence using a mixing variable. We use a mix of two forms of finite, zero mean Gaussian Scale Mixtures (GSM). In one, the components are assumed to be independent of each other and in the other, they are assumed to be spherically distributed. The mixing variable linearly interpolates between the two, allowing us to capture the whole range of dependencies: p(x1 , x2 ; β) = βpdep (x1 , x2 ) + (1 − β)pind (x1 , x2 ) (4) When β = 1 the two components are dependent (unless p is Gaussian), whereas when β = 0 the two components are independent. For the density functions themselves, we use a finite GSM. The dependent case is a scale mixture of bivariate Gaussians: 2 πk N (x1 , x2 ; σk I) pdep (x1 , x2 ) = (5) k While the independent case is a product of two independent univariate Gaussians: 2 πk N (x1 ; σk ) pind (x1 , x2 ) = k 2 πk N (x2 ; σk ) (6) k 2 Estimating parameters πk and σk for the GSM is done directly from the data using Expectation Maximization. These parameters are the same for all edges and are estimated only once on the first iteration. See Figure 2 for a visualization of the conditional distribution functions for varying values of β. We will note that the marginal distributions for the two types of joint distributions above are the same. The mixing parameter β is also estimated using EM, but this is done for each edge in the tree separately, thus allowing our model to theoretically capture the fully independent case (ICA) and other degenerate models such as ISA. 2.3 Learning tree dependent components Given the current tree structure and density model, we can now learn the matrix W via gradient ascent on the log likelihood of the model. All learning is performed on whitened, dimensionally 3 reduced patches. This means that W is a N × N rotation (orthonormal) matrix, where N is the number of dimensions after dimensionality reduction (see details below). Given an image patch z we multiply it by W to get the response vector y: y = Wz (7) Now we can calculate the log likelihood of the given patch using the tree model (which we assume is constant at the moment): N log p(yi |ypai ) log p(y) = log p(yroot ) + (8) i=1 Where pai denotes the parent of node i. Now, taking the derivative w.r.t the r-th row of W: ∂ log p(y) ∂ log p(y) T = z ∂Wr ∂yr (9) Where z is the whitened natural image patch. Finally, we can calculate the derivative of the log likelihood with respect to the r-th element in y: ∂ log p(ypar , yr ) ∂ log p(y) = + ∂yr ∂yr c∈C(r) ∂ log p(yr , yc ) ∂ log p(yr ) − ∂yr ∂yr (10) Where C(r) denote the children of node r. In summary, the gradient ascent rule for updating the rotation matrix W is given by: t+1 t Wr = Wr + η ∂ log p(y) T z ∂yr (11) Where η is the learning rate constant. After update, the rows of W are orthonormalized. 
This gradient ascent rule is applied for several hundreds of patches (see details below), after which the tree structure is learned again as described in Section 2.1, using the new filter matrix W, repeating this process for many iterations. 3 Results and analysis 3.1 Validation Before running the full algorithm on natural image data, we wanted to validate that it does produce sensible results with simple synthetic data. We generated noise from four different models, one is 1/f independent Gaussian noise with 8 Discrete Cosine Transform (DCT) filters, the second is a simple ICA model with 8 DCT filters, and highly kurtotic marginals. The third was a simple ISA model - 4 subspaces, each with two filters from the DCT filter set. Distribution within the subspace was a circular, highly kurtotic GSM, and the subspaces were sampled independently. Finally, we generated data from a simple synthetic tree of DCT filters, using the same joint distributions as for the ISA model. These four synthetic random data sets were given to the algorithm - results can be seen in Figure 3 for the ICA, ISA and tree samples. In all cases the model learned the filters and distribution correctly, reproducing both the filters (up to rotations within the subspace in ISA) and the dependency structure between the different filters. In the case of 1/f Gaussian noise, any whitening transformation is equally likely and any value of beta is equally likely. Thus in this case, the algorithm cannot find the tree or the filters. 3.2 Learning from natural image patches We then ran experiments with a set of natural images [9]1 . These images contain natural scenes such as mountains, fields and lakes. . The data set was 50,000 patches, each 16 × 16 pixels large. The patches’ DC was removed and they were then whitened using PCA. Dimension was reduced from 256 to 128 dimensions. The GSM for the density model had 16 components. Several initial 1 available at http://www.cis.hut.fi/projects/ica/imageica/ 4 Figure 3: Validation of the algorithm. Noise was generated from three models - top row is ICA, middle row is ISA and bottom row is a tree model. Samples were then given to the algorithm. On the right are the resulting learned tree models. Presented are the learned filters, tree model (with white edges meaning β = 0, black meaning β = 1 and grays intermediate values) and an example of a marginal histogram for one of the filters. It can be seen that in all cases all parts of the model were correctly learned. Filters in the ISA case were learned up to rotation within the subspace, and all filters were learned up to sign. β values for the ICA case were always below 0.1, as were the values of β between subspaces in ISA. conditions for the matrix W were tried out (random rotations, identity) but this had little effect on results. Mini-batches of 10 patches each were used for the gradient ascent - the gradient of 10 patches was summed, and then normalized to have unit norm. The learning rate constant η was set to 0.1. Tree structure learning and estimation of the mixing variable β were done every 500 mini-batches. All in all, 50 iterations were done over the data set. 3.3 Filters and tree structure Figures 4 and 5 show the learned filters (WQ where Q is the whitening matrix) and tree structure (T ) learned from natural images. Unlike the ISA toy data in figure 3, here a full tree was learned and β is approximately one for all edges. The GSM that was learned for the marginals was highly kurtotic. 
It can be seen that resulting filters are edge filters at varying scales, positions and orientations. This is similar to the result one gets when applying ICA to natural images [5, 15]. More interesting is Figure 4: Left: Filter set learned from 16 × 16 natural image patches. Filters are ordered by PCA eigenvalues, largest to smallest. Resulting filters are edge filters having different orientations, positions, frequencies and phases. Right: The “feature” set learned, that is, columns of the pseudo inverse of the filter set. 5 Figure 5: The learned tree graph structure and feature set. It can be seen that neighboring features on the graph have similar orientation, position and frequency. See Figure 4 for a better view of the feature details, and see text for full detail and analysis. Note that the figure is rotated CW. 6 Optimal Orientation Optimal Frequency 3.5 Optimal Phase 7 3 0.8 6 2.5 Optimal Position Y 0.9 6 5 5 0.7 0.6 3 Child 1.5 4 Child 2 Child Child 4 3 2 1 1 0.4 0.3 2 0.5 0.5 0.2 0 1 0.1 0 0 1 2 Parent 3 4 0 0 2 4 Parent 6 8 0 0 1 2 3 Parent 4 5 6 0 0.2 0.4 0.6 Parent 0.8 1 Figure 6: Correlation of optimal parameters in neighboring nodes in the tree graph. Orientation, frequency and position are highly correlated, while phase seems to be entirely uncorrelated. This property of correlation in frequency and orientation, while having no correlation in phase is related to the ubiquitous energy model of complex cells in V1. See text for further details. Figure 7: Left: Comparison of log likelihood values of our model with PCA, ICA and ISA. Our model gives the highest likelihood. Right: Samples taken at random from ICA, ISA and our model. Samples from our model appear to contain more long-range structure. the tree graph structure learned along with the filters which is shown in Figure 5. It can be seen that neighboring filters (nodes) in the tree tend to have similar position, frequency and orientation. Figure 6 shows the correlation of optimal frequency, orientation and position for neighboring filters in the tree - it is obvious that all three are highly correlated. Also apparent in this figure is the fact that the optimal phase for neighboring filters has no significant correlation. It has been suggested that filters which have the same orientation, frequency and position with different phase can be related to complex cells in V1 [2, 16]. 3.4 Comparison to other models Since our model is a generalization of both ICA and ISA we use it to learn both models. In order to learn ICA we used the exact same data set, but the tree had no edges and was not learned from the data (alternatively, we could have just set β = 0). For ISA we used a forest architecture of 2 node trees, setting β = 1 for all edges (which means a spherical symmetric distribution), no tree structure was learned. Both models produce edge filters similar to what we learn (and to those in [5, 15, 6]). The ISA model produces neighboring nodes with similar frequency and orientation, but different phase, as was reported in [2]. We also compare to a simple PCA whitening transform, using the same whitening transform and marginals as in the ICA case, but setting W = I. We compare the likelihood each model gives for a test set of natural image patches, different from the one that was used in training. There were 50,000 patches in the test set, and we calculate the mean log likelihood over the entire set. The table in Figure 7 shows the result - as can be seen, our model performs better in likelihood terms than both ICA and ISA. 
Using a tree model, as opposed to more complex graphical models, allows for easy sampling from the model. Figure 7 shows 20 random samples taken from our tree model along with samples from the ICA and ISA models. Note the elongated structures (e.g. in the bottom left sample) in the samples from the tree model, and compare to patches sampled from the ICA and ISA models. 7 40 40 30 30 20 20 10 1 10 0.8 0.6 0.4 0 0.2 0 0 1 2 3 Orientation 4 0 0 2 4 6 Frequency 8 0 2 4 Phase Figure 8: Left: Interpretation of the model. Given a patch, the response of all edge filters is computed (“simple cells”), then at each edge, the corresponding nodes are squared and summed to produce the response of the “complex cell” this edge represents. Both the response of complex cells and simple cells is summed to produce the likelihood of the patch. Right: Response of a “complex cell” in our model to changing phase, frequency and orientation. Response in the y-axis is the sum of squares of the two filters in this “complex cell”. Note that while the cell is selective to orientation and frequency, it is rather invariant to phase. 3.5 Tree models and complex cells One way to interpret the model is looking at the likelihood of a given patch under this model. For the case of β = 1 substituting Equation 4 into Equation 3 yields: 2 2 ρ( yi + ypai ) − ρ(|ypai |) log L(z) = (12) i 2 Where ρ(x) = log k πk N (x; σk ) . This form of likelihood has an interesting similarity to models of complex cells in V1 [2, 4]. In Figure 8 we draw a simple two-layer network that computes the likelihood. The first layer applies linear filters (“simple cells”) to the image patch, while the second layer sums the squared outputs of similarly oriented filters from the first layer, having different phases, which are connected in the tree (“complex cells”). Output is also dependent on the actual response of the “simple cell” layer. The likelihood here is maximized when both the response of the parent filter ypai and the child yi is zero, but, given that one filter has responded with a non-zero value, the likelihood is maximized when the other filter also fires (see the conditional density in Figure 2). Figure 8 also shows an example of the phase invariance which is present in the learned

2 0.92753029 246 nips-2009-Time-Varying Dynamic Bayesian Networks

Author: Le Song, Mladen Kolar, Eric P. Xing

Abstract: Directed graphical models such as Bayesian networks are a favored formalism for modeling the dependency structures in complex multivariate systems such as those encountered in biology and neural science. When a system is undergoing dynamic transformation, temporally rewiring networks are needed for capturing the dynamic causal influences between covariates. In this paper, we propose time-varying dynamic Bayesian networks (TV-DBN) for modeling the structurally varying directed dependency structures underlying non-stationary biological/neural time series. This is a challenging problem due the non-stationarity and sample scarcity of time series data. We present a kernel reweighted 1 -regularized auto-regressive procedure for this problem which enjoys nice properties such as computational efficiency and provable asymptotic consistency. To our knowledge, this is the first practical and statistically sound method for structure learning of TVDBNs. We applied TV-DBNs to time series measurements during yeast cell cycle and brain response to visual stimuli. In both cases, TV-DBNs reveal interesting dynamics underlying the respective biological systems. 1

3 0.89987946 31 nips-2009-An LP View of the M-best MAP problem

Author: Menachem Fromer, Amir Globerson

Abstract: We consider the problem of finding the M assignments with maximum probability in a probabilistic graphical model. We show how this problem can be formulated as a linear program (LP) on a particular polytope. We prove that, for tree graphs (and junction trees in general), this polytope has a particularly simple form and differs from the marginal polytope in a single inequality constraint. We use this characterization to provide an approximation scheme for non-tree graphs, by using the set of spanning trees over such graphs. The method we present puts the M -best inference problem in the context of LP relaxations, which have recently received considerable attention and have proven useful in solving difficult inference problems. We show empirically that our method often finds the provably exact M best configurations for problems of high tree-width. A common task in probabilistic modeling is finding the assignment with maximum probability given a model. This is often referred to as the MAP (maximum a-posteriori) problem. Of particular interest is the case of MAP in graphical models, i.e., models where the probability factors into a product over small subsets of variables. For general models, this is an NP-hard problem [11], and thus approximation algorithms are required. Of those, the class of LP based relaxations has recently received considerable attention [3, 5, 18]. In fact, it has been shown that some problems (e.g., fixed backbone protein design) can be solved exactly via sequences of increasingly tighter LP relaxations [13]. In many applications, one is interested not only in the MAP assignment but also in the M maximum probability assignments [19]. For example, in a protein design problem, we might be interested in the M amino acid sequences that are most stable on a given backbone structure [2]. In cases where the MAP problem is tractable, one can devise tractable algorithms for the M best problem [8, 19]. Specifically, for low tree-width graphs, this can be done via a variant of max-product [19]. However, when finding MAPs is not tractable, it is much less clear how to approximate the M best case. One possible approach is to use loopy max-product to obtain approximate max-marginals and use those to approximate the M best solutions [19]. However, this is largely a heuristic and does not provide any guarantees in terms of optimality certificates or bounds on the optimal values. LP approximations to MAP do enjoy such guarantees. Specifically, they provide upper bounds on the MAP value and optimality certificates. Furthermore, they often work for graphs with large tree-width [13]. The goal of the current work is to leverage the power of LP relaxations to the M best case. We begin by focusing on the problem of finding the second best solution. We show how it can be formulated as an LP over a polytope we call the “assignment-excluding marginal polytope”. In the general case, this polytope may require an exponential number of inequalities, but we prove that when the graph is a tree it has a very compact representation. We proceed to use this result to obtain approximations to the second best problem, and show how these can be tightened in various ways. Next, we show how M best assignments can be found by relying on algorithms for 1 second best assignments, and thus our results for the second best case can be used to devise an approximation algorithm for the M best problem. 
We conclude by applying our method to several models, showing that it often finds the exact M best assignments.

1 The M-best MAP problem and its LP formulation

Consider a function on n variables defined as:

f(x_1, \ldots, x_n; \theta) = \sum_{ij \in E} \theta_{ij}(x_i, x_j) + \sum_{i \in V} \theta_i(x_i)   (1)

where V and E are the vertices and edges of a graph G with n nodes. We shall be interested in the M assignments with largest f(x; θ) value.¹ Denote these by x^(1), ..., x^(M), so that x^(1) is the assignment that maximizes f(x; θ), x^(2) is the 2nd best assignment, etc.

The MAP problem (i.e., finding x^(1)) can be formulated as an LP as follows [15]. Let µ be a vector of distributions that includes {µ_ij(x_i, x_j)}_{ij∈E} over edge variables and {µ_i(x_i)}_{i∈V} over nodes. The set of µ that arise from some joint distribution is known as the marginal polytope [15] and is denoted by M(G). Formally:

M(G) = \{\mu \mid \exists p(x) \in \Delta \ \text{s.t.}\ p(x_i, x_j) = \mu_{ij}(x_i, x_j),\ p(x_i) = \mu_i(x_i)\}

where Δ is the set of distributions on x. The MAP problem can then be shown to be equivalent to the following LP:²

\max_x f(x; \theta) = \max_{\mu \in M(G)} \mu \cdot \theta   (2)

It can be shown that this LP always has a maximizing µ that is a vertex of M(G) and is integral. Furthermore, this µ corresponds to the MAP assignment x^(1). Although the number of variables in this LP is only O(|E| + |V|), the difficulty comes from the exponential number of linear inequalities generally required to describe the marginal polytope M(G).

We shall find it useful to define a mapping between assignments x and integral vertices of the polytope. Given an integral vertex v ∈ M(G), define x(v) to be the assignment that maximizes v_i(x_i). And, given an assignment z, define v(z) to be the integral vertex in M(G) corresponding to the assignment z. Thus the LP in Eq. 2 will be maximized by v(x^(1)).

One simple outer bound of the marginal polytope is the local polytope M_L(G), which only enforces pairwise constraints between variables:

M_L(G) = \left\{\mu \ge 0 \ \middle|\ \sum_{x_i} \mu_{ij}(x_i, x_j) = \mu_j(x_j),\ \sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i),\ \sum_{x_i} \mu_i(x_i) = 1 \right\}   (3)

The LP relaxation is then to maximize µ · θ where µ ∈ M_L(G). For tree-structured graphs, M_L(G) = M(G) [15] and thus the LP relaxation yields the exact MAP x^(1).

2 An LP formulation for the 2nd-best MAP

Assume we found the MAP assignment x^(1) and are now interested in finding x^(2). Is there a simple LP whose solution yields x^(2)? We begin by focusing on the case where G is a tree, so that the local LP relaxation is exact. We first treat the case of a connected tree. To construct an LP whose solution is x^(2), a natural approach is to use the LP for x^(1) (i.e., the LP in Eq. 2) but somehow eliminate the solution x^(1) using additional constraints. This, however, is somewhat trickier than it sounds. The key difficulty is that the new constraints should not generate fractional vertices, so that the resulting LP is still exact. We begin by defining the polytope over which we need to optimize in order to obtain x^(2).

¹ This is equivalent to finding the M maximum probability assignments for a model p(x) ∝ e^{f(x; θ)}.
² We use the notation \mu \cdot \theta = \sum_{ij \in E} \sum_{x_i, x_j} \mu_{ij}(x_i, x_j)\,\theta_{ij}(x_i, x_j) + \sum_i \sum_{x_i} \mu_i(x_i)\,\theta_i(x_i).

Definition 1. The assignment-excluding marginal polytope M̂(G, z) is simply the convex hull of all (integral) vertices v(x) for x ≠ z; formally,

\hat{M}(G, z) = \{\mu \mid \exists p(x) \in \Delta \ \text{s.t.}\ p(z) = 0,\ p(x_i, x_j) = \mu_{ij}(x_i, x_j),\ p(x_i) = \mu_i(x_i)\}
(4) ˆ The following result shows that optimizing over M(G, x(1) ) will yield the second best soluˆ tion x(2) , so that we refer to M(G, x(1) ) as the second-best marginal polytope. Lemma 1. The 2nd best solution is obtained via the following LP: maxx=x(1) f (x; θ) = maxµ∈M(G,x(1) ) µ · θ. Furthermore, the µ that maximizes the LP on ˆ the right is integral and corresponds to the second-best MAP assignment x(2) . The proof is similar to that of Eq. 2: instead of optimizing over x, we optimize over distributions p(x), while enforcing that p(x(1) ) = 0 so that x(1) is excluded from the maximization. The key question which we now address is how to obtain a simple characterization of ˆ ˆ M(G, z). Intuitively, it would seems that M(G, z) should be “similar” to M(G), such that it can be described as M(G) plus some constraints that “block” the assignment z. To illustrate the difficulty in finding such “blocking” constraints, consider the following constraint, originally suggested by Santos [10]: i µi (zi ) ≤ n − 1. This inequality is not satisfied by µ = v(z) since v(z) attains the value n for the LHS of the above. Furthermore, for any x = z and µ = v(x), the LHS would be n − 1 or less. Thus, this inequality separates ˆ v(z) from all other integral vertices. One might conclude that we can define M(G, z) by adding this inequality to M(G). The difficulty is that the resulting polytope has fractional vertices,3 and maximizing over it won’t generally yield an integral solution. It turns out that there is a different inequality that does yield an exact characterization of ˆ M(G, z) when G is a tree. We now define this inequality and state our main theorem. Definition 2. Consider the functional I(µ, z) (which is linear in µ): (1 − di )µi (zi ) + I(µ, z) = i µij (zi , zj ) (5) ij∈E where di is the degree of node i in the tree graph G. ˆ Theorem 1. Adding the single inequality I(µ, z) ≤ 0 to M(G) yields M(G, z). ˆ M(G, z) = {µ | µ ∈ M(G), I(µ, z) ≤ 0 } (6) The theorem is proved in the appendix. Taken together with Lemma 1, it implies that x(2) may be obtained via an LP that is very similar to the MAP-LP, but has an additional constraint. We note the interesting similarity between I(µ, z) and the Bethe entropy [20]. The only difference is that in Bethe, µi , µij are replaced by H(Xi ), H(Xi , Xj ) respectively.4 The theorem also generalizes to the case where G is not a tree, but we have a junction tree for G. In this case, the theorem still holds if we define a generalized I(µ, z) inequality as: (1 − dS )µS (zS ) + S∈S µC (zC ) ≤ 0 (7) C∈C where C and S are the junction tree cliques and their separators, respectively, and dS is the number of cliques that intersect on separator S. In this case, the marginal polytope should enforce consistency between marginals µC (zC ) and their separators µS (zS ). However, such a characterization requires variables whose cardinality is exponential in the tree-width and is thus tractable only for graphs of low tree-width. In the next section, we address approximations for general graphs. A corresponding result exists for the case when G is a forest. In this case, the inequality in Eq. 6 is modified to: I(µ, z) ≤ |P | − 1, where |P | denotes the number of connected components of G. Interestingly, for a graph without edges, this gives the Santos inequality. 3 Consider the case of a single edge between 2 nodes where the MAP assignment is (0, 0). Adding the inequality µ1 (0) + µ2 (0) ≤ 1 produces the fractional vertex (0.5, 0.5). 
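To see Lemma 1 and Theorem 1 in action on a case where the local polytope is exact, the sketch below sets up the second-best LP for a small binary chain: it maximizes µ · θ over the local polytope with the single extra inequality I(µ, z) ≤ 0, where z is the MAP assignment. The dense indexing and the use of scipy's linprog are illustrative choices, not the paper's solver.

```python
import numpy as np
from scipy.optimize import linprog

def second_best_chain(theta_node, theta_edge, z):
    """Second-best MAP on a binary chain via the LP of Lemma 1 plus Theorem 1.
    theta_node: (n, 2) unary potentials; theta_edge: (n-1, 2, 2) pairwise potentials.
    z: the MAP assignment (length-n array of 0/1) to be excluded."""
    n = len(theta_node)
    nid = lambda i, xi: 2 * i + xi                       # index of mu_i(xi)
    eid = lambda i, xi, xj: 2 * n + 4 * i + 2 * xi + xj  # index of mu_{i,i+1}(xi, xj)
    nvar = 2 * n + 4 * (n - 1)

    c = np.zeros(nvar)                                   # minimize -theta . mu
    for i in range(n):
        for xi in range(2):
            c[nid(i, xi)] = -theta_node[i, xi]
    for i in range(n - 1):
        for xi in range(2):
            for xj in range(2):
                c[eid(i, xi, xj)] = -theta_edge[i, xi, xj]

    A_eq, b_eq = [], []
    for i in range(n):                                   # normalization constraints
        row = np.zeros(nvar); row[nid(i, 0)] = row[nid(i, 1)] = 1
        A_eq.append(row); b_eq.append(1.0)
    for i in range(n - 1):                               # local consistency constraints
        for xi in range(2):
            row = np.zeros(nvar); row[eid(i, xi, 0)] = row[eid(i, xi, 1)] = 1
            row[nid(i, xi)] = -1; A_eq.append(row); b_eq.append(0.0)
        for xj in range(2):
            row = np.zeros(nvar); row[eid(i, 0, xj)] = row[eid(i, 1, xj)] = 1
            row[nid(i + 1, xj)] = -1; A_eq.append(row); b_eq.append(0.0)

    # Single extra inequality of Theorem 1: I(mu, z) <= 0
    deg = np.ones(n, dtype=int); deg[1:-1] = 2           # node degrees on a chain
    row = np.zeros(nvar)
    for i in range(n):
        row[nid(i, z[i])] = 1 - deg[i]
    for i in range(n - 1):
        row[eid(i, z[i], z[i + 1])] += 1

    res = linprog(c, A_ub=[row], b_ub=[0.0], A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    mu_nodes = res.x[:2 * n].reshape(n, 2)
    return mu_nodes.argmax(axis=1)
```

On a tree the returned marginals are integral by Theorem 1, so reading off the per-node argmax recovers x^(2) exactly.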
4 The connection to Bethe can be more clearly understood from a duality-based proof of Theorem 1. We will cover this in an extended version of the manuscript. 3 2nd best LPs for general graphs - Spanning tree inequalities 3 When the graph G is not a tree, the marginal polytope M(G) generally requires an exponential number of inequalities. However, as mentioned above, it does have an exact description in terms of marginals over cliques and separators of a junction tree. Given such marginals on ˆ junction tree cliques, we also have an exact characterization of M(G, z) via the constraint in Eq. 7. However, in general, we cannot afford to be exponential in tree-width. Thus a common strategy [15] is to replace M(G) with an outer bound that enforces consistency between marginals on overlapping sets of variables. The simplest example is ML (G) in Eq. 3. ˆ In what follows, we describe an outer-bound approximation scheme for M(G, z). We use ML (G) as the approximation for M(G) (more generally ML (G) can enforce consistency between any set of small regions, e.g., triplets). When G is not a tree, the linear constraint in ˆ Eq. 6 will no longer suffice to derive M(G, z). Moreover, direct application of the inequality will incorrectly remove some integral vertices. An alternative approach is to add inequalities that separate v(z) from the other integral vertices. This will serve to eliminate more and more fractional vertices, and if enough constraints are added, this may result in an integral solution. One obvious family of such constraints are those corresponding to spanning trees in G and have the form of Eq. 5. Definition 3. Consider any T that is a spanning tree of G. Define the functional I T (µ, z): (1 − dT )µi (zi ) + i I T (µ, z) = i µij (zi , zj ) (8) ij∈T where dT is the degree of i in T . We refer to I T (µ, z) ≤ 0 as a spanning tree inequality. i For any sub-tree T of G, the corresponding spanning tree inequality separates the vertex v(z) from the other vertices. This can be shown via similar arguments as in the proof of Theorem 1. Note, however, that the resulting polytope may still have fractional vertices. The above argument shows that any spanning tree provides a separating inequality for ˆ M(G, z). In principle, we would like to use as many such inequalities as possible. Definition 4. The spanning tree assignment-excluding marginal polytope is defined as: ˆ MST (G, z) = µ | µ ∈ ML (G), L ∀ tree T ⊆ E I T (µ, z) ≤ 0 (9) where the ST notation indicates the inclusion of all spanning tree inequalities for G.5 Thus, we would actually like to perform the following optimization problem: max ˆ µ∈MST (G,z) L µ·θ ˆ as an approximation to optimization over M(G, z); i.e., we seek the optimal µ subject to all spanning tree inequalities for G with the ambition that this µ be integral and thus provide the non-z MAP assignment, with a certificate of optimality. Although the number of spanning trees is exponential in n, it turns out that all spanning inequalities can be used in practice. One way to achieve this is via a cutting plane algorithm [12] that finds the most violated spanning tree inequality and adds it to the LP. 
To implement this efficiently, we note that for a particular µ and a spanning tree T , the value of I T (µ, z) can be decomposed into a sum over the edges in T (and a T -independent constant): I T (µ, z) = µi (zi ) µij (zi , zj ) − µi (zi ) − µj (zj ) + (10) i ij∈T The tree maximizing the above is the maximum-weight spanning tree with edge-weights wij = µij (zi , zj ) − µi (zi ) − µj (zj ). It can thus be found efficiently. The cutting plane algorithm proceeds as follows. We start by adding an arbitrary spanning tree. Then, as long as the optimal µ is fractional, we find the spanning tree inequality that µ most violates (where this is implemented via the maximum-weight spanning tree). This constraint will necessarily remove µ from the polytope. If there are no violated inequalities 5 ˆ ˆL Note that M(G, z) ⊆ MST (G, z) ⊂ ML (G). 4 but µ is still fractional, then spanning tree inequalities do not suffice to find an integral solution (but see below on hypertree constraints to add in this case). In practice, we found that only a relatively small number of inequalities are needed to successfully yield an integral solution, or determine that all such inequalities are already satisfied. An alternative approach for solving the all spanning-tree problem is to work via the dual. The dual variables roughly correspond to points in the spanning tree polytope [16], optimization over which can be done in polynomial time, e.g., via the ellipsoid algorithm. We do not pursue this here since the cutting plane algorithm performed well in our experiments. ˆ As mentioned earlier, we can exactly characterize M(G, z) using Eq. 7, albeit at a cost exponential in the tree-width of the graph. A practical compromise would be to use inequalities over clique trees of G, where the cliques are relatively small, e.g., triplets. The corresponding constraint (Eq. 7 with the small cliques and their separators) will necessarily separate v(z) from the other integral vertices. Finding the maximally violated such inequality is an NP-hard problem, equivalent to a prize collecting Steiner tree problem, but recent work has found that such problems are often exactly solvable in practice [7]. It thus might be practical to include all such trees as constraints using a cutting plane algorithm. 4 From 2nd -best to M-best Thus far, we only dealt with the 2nd best case. As we show now, it turns out that the 2nd -best formalism can be used to devise an algorithm for M best. We begin by describing an algorithm for the exact M best and then show how it can be used to approximate those via the approximations for 2nd best described above. Fig. 1 describes our scheme, which we call Partitioning for Enumerating Solutions (or PES) for solving the M best problem. The scheme is general and only assumes that MAP-“like” problems can be solved. It is inspired by several pre-existing M best solution schemes [4, 6, 8, 19] but differs from them in highlighting the role of finding a second best solution within a given subspace. 
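The separation step of the cutting-plane loop is exactly a maximum-weight spanning tree computation with the edge weights of Eq. 10. A sketch using networkx is given below; the dictionary-based marginal containers are assumed data structures, not the paper's.

```python
import networkx as nx

def most_violated_spanning_tree(G, mu_node, mu_edge, z):
    """Find the spanning tree T of G whose inequality I^T(mu, z) <= 0 is most violated.
    mu_node[i][xi] and mu_edge[(i, j)][xi][xj] hold the current (possibly fractional)
    LP marginals; z is the assignment being excluded.  Returns (tree_edges, value)."""
    H = nx.Graph()
    for (i, j) in G.edges():
        # edge weight from Eq. 10: w_ij = mu_ij(z_i, z_j) - mu_i(z_i) - mu_j(z_j)
        w = mu_edge[(i, j)][z[i]][z[j]] - mu_node[i][z[i]] - mu_node[j][z[j]]
        H.add_edge(i, j, weight=w)
    T = nx.maximum_spanning_tree(H, weight="weight")
    # I^T(mu, z) = sum_i mu_i(z_i) + sum_{ij in T} w_ij
    value = sum(mu_node[i][z[i]] for i in G.nodes())
    value += sum(H[i][j]["weight"] for (i, j) in T.edges())
    return list(T.edges()), value
```

If the returned value is positive, the corresponding inequality I^T(µ, z) ≤ 0 is violated and is added to the LP; otherwise every spanning tree inequality is already satisfied.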
for m ← 1 to M do if m = 1 then Run MAP solver to obtain the best assignment: x(1) ≡ arg max f (x; θ) CONSTRAINTS1 ← ∅ else k ←− arg max ′ k′ ∈{1,...,m−1} f (y(k ) ; θ) // sub-space containing mth best assignment x(m) ← y(k) // mth best assignment // A variable choice that distinguishes x(m) from x(k) : (m) (v, a) ← any member of the set {(i, xi (m) ) : xi (k) = xi } CONSTRAINTSm ← CONSTRAINTSk ∪ {xv = a} // Eliminate x(k) (as MAP) from subspace m CONSTRAINTSk ← CONSTRAINTSk ∪ {xv = a} // Eliminate x(m) (as 2nd -best) from subspace k y(k) ← CalcNextBestSolution(CONSTRAINTSk , x(k) ) end y(m) ← CalcNextBestSolution(CONSTRAINTSm , x(m) ) end return {x(m) }M m=1 /* Find next best solution in sub-space defined by CONSTRAINTS */ Function CalcNextBestSolution(CONSTRAINTS, x(∗) ) // x(∗) is the MAP in the sub-space defined by CONSTRAINTS: Run MAP solver to obtain the second-best solution: y ≡ arg max f (x; θ), and return y. x=x(∗) ,CONSTRAINTS end Figure 1: Pseudocode for the PES algorithm. The modus operandi of the PES algorithm is to efficiently partition the search space while systematically excluding all previously determined assignments. Significantly, any MAP 5 Attractive Grids Ranks Run-times 1 50 Mixed Grids Ranks Run-times 1 50 0.5 0 S N B 0 Hard Protein SCP Ranks Run-times 1 50 0.5 S N B 0 0 S+R N+R B+R 0.5 S+R N+R B+R 0 S+R B B+R 0 S+R B B+R Figure 2: Number of best ranks and normalized run-times for the attractive and mixed grids, and the more difficult protein SCP problems. S, N, and B denote the STRIPES, Nilsson, and BMMF algorithms. Algorithms marked with +R denote that regions of variables were added for those runs. solver can be plugged into it, on the condition that it is capable of solving the arg max in the CalcNextBestSolution subroutine. The correctness of PES can be shown by observing that at the M th stage, all previous best solutions are excluded from the optimization and no other assignment is excluded. Of note, this simple partitioning scheme is possible due to the observation that the first-best and second-best MAP assignments must differ in the assignment of at least one variable in the graph. The main computational step of the PES algorithm is to maximize f (x; θ) subject to x = x(∗) and x ∈ CONSTRAINTS (see the CalcNextBestSolution subroutine). The CONSTRAINTS set merely enforces that some of the coordinates of x are either equal to or different from specified values.6 Within the LP, these can be enforced by setting µi (xi = a) = 1 or µi (xi = a) = 0. It can be shown that if one optimizes µ · θ with ˆ these constraints and µ ∈ M(G, x(∗) ), the solution is integral. Thus, the only element ˆ requiring approximation in the general case is the description of M(G, x(∗) ). We choose as ˆ this approximation the polytope MST (G, x(∗) ) in Eq. 9. We call the resulting approximaL tion algorithm Spanning TRee Inequalities and Partitioning for Enumerating Solutions, or STRIPES. In the next section, we evaluate this scheme experimentally. 5 Experiments We compared the performance of STRIPES to the BMMF algorithm [19] and the Lawler/Nilsson algorithm [6, 8]. Nilsson’s algorithm is equivalent to PES where the 2nd best assignment is obtained from maximizations within O(n) partitions, so that its runtime is O(n) times the cost of finding a single MAP. Here we approximated each MAP with its LP relaxation (as in STRIPES), so that both STRIPES and Nilsson come with certificates of optimality when their LP solutions are integral. 
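Since the extracted pseudocode above is hard to read, here is a compact sketch of the PES partitioning scheme as described: each accepted solution spawns a constrained sub-space, and the next best assignment is always the best runner-up over the current sub-spaces. The solver interfaces (map_solver, second_best_solver) and the constraint encoding are assumed abstractions, not the paper's API.

```python
def pes_m_best(map_solver, second_best_solver, f, M):
    """Sketch of Partitioning for Enumerating Solutions (PES).
    map_solver(constraints)            -> best assignment satisfying `constraints`
    second_best_solver(constraints, x) -> best assignment != x under `constraints`
    f(x)                               -> objective value of assignment x
    Constraints are (op, variable, value) triples with op in {"==", "!="}.
    """
    x = [map_solver([])]                      # x[k]: best assignment of sub-space k
    cons = [[]]                               # cons[k]: constraints defining sub-space k
    y = [second_best_solver([], x[0])]        # y[k]: runner-up within sub-space k
    solutions = [x[0]]
    for _ in range(1, M):
        k = max(range(len(y)), key=lambda kk: f(y[kk]))  # sub-space holding the next best
        x_new = y[k]
        solutions.append(x_new)
        # a coordinate where the new solution differs from the sub-space's current best
        v = next(i for i in range(len(x_new)) if x_new[i] != x[k][i])
        a = x_new[v]
        cons.append(cons[k] + [("==", v, a)])            # new sub-space contains x_new
        cons[k] = cons[k] + [("!=", v, a)]               # old sub-space now excludes x_new
        x.append(x_new)
        y[k] = second_best_solver(cons[k], x[k])         # runner-up in the shrunken space
        y.append(second_best_solver(cons[-1], x_new))    # runner-up in the new space
    return solutions
```

Plugging the LP-based second-best solver of the previous sections into second_best_solver gives a STRIPES-like procedure under these interface assumptions.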
BMMF relies on loopy BP to approximate the M best solutions.7 We used M = 50 in all experiments. To compare the algorithms, we pooled all their solutions, noting the 50 top probabilities, and then counted the fraction of these that any particular algorithm found (its solution rank). For run-time comparisons, we normalized the times by the longest-running algorithm for each example. We begin by considering pairwise MRFs on binary grid graphs of size 10 × 10. In the first experiment, we used an Ising model with attractive (submodular) potentials, a setting in which the pairwise LP relaxation is exact [14]. For each grid edge ij, we randomly chose Jij ∈ [0, 0.5], and local potentials were randomized in the range ±0.5. The results for 25 graphs are shown in Fig. 2. Both the STRIPES and Nilsson algorithms obtained the 50 optimal solutions (as learned from their optimality certificates), while BMMF clearly fared less well for some of the graphs. While the STRIPES algorithm took < 0.5 to 2 minutes to run, the Nilsson algorithm took around 13 minutes. On the other hand, BMMF was quicker, taking around 10 seconds per run, while failing to find a significant portion of the top solutions. Overall, the STRIPES algorithm was required to employ up to 19 spanning tree inequalities per calculation of second-best solution. 6 This is very different from the second best constraint, since setting x1 = 1 blocks all assignments with this value, as opposed to setting x = 1 which blocks only the assignment with all ones. 7 For BMMF, we used the C implementation at http://www.cs.huji.ac.il/~ talyam/ inference.html. The LPs for STRIPES and Nilsson were solved using CPLEX. 6 Next, we studied Ising models with mixed interaction potentials (with Jij and the local potentials randomly chosen in [−0.5, 0.5]). For almost all of the 25 models, all three algorithms were not able to successfully find the top solutions. Thus, we added regions of triplets (two for every grid face) to tighten the LP relaxation (for STRIPES and Nilsson) and to perform GBP instead of BP (for BMMF). This resulted in STRIPES and Nilsson always provably finding the optimal solutions, and BMMF mostly finding these solutions (Fig. 2). For these more difficult grids, however, STRIPES was the fastest of the algorithms, taking 0.5 - 5 minutes. On the other hand, the Nilsson and BMMF algorithms took 18 minutes and 2.5 7 minutes, respectively. STRIPES added up to 23 spanning tree inequalities per iteration. The protein side-chain prediction (SCP) problem is to to predict the placement of amino acid side-chains given a protein backbone [2, 18]. Minimization of a protein energy function corresponds to finding a MAP assignment for a pairwise MRF [19]. We employed the dataset of [18] (up to 45 states per variable, mean approximate tree-width 50), running all algorithms to calculate the optimal side-chain configurations. For 315 of 370 problems in the dataset, the first MAP solution was obtained directly as a result of the LP relaxation having an integral solution (“easy” problems). STRIPES provably found the subsequent top 50 solutions within 4.5 hours for all but one of these cases (up to 8 spanning trees per calculation), and BMMF found the same 50 solutions for each case within 0.5 hours; note that only STRIPES provides a certificate of optimality for these solutions. On the other hand, only for 146 of the 315 problems was the Nilsson method able to complete within five days; thus, we do not compare its performance here. 
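For reference, the grid experiments above use randomly generated Ising models with J_ij drawn from [0, 0.5] (attractive) or [−0.5, 0.5] (mixed) and local potentials in ±0.5; the sketch below is one plausible reading of that setup, with the ±J agreement table as an assumed parameterization.

```python
import numpy as np

def random_grid_mrf(n=10, attractive=True, seed=0):
    """Random n x n binary grid MRF roughly matching the described setup:
    couplings J_ij ~ U[0, 0.5] (attractive) or U[-0.5, 0.5] (mixed),
    local potentials ~ U[-0.5, 0.5]."""
    rng = np.random.default_rng(seed)
    theta_node = rng.uniform(-0.5, 0.5, size=(n * n, 2))
    edges, theta_edge = [], []
    for r in range(n):
        for c in range(n):
            i = r * n + c
            neighbors = []
            if c + 1 < n:
                neighbors.append(i + 1)   # right neighbor
            if r + 1 < n:
                neighbors.append(i + n)   # bottom neighbor
            for j in neighbors:
                lo = 0.0 if attractive else -0.5
                J = rng.uniform(lo, 0.5)
                # Ising-style coupling as a 2 x 2 table: +J for agreement, -J otherwise
                theta_edge.append(np.array([[J, -J], [-J, J]]))
                edges.append((i, j))
    return theta_node, edges, theta_edge
```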
For the remaining 55 (“hard”) problems (Fig. 2), we added problem-specific triplet regions using the MPLP algorithm [13]. We then ran the STRIPES algorithm to find the optimal solutions. Surprisingly, it was able to exactly find the 50 top solutions for all cases, using up to 4 standard spanning tree inequalities per second-best calculation. The STRIPES run-times for these problems ranged from 6 minutes to 23 hours. On the other hand, whether running BMMF without these regions (BP) or with the regions (GBP), it did not perform as well as STRIPES in terms of the number of high-ranking solutions or its speed. To summarize, STRIPES provably found the top 50 solutions for 369 of the 370 protein SCP problems. 6 Conclusion ˆ In this work, we present a novel combinatorial object M(G, z) and show its utility in obtaining the M best MAP assignments. We provide a simple characterization of it for tree structured graphs, and show how it can be used for approximations in non-tree graphs. As with the marginal polytope, many interesting questions arise about the properties of ˆ M(G, z). For example, in which non-tree cases can we provide a compact characterization (e.g., as for the cut-polytope for planar graphs [1]). Another compelling question is in which problems the spanning tree inequalities are provably optimal. An interesting generalization of our method is to predict diverse solutions satisfying some local measure of “distance” from each other, e.g., as in [2]. Here we studied the polytope that results from excluding one assignment. An intriguing question is to characterize the polytope that excludes M assignments. We have found that it does not simply correspond to adding M constraints I(µ, z i ) ≤ 0 for i = 1, . . . , M , so its ˆ geometry is apparently more complicated than that of M(G, z). Here we used LP solvers to solve for µ. Such generic solvers could be slow for large-scale problems. However, in recent years, specialized algorithms have been suggested for solving MAP-LP relaxations [3, 5, 9, 17]. These use the special form of the constraints to obtain local-updates and more scalable algorithms. We intend to apply these schemes to our method. Finally, our empirical results show that our method indeed leverages the power of LP relaxations and yields exact M best optimal solutions for problems with large tree-width. Acknowledgements We thank Nati Linial for his helpful discussions and Chen Yanover and Talya Meltzer for their insight and help in running BMMF. We also thank the anonymous reviewers for their useful advice. 7 A Proof of Theorem 1 Recall that for any µ ∈ M(G), there exists a probability density p(x) s.t. µ = x p(x)v(x). Denote pµ (z) as the minimal value of p(z) among all p(x) that give µ. We prove that ˆ pµ (z) = max(0, I(µ, z)), from which the theorem follows (since pµ (z) = 0 iff µ ∈ M(G, z)). The proof is by induction on n. For n = 1, the node has degree 0, so I(µ, z) = µ1 (z1 ). Clearly, pµ (z) = µ1 (z1 ), so pµ (z) = I(µ, z). For n > 1, there must exist a leaf in G ˆ (assume that its index is n and its neighbor’s is n − 1). Denote G as the tree obtained ˆ by removing node n and its edge with n − 1. For any assignment x, denote x as the corresponding sub-assignment for the first n − 1 variables. Also, any µ can be derived by ˆ ˆ adding appropriate coordinates to a unique µ ∈ M(G). For an integral vertex µ = v(x), ˆˆ ˆ ˆ ˆ ˆ x denote its projected µ as v (ˆ ). Denote by I(µ, z ) the functional in Eq. 5 applied to G. 
For ˆ any µ and its projected µ, it can be seen that: ˆˆ ˆ I(µ, z) = I(µ, z ) − α (11) where we define α = xn =zn µn−1,n (zn−1 , xn ) (so 0 ≤ α ≤ 1). The inductive assumption ˆ ˆ ˆ gives a p(ˆ ) that has marginals µ and also p(ˆ ) = max(0, I(µ, z )). We next use p(ˆ ) to ˆx ˆz ˆx construct a p(x) that has marginals µ and the desired minimal pµ (z). Consider three cases: ˆˆ ˆ I. I(µ, z) ≤ 0 and I(µ, z ) ≤ 0. From the inductive assumption, pµ (ˆ ) = 0, so we define: ˆˆ z µn−1,n (xn−1 , xn ) p(x) = p(ˆ ) ˆx (12) µn−1 (xn−1 ) which indeed marginalizes to µ, and p(z) = 0 so that pµ (z) = 0 as required. If µn−1 (xn−1 ) = 0, then p(ˆ ) is necessarily 0, in which case we define p(x) = 0. Note that this construction ˆx is identical to that used in proving that ML (G) = M(G) for a tree graph G. ˆˆ ˆ II. I(µ, z) > 0. Based on Eq. 11 and α ≥ 0, we have I(µ, z ) > 0. Applying the inductive ˆ µ, z ) = pµ (ˆ ) > 0. Now, define p(x) so that p(z) = I(µ, z): ˆ assumption to µ, we obtain I( ˆ ˆ ˆˆ z xl , l ≤ n − 2 δ(xn−1 = zn−1 ) δ(xn = zn ) p(x) no constraint 0 no constraint As in Eq. 12 0 0 ∃ l x l = zl 1 ∀ l x l = zl 1 µn−1,n (zn−1 , xn ) 1 1 p(ˆ ) ˆx 0 I(µ, z) Simple algebra shows that p(x) is non-negative and has µ as marginals. We now show that p(z) is minimal. Based on the inductive assumption and Eq. 11, it can easily be shown that I(v(z), z) = 1, I(v(x), z) ≤ 0 for x = z. For any p(x) s.t. µ = x p(x)v(x), from linearity, I(µ, z) = p(z) + x=z p(x)I(v(x), z) ≤ p(z) (since I(v(x), z) ≤ 0 for x = z). Since the p(z) we define achieves this lower bound, it is clearly minimal. ˆˆ ˆ ˆ III. I(µ, z) ≤ 0 but I(µ, z ) > 0. Applying the inductive assumption to µ, we see that ˆ µ, z ) > 0; Eq. 11 implies α − I(µ, z ) ≥ 0. Define β = µn−1 (zn−1 ) − pµ (ˆ ), which ˆˆ ˆ ˆˆ z pµ (ˆ ) = I( ˆ ˆ ˆˆ z ˆ is non-negative since µn−1 (zn−1 ) = µn−1 (ˆ n−1 ) and p marginalizes to µ. Define p(x) as: ˆ z ˆ xl , l ≤ n − 2 δ(xn−1 = zn−1 ) δ(xn = zn ) no constraint 0 no constraint ∃ l x l = zl As in Eq. 12 0 ˆ ˆ z µ (z ,x ) p(ˆ ) n−1,n βn−1 n α−I(µ,ˆ ) ˆx α µ (z ,z ) p(ˆ ) n−1,n βn−1 n ˆx (z ,x ) ˆˆ ˆ µ I(µ, z ) n−1,n αn−1 n 1 0 0 1 1 ∀ l x l = zl p(x) 1 which indeed marginalizes to µ, and p(z) = 0 so that pµ (z) = 0, as required. 8 References [1] F. Barahona. On cuts and matchings in planar graphs. Math. Program., 60(1):53–68, 1993. [2] M. Fromer and C. Yanover. Accurate prediction for atomic-level protein design and its application in diversifying the near-optimal sequence space. Proteins: Structure, Function, and Bioinformatics, 75:682–705, 2009. [3] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA, 2007. [4] E. Kloppmann, G. M. Ullmann, and T. Becker. An extended dead-end elimination algorithm to determine gap-free lists of low energy states. Journal of Comp. Chem., 28:2325–2335, 2007. [5] N. Komodakis and N. Paragios. Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles. In D. Forsyth, P. Torr, and A. Zisserman, editors, ECCV, pages 806–820, Heidelberg, Germany, 2008. Springer. [6] E. L. Lawler. A procedure for computing the K best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18(7):401–405, 1972. [7] I. Ljubic, R. Weiskircher, U. Pferschy, G. W. Klau, P. Mutzel, and M. Fischetti. 
An algorithmic framework for the exact solution of the prize-collecting steiner tree problem. Mathematical Programming, 105:427–449, Feb 2006. [8] D. Nilsson. An efficient algorithm for finding the M most probable configurations in probabilistic expert systems. Statistics and Computing, 8:159–173, Jun 1998. [9] P. Ravikumar, A. Agarwal, and M. Wainwright. Message-passing for graph-structured linear programs: proximal projections, convergence and rounding schemes. In Proc. of the 25th international conference on Machine learning, pages 800–807, New York, NY, USA, 2008. ACM. [10] E. Santos. On the generation of alternative explanations with implications for belief revision. In Proc. of the 7th Annual Conference on Uncertainty in Artificial Intelligence, 1991. [11] Y. Shimony. Finding the MAPs for belief networks is NP-hard. 68(2):399–410, 1994. Aritifical Intelligence, [12] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1393–1400. MIT Press, Cambridge, MA, 2007. [13] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proc. of the 24th Annual Conference on Uncertainty in Artificial Intelligence, pages 503–510, 2008. [14] B. Taskar, S. Lacoste-Julien, and M. I. Jordan. Structured prediction, dual extragradient and bregman projections. J. Mach. Learn. Res., 7:1627–1653, 2006. [15] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, 2008. [16] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005. [17] T. Werner. A linear programming approach to max-sum problem: A review. IEEE Trans. Pattern Anal. Mach. Intell., 29(7):1165–1179, 2007. [18] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Journal of Machine Learning Research, 7:1887–1907, 2006. [19] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. [20] J. Yedidia, W. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282– 2312, 2005. 9
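As a quick numerical sanity check of the separating property used in the proof of Theorem 1 (I(v(z), z) = 1 while I(v(x), z) ≤ 0 for every x ≠ z), the brute-force sketch below enumerates all assignments of a small binary tree; the particular tree and the excluded assignment are arbitrary illustrative choices.

```python
import itertools

# A 4-node binary tree: edges (0,1), (1,2), (1,3); node 1 has degree 3.
edges = [(0, 1), (1, 2), (1, 3)]
deg = {0: 1, 1: 3, 2: 1, 3: 1}
z = (0, 1, 0, 1)                  # the assignment being excluded

def I_vertex(x, z):
    """Evaluate the functional of Eq. 5 at the integral vertex v(x)."""
    val = sum((1 - deg[i]) * int(x[i] == z[i]) for i in range(4))
    val += sum(int(x[i] == z[i] and x[j] == z[j]) for i, j in edges)
    return val

values = {x: I_vertex(x, z) for x in itertools.product((0, 1), repeat=4)}
assert values[z] == 1                                     # v(z) violates I <= 0 and is cut off
assert all(v <= 0 for x, v in values.items() if x != z)   # all other vertices satisfy it
print("separating property verified on this tree")
```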

4 0.88345397 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions

Author: Matthias Bethge, Eero P. Simoncelli, Fabian H. Sinz

Abstract: We introduce a new family of distributions, called Lp -nested symmetric distributions, whose densities are expressed in terms of a hierarchical cascade of Lp norms. This class generalizes the family of spherically and Lp -spherically symmetric distributions which have recently been successfully used for natural image modeling. Similar to those distributions it allows for a nonlinear mechanism to reduce the dependencies between its variables. With suitable choices of the parameters and norms, this family includes the Independent Subspace Analysis (ISA) model as a special case, which has been proposed as a means of deriving filters that mimic complex cells found in mammalian primary visual cortex. Lp -nested distributions are relatively easy to estimate and allow us to explore the variety of models between ISA and the Lp -spherically symmetric models. By fitting the generalized Lp -nested model to 8 × 8 image patches, we show that the subspaces obtained from ISA are in fact more dependent than the individual filter coefficients within a subspace. When first applying contrast gain control as preprocessing, however, there are no dependencies left that could be exploited by ISA. This suggests that complex cell modeling can only be useful for redundancy reduction in larger image patches. 1
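To illustrate what a hierarchical cascade of Lp norms looks like, the minimal sketch below evaluates a two-level Lp-nested function on a coefficient vector; the grouping and the exponents are arbitrary examples, not values fitted in the paper.

```python
import numpy as np

def lp_nested(v, groups, p_inner, p_outer):
    """Two-level Lp-nested function: an outer Lp 'norm' over inner Lp norms of groups.
    With all p_inner = 2 this is an ISA-style grouping; with a single group it reduces
    to the radial variable of an Lp-spherically symmetric model."""
    inner = [np.sum(np.abs(v[g]) ** p) ** (1.0 / p) for g, p in zip(groups, p_inner)]
    return np.sum(np.array(inner) ** p_outer) ** (1.0 / p_outer)

# Example: 8 filter coefficients split into two subspaces with different inner exponents.
rng = np.random.default_rng(1)
v = rng.normal(size=8)
value = lp_nested(v, groups=[np.arange(0, 4), np.arange(4, 8)],
                  p_inner=[2.0, 1.5], p_outer=0.8)
```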

same-paper 5 0.8599574 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

Author: Gal Chechik, Uri Shalit, Varun Sharma, Samy Bengio

Abstract: Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before. 1
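A minimal sketch of the kind of online bilinear-similarity update the abstract describes, namely a passive-aggressive hinge-loss step on triplets under S_W(a, b) = a^T W b, is given below with dense arrays for readability; the exact OASIS update, its margin handling, and its sparse bookkeeping are not reproduced here, and the training loop uses random vectors purely to show the call pattern.

```python
import numpy as np

def triplet_update(W, p, p_pos, p_neg, C=0.1):
    """One online update on a triplet (p, p_pos, p_neg): p should be more similar to
    p_pos than to p_neg under S_W(a, b) = a^T W b, by a margin of 1.
    Passive-aggressive style step; the aggressiveness parameter C caps the step size."""
    loss = max(0.0, 1.0 - p @ W @ p_pos + p @ W @ p_neg)
    if loss > 0.0:
        V = np.outer(p, p_pos - p_neg)                 # gradient of the hinge w.r.t. W
        tau = min(C, loss / (np.sum(V * V) + 1e-12))   # bounded closed-form step
        W = W + tau * V
    return W

# Usage: start from the identity (plain dot-product similarity) and stream triplets.
d = 32
rng = np.random.default_rng(0)
W = np.eye(d)
for _ in range(1000):
    p, p_pos, p_neg = rng.normal(size=(3, d))          # placeholders; real triplets share labels
    W = triplet_update(W, p, p_pos, p_neg)
```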

6 0.7359882 137 nips-2009-Learning transport operators for image manifolds

7 0.72487086 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

8 0.71829849 119 nips-2009-Kernel Methods for Deep Learning

9 0.71175796 237 nips-2009-Subject independent EEG-based BCI decoding

10 0.71037555 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

11 0.70816308 104 nips-2009-Group Sparse Coding

12 0.70609564 151 nips-2009-Measuring Invariances in Deep Networks

13 0.70566362 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

14 0.70552683 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

15 0.70256591 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

16 0.70218384 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

17 0.69862705 72 nips-2009-Distribution Matching for Transduction

18 0.69597644 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

19 0.6959455 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

20 0.69302869 71 nips-2009-Distribution-Calibrated Hierarchical Classification