Title: A Data-Driven Approach to Modeling Choice (NIPS 2009, paper 7)
Source: pdf
Author: Vivek Farias, Srikanth Jagabathula, Devavrat Shah
Abstract: We visit the following fundamental problem: For a ‘generic’ model of consumer choice (namely, distributions over preference lists) and a limited amount of data on how consumers actually make decisions (such as marginal preference information), how may one predict revenues from offering a particular assortment of choices? This problem is central to areas within operations research, marketing and econometrics. We present a framework to answer such questions and design a number of tractable algorithms (from a data and computational standpoint) for the same.
1 Introduction

Consider a seller who must pick, from a universe N of N products, a subset M of products to offer to his customers.
Given a probabilistic model of how customers make choices, P(·|·), where P(i|M) is the probability that a potential customer purchases product i when faced with the options M, the seller may solve

(1)   max_{M⊂N} Σ_{i∈M} p_i P(i|M).

In addition to being a potentially non-trivial optimization problem, one faces a far more fundamental obstacle here: specifying the ‘choice’ model P(·|·) is a difficult task, and it is unlikely that a seller will have sufficient data to estimate a generic such model.
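As a concrete illustration of (1), the following Python sketch evaluates Σ_{i∈M} p_i P(i|M) for a black-box choice model and searches over offer sets by brute force. The uniform toy choice model and all names here (`expected_revenue`, `best_assortment`) are illustrative assumptions rather than constructs from the paper, and the enumeration is only viable for a small universe.

```python
# Brute-force view of problem (1): pick the revenue-maximizing offer set
# given prices p_i and a black-box choice model P(i | M).
from itertools import combinations

def expected_revenue(offer_set, prices, choice_prob):
    """R(M) = sum_{i in M} p_i * P(i | M)."""
    return sum(prices[i] * choice_prob(i, offer_set) for i in offer_set)

def best_assortment(universe, prices, choice_prob):
    """Enumerate all non-empty subsets M (feasible only for small N)."""
    best_set, best_rev = None, float("-inf")
    for k in range(1, len(universe) + 1):
        for M in combinations(universe, k):
            rev = expected_revenue(M, prices, choice_prob)
            if rev > best_rev:
                best_set, best_rev = M, rev
    return best_set, best_rev

if __name__ == "__main__":
    universe = [0, 1, 2]
    prices = {0: 10.0, 1: 7.0, 2: 3.0}
    uniform = lambda i, M: 1.0 / len(M)   # hypothetical stand-in for P(i | M)
    print(best_assortment(universe, prices, uniform))   # ((0,), 10.0)
```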
With a few exceptions, the typical approach to dealing with this challenge has been to make parametric assumptions on the choice model that allow it to be estimated from a limited amount of data. This approach has a natural deficiency – the implicit assumptions made in specifying a parametric model of choice may not hold. These issues have led to a proliferation of increasingly arcane parametric choice models. The present work considers the following question: given a limited amount of data on customer preferences and assuming only a ‘generic’ model of customer choice, what can one predict about expected revenues from a given set of products? We take as our ‘generic’ model of customer choice the set of distributions over all possible customer preference lists (i.e., over all permutations of the N products). We will subsequently see that essentially all extant models of customer choice can be viewed as special cases of this generic model. We view ‘data’ as some linear transformation of the distribution specifying the choice model, yielding marginal information.
Given these views, we first consider finding the ‘simplest’ choice model consistent with the observed marginal data on customer preferences. Here we take as our notion of simple a distribution over permutations of N with the sparsest support. We present two simple abstract properties that, if satisfied by the ‘true’ choice model, allow us to solve the sparsest fit problem exactly via a simple combinatorial procedure (Theorem 2). In fact, the sparsest fit in this case coincides with the true model (Theorem 1). We present a generative family of choice models that illustrates when the two properties we posit may be expected to hold (see Theorem 3).
We then consider a robust approach: computing the minimum expected revenue from a given offer set that is consistent with the observed data. This entails solving mathematical programs with as many variables as there are permutations (N! of them).
In spite of this, we present a simple efficient procedure to solve this problem that is exact for certain interesting types of data and produces approximations (and computable error bounds) in general. Finally, we present a computational study illustrating the efficacy of our approach relative to a parametric technique on a real-world data set. Our main contribution is thus a novel approach to modeling customer choice given limited data, together with algorithms that yield subroutines making non-parametric revenue predictions for any given offer set (i.e., estimates of expected revenue for any candidate set M). Such subroutines could then be used in conjunction with generic set-function optimization heuristics to solve (1).
Relevant Literature: There is a vast body of literature on the parametric modeling of customer choice; a seminal paper in this regard is [10]. There is also a stream of work (see, e.g., [6]) on estimating and optimizing (parametric) choice models when products possess measurable attributes that are the sole influencers of choice; we do not assume the availability of such attributes and thus do not consider this situation here.
A non-parametric approach to choice modeling is considered by [12]; that work studies a somewhat distinct pricing problem, and assumes the availability of a specific type of rich observable data. Fitting a sparsest model to observable data has recently become of great interest in the area of compressive sensing in signal processing [3, 7], and in the design of sketches for streaming algorithms [2, 4]. That literature focuses on deriving precise conditions on the support size of the true model which, when satisfied, guarantee that the sparsest solution is indeed the true solution.
2 The Model

A consumer is associated with a permutation σ of the elements of N; the customer prefers product i to product j iff σ(i) < σ(j). Given that the customer is faced with a set of alternatives M ⊂ N, she chooses to purchase her single most preferred product among those in M.
Choice Model: We take as our model of customer choice a distribution λ : S_N → [0, 1] over all possible permutations (i.e., customer types) in S_N. Define the set S_j(M) = {σ ∈ S_N : σ(j) < σ(i), ∀i ∈ M, i ≠ j} as the set of all customer types that would result in a purchase of j when the offer set is M. The probability that a customer purchases product j when offered the set M is then λ_j(M) = Σ_{σ∈S_j(M)} λ(σ). This model subsumes a vast body of extant parametric choice models.
Revenues: We associate every product j in N with a retail price p_j. The expected revenue to a retailer from offering a set of products M to his customers under our choice model is thus given by R(M) = Σ_{j∈M} p_j λ_j(M).
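The following minimal Python sketch (an illustration only, assuming N is small enough that preference lists can be enumerated) spells out the quantities just defined: λ_j(M) as the mass of S_j(M), and R(M) as the price-weighted sum. The dictionary representation of λ and the toy numbers are my own choices.

```python
# Generic choice model: lambda is a distribution over permutations; a
# permutation tuple lists products from most to least preferred, and a
# customer buys her most preferred product within the offer set M.
def choice_probability(j, M, lam):
    """lambda_j(M): total mass of permutations under which j is the most
    preferred product in M (the set S_j(M) in the text)."""
    prob = 0.0
    for sigma, weight in lam.items():
        rank = {product: pos for pos, product in enumerate(sigma)}
        if all(rank[j] < rank[i] for i in M if i != j):
            prob += weight
    return prob

def expected_revenue(M, prices, lam):
    """R(M) = sum_{j in M} p_j * lambda_j(M)."""
    return sum(prices[j] * choice_probability(j, M, lam) for j in M)

# Toy usage: a two-permutation model over products {0, 1, 2}.
lam = {(0, 1, 2): 0.6, (2, 1, 0): 0.4}   # permutation -> probability
prices = {0: 10.0, 1: 7.0, 2: 3.0}
print(expected_revenue({0, 2}, prices, lam))   # 0.6*10 + 0.4*3 = 7.2
```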
Data: A seller will have limited data with which to estimate λ. We simply assume that the data observed by the seller is given by an m-dimensional ‘partial information’ vector y = Aλ, where A ∈ {0, 1}^{m×N!} makes precise the relationship between the observed data and the underlying choice model.
For the purposes of illustration, we consider the following concrete examples of data vectors y. Ranking Data: This data represents the fraction of customers that rank a given product i as their r-th choice. Comparison Data: This data represents the fraction of customers that prefer a given product i to a product j; for each i, j, y_{i,j} denotes the probability that product i is preferred to product j. Top Set Data: This data refers to a concatenation of the “Comparison Data” and information on the fraction of customers who have a given product i as their topmost choice, for each i. Many other types of data vectors consistent with the above view are possible; all we anticipate is that the dimension m of the observed data is substantially smaller than N!.
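To make the view y = Aλ concrete, here is a small sketch (again assuming a small, enumerable universe) that computes the ranking and comparison marginals induced by a distribution λ over permutations; the array layouts and function names are my own conventions, not the paper's.

```python
# Two concrete linear transformations of lambda: ranking marginals
# y[i, r] and pairwise-comparison marginals y[i, j].
import numpy as np

def ranking_marginals(lam, N):
    """y[i, r] = fraction of customers ranking product i in position r."""
    y = np.zeros((N, N))
    for sigma, weight in lam.items():        # sigma: tuple, best to worst
        for r, i in enumerate(sigma):
            y[i, r] += weight
    return y

def comparison_marginals(lam, N):
    """y[i, j] = fraction of customers preferring product i to product j."""
    y = np.zeros((N, N))
    for sigma, weight in lam.items():
        rank = {i: r for r, i in enumerate(sigma)}
        for i in range(N):
            for j in range(N):
                if i != j and rank[i] < rank[j]:
                    y[i, j] += weight
    return y

lam = {(0, 1, 2): 0.6, (2, 1, 0): 0.4}
print(ranking_marginals(lam, 3))
print(comparison_marginals(lam, 3))
```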
We are now in a position to formulate the questions broached in the previous section precisely. “Simplest” Model: In finding the simplest choice model consistent with the observed data we attempt to solve

(2)   minimize_λ ‖λ‖_0   subject to   Aλ = y,  1^T λ = 1,  λ ≥ 0.

Robust Approach: For a given offer set M and data vector y, what are the minimal expected revenues we might expect from M consistent with the observed data?
To answer this question, we attempt to solve

(3)   minimize_λ Σ_{j∈M} p_j λ_j(M)   subject to   Aλ = y,  1^T λ = 1,  λ ≥ 0.
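For intuition, program (3) can be solved directly on a toy instance by treating each permutation as an LP variable; the sketch below does this with scipy for comparison data. This brute-force formulation is only an illustration of (3) — it scales as N! and is precisely what the reformulation developed below avoids. The function name and data layout are assumptions of the sketch.

```python
# Robust program (3), solved explicitly for small N: find the distribution
# over permutations that matches the observed marginals and minimizes the
# expected revenue from offer set M.
import numpy as np
from itertools import permutations
from scipy.optimize import linprog

def robust_revenue_lower_bound(M, prices, y_pairs, N):
    """y_pairs[i, j] = observed fraction preferring i to j (comparison data)."""
    perms = list(permutations(range(N)))        # one LP variable per permutation
    c = np.zeros(len(perms))                    # revenue earned under each permutation
    rows = [(i, j) for i in range(N) for j in range(N) if i != j]
    A_eq = np.zeros((len(rows) + 1, len(perms)))
    b_eq = np.array([y_pairs[i, j] for i, j in rows] + [1.0])
    for col, sigma in enumerate(perms):
        rank = {prod: r for r, prod in enumerate(sigma)}
        purchased = min(M, key=lambda prod: rank[prod])   # most preferred in M
        c[col] = prices[purchased]
        for row, (i, j) in enumerate(rows):
            A_eq[row, col] = 1.0 if rank[i] < rank[j] else 0.0
        A_eq[-1, col] = 1.0                     # total mass sums to one
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun if res.success else None

# Toy usage: marginals generated from a known two-permutation model.
lam = {(0, 1, 2): 0.6, (2, 1, 0): 0.4}
y = np.zeros((3, 3))
for sigma, w in lam.items():
    rank = {p: r for r, p in enumerate(sigma)}
    for i in range(3):
        for j in range(3):
            if i != j and rank[i] < rank[j]:
                y[i, j] += w
prices = {0: 10.0, 1: 7.0, 2: 3.0}
# A lower bound on R({0, 1, 2}); the true value under lam is 7.2.
print(robust_revenue_lower_bound({0, 1, 2}, prices, y, N=3))
```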
3 Estimating Sparse Choice Models

Here we consider finding the sparsest model consistent with the observed data (i.e., solving the program in (2)). Two questions arise: (a) When does the sparsest model coincide with the true underlying choice model? (b) Is there an efficient procedure to solve the program in (2)?
We begin by identifying two simple conditions that define a class of choice models (i.e., of distributions λ). Assuming that the ‘true’ underlying model λ belongs to this class, we prove that the sparsest model (i.e., the solution to (2)) coincides with the true model and can be recovered by a simple combinatorial procedure. Nonetheless, we show that the conditions we impose are not overly restrictive: we prove that a “sufficiently” sparse model generated uniformly at random from the set of all possible choice models satisfies the two conditions with high probability. When the two conditions are satisfied, the sparsest solution is indeed the true solution; this is the content of Theorem 1. The recovery procedure assumes the observed values y_d are sorted.
Given K and an interval [a, b] on the positive real line, we generate a choice model λ as follows: choose K permutations σ_1, σ_2, …, σ_K uniformly at random with replacement, choose K numbers uniformly at random from the interval [a, b], normalize the numbers so that they sum to 1, and assign them to the permutations σ_i, 1 ≤ i ≤ K. Note that, since we are choosing permutations in the support with replacement, there could be repetitions.
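A minimal sketch of this generative family (the dictionary representation of λ and the seed are my own conventions):

```python
# Generate a random sparse choice model: K permutations chosen uniformly
# with replacement, weighted by K uniform draws from [a, b], normalized.
import random

def random_sparse_choice_model(N, K, a, b, seed=None):
    rng = random.Random(seed)
    model = {}
    for _ in range(K):
        sigma = tuple(rng.sample(range(N), N))      # a uniformly random permutation
        weight = rng.uniform(a, b)
        # Sampling with replacement: repeated permutations simply accumulate mass.
        model[sigma] = model.get(sigma, 0.0) + weight
    total = sum(model.values())
    return {sigma: w / total for sigma, w in model.items()}

print(random_sparse_choice_model(N=5, K=3, a=1.0, b=2.0, seed=0))
```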
Theorem 3. Suppose λ is a choice model of support size K drawn from the generative model. Then λ satisfies the “signature” and “linear independence” conditions with probability 1 − o(1) as N → ∞, provided K = O(N) for ranking data, K = o(log N) for comparison data, and K = o(√N) for the top set data.

Of course, in general, the underlying choice model may not satisfy the two conditions we have posited or be exactly recoverable from the observed data.
4 The Robust Approach

We therefore turn to solving the robust program (3). In preparation for taking its dual, let A_j(M) ≜ {A(σ) : σ ∈ S_j(M)} denote the set of columns of A corresponding to S_j(M), where, recall, S_j(M) is the set of all permutations that result in the purchase of j ∈ M when the assortment M is offered.
Armed with this notation, the dual of (3) is

(4)   maximize_{α,ν}  α^T y + ν   subject to   max_{x_j ∈ A_j(M)} α^T x_j + ν ≤ p_j   for each j ∈ M.
Suppose we partition S_j(M) into subsets indexed by d, let A_jd(M) denote the corresponding subsets of columns, and let Ā_jd(M) denote their convex hulls. Ā_jd(M) is by definition a polytope contained in the m-dimensional unit cube [0, 1]^m. In other words, it admits a description of the form

(5)   Ā_jd(M) = { x_jd : A_jd^1 x_jd ≥ b_jd^1,  A_jd^2 x_jd = b_jd^2,  A_jd^3 x_jd ≤ b_jd^3,  x_jd ≥ 0 }.
By a canonical representation of A_j(M), we will thus understand a partition of S_j(M) and a polyhedral representation of the columns corresponding to every set in the partition, as given by (5). Ignoring the problem of actually obtaining this representation for now, we assume access to a canonical representation and present a simple program, whose size is polynomial in the size of this representation, that is equivalent to (3), (4). For simplicity of notation, we assume that each of the polytopes Ā_jd(M) is in standard form, Ā_jd(M) = {x_jd : A_jd x_jd = b_jd, x_jd ≥ 0}. Since a linear function is always optimized at the vertices of a polytope, we know:

max_{x_j ∈ A_j(M)} α^T x_j + ν  =  max_d  max_{x_jd ∈ Ā_jd(M)} α^T x_jd + ν.
By strong duality we have, for each j and d,

(6)   max_{x_jd ∈ Ā_jd(M)} α^T x_jd + ν
      =  maximize_{x_jd}  α^T x_jd + ν   subject to   A_jd x_jd = b_jd,  x_jd ≥ 0
      =  minimize_{γ_jd}  b_jd^T γ_jd + ν   subject to   γ_jd^T A_jd ≥ α^T.

We have thus established the following useful equality:

{ α, ν : max_{x_j ∈ A_j(M)} α^T x_j + ν ≤ p_j }  =  { α, ν : there exist γ_jd with  b_jd^T γ_jd + ν ≤ p_j,  γ_jd^T A_jd ≥ α^T,  d = 1, 2, … }.

It follows that solving (3) is equivalent to the following LP, whose complexity is polynomial in the description of our canonical representation:

(7)   maximize_{α, ν, γ}  α^T y + ν   subject to   b_jd^T γ_jd + ν ≤ p_j,   γ_jd^T A_jd ≥ α^T,   for all j ∈ M, d = 1, 2, ….
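Assuming one is handed standard-form canonical representations {(A_jd, b_jd)}, each with m columns, LP (7) can be assembled mechanically; the sketch below does so with scipy. The variable layout and the helper name `solve_lp7` are my own, and this is a sketch of the program in (7), not the authors' implementation.

```python
# Assemble and solve LP (7) given canonical representations.
# reps: dict mapping j -> list of (A_jd, b_jd) pairs, each A_jd with m columns.
import numpy as np
from scipy.optimize import linprog

def solve_lp7(y, prices, reps):
    m = len(y)
    offsets, n_vars = {}, m + 1                      # alpha (m) and nu (1) come first
    for j, blocks in reps.items():
        for d, (A_jd, b_jd) in enumerate(blocks):
            offsets[(j, d)] = (n_vars, A_jd.shape[0])
            n_vars += A_jd.shape[0]                  # one gamma_jd block per (j, d)
    c = np.zeros(n_vars)
    c[:m], c[m] = -np.asarray(y), -1.0               # linprog minimizes: negate y'alpha + nu
    rows, rhs = [], []
    for j, blocks in reps.items():
        for d, (A_jd, b_jd) in enumerate(blocks):
            off, sz = offsets[(j, d)]
            row = np.zeros(n_vars)                   # b_jd' gamma_jd + nu <= p_j
            row[m] = 1.0
            row[off:off + sz] = b_jd
            rows.append(row)
            rhs.append(prices[j])
            for k in range(A_jd.shape[1]):           # alpha_k - (column k of A_jd)' gamma_jd <= 0
                row = np.zeros(n_vars)
                row[k] = 1.0
                row[off:off + sz] = -A_jd[:, k]
                rows.append(row)
                rhs.append(0.0)
    res = linprog(c, A_ub=np.vstack(rows), b_ub=np.array(rhs),
                  bounds=(None, None), method="highs")
    return -res.fun if res.success else None         # optimal value of (7)
```

In practice the representations {(A_jd, b_jd)} would come from constructions such as the one for ranking data described next.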
Our ability to solve (7) relies on our ability to produce an efficient canonical representation of S_j(M). Canonical Representation for Ranking Data: Recall the definition of ranking data from Section 2.
It is not difficult to show that the set A_jd(M) is equal to the set of all vectors x_jd ∈ {0, 1}^{N×N}, with entries x^{jd}_{ri} indicating that product i receives rank r, satisfying

(8)   Σ_{r=0}^{N−1} x^{jd}_{ri} = 1  for 0 ≤ i ≤ N − 1,
      Σ_{i=0}^{N−1} x^{jd}_{ri} = 1  for 0 ≤ r ≤ N − 1,
      x^{jd}_{ri} ∈ {0, 1}  for 0 ≤ i, r ≤ N − 1,

together with equality constraints fixing a few entries (of the form x^{jd}_{dj} = 1 and x^{jd}_{di} = 0) that encode the purchase of product j from the offer set M.
Now consider replacing the third (integrality) constraint in (8) with simply the non-negativity constraint x_jd ≥ 0. It is clear that the resulting polytope contains A_jd(M). In addition, one may show that the resulting polytope has integral vertices, since it is simply a matching polytope with some variables forced to be integers; so that in fact the polytope is precisely Ā_jd(M), and we have our canonical representation.
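The sketch below builds such a representation for ranking data under one plausible reading of (8): the partition index d is taken to be the rank assigned to the purchased product j, and the fixed entries force j to rank d while the other offered products rank strictly worse. The entry-fixing details are an assumption on my part; only the doubly-stochastic (matching-polytope) core is stated explicitly above.

```python
# A candidate canonical representation for ranking data (my reading of (8)).
# Variables x[r, i] indicate "product i has rank r", flattened as x[r*N + i].
import numpy as np

def ranking_canonical_rep(N, M, j, d):
    """Return (A_jd, b_jd) with A_jd x = b_jd, x >= 0 intended to describe
    Abar_jd(M) for ranking data, under the reading stated above."""
    rows, rhs = [], []
    def eq(coeffs, value):
        rows.append(coeffs)
        rhs.append(value)
    for r in range(N):                      # each rank holds exactly one product
        row = np.zeros(N * N); row[r * N:(r + 1) * N] = 1.0; eq(row, 1.0)
    for i in range(N):                      # each product gets exactly one rank
        row = np.zeros(N * N); row[i::N] = 1.0; eq(row, 1.0)
    row = np.zeros(N * N); row[d * N + j] = 1.0; eq(row, 1.0)   # j has rank d
    for i in M:                             # other offered products rank below d
        if i != j:
            for r in range(d):
                row = np.zeros(N * N); row[r * N + i] = 1.0; eq(row, 0.0)
    return np.array(rows), np.array(rhs)

A_jd, b_jd = ranking_canonical_rep(N=4, M={0, 2}, j=0, d=1)
print(A_jd.shape, b_jd.shape)   # constraint matrix over N*N = m variables
```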
We use this data as an example to illustrate a general procedure for computing a canonical representation. Now consider the polytope, call it Ā^o_j(M), obtained by relaxing the fourth (integrality) constraint to simply x_j ≥ 0. Unlike the case of ranking data, however, Ā^o_j(M) can in fact be shown to be non-integral, so that Ā^o_j(M) ≠ Ā_j(M) in general. One then solves the optimization problem max (α^{(1)})^T x_j subject to x_j ∈ Ā^o_j(M) for each j, and defines outer-approximations to Ā_j1(M) and Ā_j2(M) as the projections of Ā^o_j(M) onto x^j_{i_k} = 1 and x^j_{i_k} = 0, respectively.
5 An Empirical Evaluation of the Approach

We have presented simple sub-routines to estimate the revenues R(M) from a particular offer set M, given marginal preference data y. These sub-routines are effectively ‘non-parametric’ and can form the basis of a procedure that solves the revenue optimization problem posed in the introduction. Here we seek to contrast this approach with a commonly used parametric approach.
We consider two types of observable data: ranking data, and a ‘censored’ version of the comparison data which gives us, for every pair of products i, j ≠ 0, the fraction of customers that prefer i to j and in addition prefer i to 0 (i.e., to the no-purchase alternative). The parametric recipe we consider is the following: one fits a Multinomial Logit (MNL) model to the observable data and picks an optimal offer set by evaluating R(M) = Σ_{j∈M} p_j P(j|M), assuming P(·|M) follows the estimated model. The MNL is a commonly used parametric model that associates with each product i in N a positive scalar w_i; w_0 = 1 by convention.
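Under the standard MNL specification with a no-purchase weight w_0 = 1, the choice probabilities and revenues take the familiar closed form sketched below; the weights and prices in the snippet are arbitrary illustrative values.

```python
# MNL benchmark: P(i | M) = w_i / (w_0 + sum_{j in M} w_j), with w_0 = 1
# for the no-purchase option, and R(M) = sum_{i in M} p_i * P(i | M).
def mnl_choice_prob(i, M, w, w0=1.0):
    return w[i] / (w0 + sum(w[j] for j in M))

def mnl_revenue(M, prices, w, w0=1.0):
    return sum(prices[i] * mnl_choice_prob(i, M, w, w0) for i in M)

w = {0: 2.0, 1: 1.0, 2: 0.5}            # hypothetical MNL weights
prices = {0: 10.0, 1: 7.0, 2: 3.0}
print(mnl_revenue({0, 1}, prices, w))   # (2*10 + 1*7) / (1 + 2 + 1) = 6.75
```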
In place of making this parametric assumption, we could instead evaluate R(M) using the robust sub-routine developed in the previous section and pick M to maximize this conservative estimate. It is clear that if the MNL model is a poor fit to the true choice model P, our robust approach is likely to outperform the parametric approach substantially. Instead, what we focus on here is what happens if the MNL model is a perfect fit to the true choice model. In this case, the parametric approach is the best possible.
The model and prices were specified using customer utilities for DVDs sold on Amazon.com. We generate synthetic observed data (of both the ranking type and the comparison type) according to this fitted MNL model.
We conduct the following experiments. Quality of Revenue Predictions: For each type of observable data, we compute our estimate of the minimum value that R(M) can take on, consistent with that data, by solving (3). Figures 1(b) and 1(d) compare this robust estimate with the expected revenue under the true (MNL) model for a set of randomly chosen subsets M of the 25 potential DVDs, assuming ranking data and the censored comparison data respectively. In both cases, our procedure produces excellent predictions of expected revenue without making the assumptions on P(·|·) inherent in the MNL model.
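A sketch of how such synthetic data can be produced (my own setup, not the authors' code): full preference lists consistent with an MNL can be sampled via Plackett–Luce sampling, and the resulting comparison marginals, after the censoring step described above, are what would be passed to program (3) and compared against the true MNL revenue.

```python
# Generate synthetic comparison-type marginals from hypothetical MNL weights
# by sampling full preference lists (Plackett-Luce sampling).
import numpy as np

rng = np.random.default_rng(0)
N, n_customers = 5, 10_000
w = rng.uniform(0.5, 2.0, size=N)            # hypothetical MNL weights

def sample_ranking(w, rng):
    remaining = list(range(len(w)))
    ranking = []
    while remaining:
        probs = np.array([w[i] for i in remaining])
        probs /= probs.sum()
        pick = rng.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(pick))
    return ranking                            # products from best to worst

y = np.zeros((N, N))                          # y[i, j]: fraction preferring i to j
for _ in range(n_customers):
    ranking = sample_ranking(w, rng)
    rank = {i: r for r, i in enumerate(ranking)}
    for i in range(N):
        for j in range(N):
            if i != j and rank[i] < rank[j]:
                y[i, j] += 1.0 / n_customers
# y (after censoring, per the text) is the data handed to program (3); the
# resulting minimum revenue is then compared against the true MNL revenue.
print(np.round(y, 3))
```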
Quality of Optimal Solutions to Revenue Maximization Problems: For each type of observable data, we compute optimal offer sets M of varying capacities assuming the fitted MNL model, using an optimization procedure described in [13]. We then evaluate the revenue predictions for these optimal offer sets by solving (3). The gap between the ‘MNL’ and the ‘MIN’ curves is thus an upper bound on the expected revenue loss if one used our nonparametric procedure to pick an optimal offer set M instead of the parametric procedure (which in this setting is optimal). Again, we see that the revenue loss is surprisingly small.
6 Conclusion and Potential Future Directions

We have presented a general framework that allows us to answer questions related to how consumers choose among alternatives, using limited observable data and without making additional parametric assumptions. The approaches we have proposed are feasible from a data availability standpoint as well as a computational standpoint, and provide a much needed non-parametric ‘sub-routine’ for the revenue optimization problems described at the outset.
wordName wordTfidf (topN-words)
[('ajd', 0.419), ('xjd', 0.359), ('revenue', 0.345), ('mnl', 0.339), ('jd', 0.208), ('customer', 0.164), ('bjd', 0.16), ('revenues', 0.16), ('sj', 0.145), ('aj', 0.14), ('assortment', 0.14), ('choice', 0.111), ('sparsest', 0.108), ('parametric', 0.108), ('permutations', 0.106), ('seller', 0.1), ('signature', 0.1), ('pj', 0.097), ('polytope', 0.094), ('xj', 0.093), ('ao', 0.085), ('observable', 0.084), ('ranking', 0.075), ('canonical', 0.074), ('customers', 0.072), ('dollars', 0.064), ('logit', 0.064), ('yd', 0.064), ('purchase', 0.06), ('expected', 0.058), ('products', 0.055), ('ik', 0.054), ('dj', 0.052), ('marketing', 0.052), ('er', 0.046), ('su', 0.045), ('product', 0.045), ('consumers', 0.04), ('ering', 0.04), ('extant', 0.04), ('integrality', 0.04), ('purchases', 0.04), ('rusmevichientong', 0.04), ('sjd', 0.04), ('di', 0.039), ('preference', 0.039), ('standpoint', 0.039), ('generic', 0.038), ('solve', 0.038), ('sn', 0.038), ('program', 0.037), ('availability', 0.036), ('censored', 0.035), ('devavrat', 0.035), ('jagabathula', 0.035), ('limited', 0.035), ('consistent', 0.034), ('questions', 0.034), ('representation', 0.033), ('permutation', 0.033), ('conditions', 0.033), ('prefer', 0.032), ('pricing', 0.032), ('specifying', 0.032), ('tted', 0.031), ('observed', 0.031), ('subject', 0.031), ('procedure', 0.03), ('satis', 0.03), ('lp', 0.03), ('anticipate', 0.03), ('consumer', 0.03), ('theorem', 0.029), ('multinomial', 0.029), ('type', 0.029), ('dth', 0.028), ('prices', 0.028), ('ri', 0.028), ('subroutines', 0.027), ('universe', 0.027), ('vf', 0.027), ('ci', 0.027), ('independence', 0.026), ('simplest', 0.026), ('business', 0.025), ('data', 0.025), ('constraint', 0.024), ('solution', 0.024), ('course', 0.024), ('impose', 0.024), ('generative', 0.024), ('enforces', 0.023), ('faced', 0.023), ('integers', 0.023), ('else', 0.023), ('pi', 0.023), ('pick', 0.023), ('model', 0.023), ('economic', 0.023), ('marginal', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 7 nips-2009-A Data-Driven Approach to Modeling Choice
Author: Vivek Farias, Srikanth Jagabathula, Devavrat Shah
Abstract: We visit the following fundamental problem: For a ‘generic’ model of consumer choice (namely, distributions over preference lists) and a limited amount of data on how consumers actually make decisions (such as marginal preference information), how may one predict revenues from offering a particular assortment of choices? This problem is central to areas within operations research, marketing and econometrics. We present a framework to answer such questions and design a number of tractable algorithms (from a data and computational standpoint) for the same. 1
2 0.11643631 78 nips-2009-Efficient Moments-based Permutation Tests
Author: Chunxiao Zhou, Huixia J. Wang, Yongmei M. Wang
Abstract: In this paper, we develop an efficient moments-based permutation test approach to improve the test s computational efficiency by approximating the permutation distribution of the test statistic with Pearson distribution series. This approach involves the calculation of the first four moments of the permutation distribution. We propose a novel recursive method to derive these moments theoretically and analytically without any permutation. Experimental results using different test statistics are demonstrated using simulated data and real data. The proposed strategy takes advantage of nonparametric permutation tests and parametric Pearson distribution approximation to achieve both accuracy and efficiency. 1 In t ro d u c t i o n Permutation tests are flexible nonparametric alternatives to parametric tests in small samples, or when the distribution of a test statistic is unknown or mathematically intractable. In permutation tests, except exchangeability, no other statistical assumptions are required. The p-values can be obtained by using the permutation distribution. Permutation tests are appealing in many biomedical studies, which often have limited observations with unknown distribution. They have been used successfully in structural MR image analysis [1, 2, 3], in functional MR image analysis [4], and in 3D face analysis [5]. There are three common approaches to construct the permutation distribution [6, 7, 8]: (1) exact permutation enumerating all possible arrangements; (2) approximate permutation based on random sampling from all possible permutations; (3) approximate permutation using the analytical moments of the exact permutation distribution under the null hypothesis. The main disadvantage of the exact permutation is the computational cost, due to the factorial increase in the number of permutations with the increasing number of subjects. The second technique often gives inflated type I errors caused by random sampling. When a large number of repeated tests are needed, the random permutation strategy is also computationally expensive to achieve satisfactory accuracy. Regarding the third approach, the exact permutation distribution may not have moments or moments with tractability. In most applications, it is not the existence but the derivation of moments that limits the third approach. To the best of our knowledge, there is no systematic and efficient way to derive the moments of the permutation distribution. Recently, Zhou [3] proposed a solution by converting the permutation of data to that of the statistic coefficients that are symmetric to the permutation. Since the test statistic coefficients usually have simple presentations, it is easier to track the permutation of the test statistic coefficients than that of data. However, this method requires the derivation of the permutation for each specific test statistic, which is not accessible to practical users. In this paper, we propose a novel strategy by employing a general theoretical method to derive the moments of the permutation distribution of any weighted v-statistics, for both univariate and multivariate data. We note that any moments of the permutation distribution for weighted v-statistics [9] can be considered as a summation of the product of data function term and index function term over a high dimensional index set and all possible permutations. 
Our key idea is to divide the whole index set into several permutation equivalent (see Definition 2) index subsets such that the summation of the data/index function term over all permutations is invariant within each subset and can be calculated without conducting any permutation. Then we can obtain the moments by summing up several subtotals. The proposed method can be extended to equivalent weighted v-statistics by replacing them with monotonic weighted v-statistics. This is due to the fact that only the order of test statistics of all permutations matters for obtaining the p-values, so that the monotonic weighted v-statistics shares the same p-value with the original test statistic. Given the first four moments, the permutation distribution can be well fitted by Pearson distribution series. The p-values are then obtained without conducting any real permutation. For multiple comparison of two-group difference, given the sample size n1 = 21 and n2 = 21, the number of tests m = 2,000, we need to conduct m×(n1 + n2 )!/n1 !/n2 ! 1.1×1015 permutations for the exact permutation test. Even for 20,000 random permutations per test, we still need m×20,000 4×107 permutations. Alternatively, our moments-based permutation method using Pearson distribution approximation only involves the calculation of the first four analytically-derived moments of exact permutation distributions to achieve high accuracy (see section 3). Instead of calculating test statistics in factorial scale with exact permutation, our moments-based permutation only requires computation of polynomial order. For example, the computational cost for univariate mean difference test statistic and modified multivariate Hotelling's T2 test statistics [8] are O(n) and O(n3 ), respectively, where n = n 1 + n2 . 2 M e t h o d o lo g y In this section, we shall mainly discuss how to calculate the moments of the permutation distribution for weighted v-statistics. For other test statistics, a possible solution is to replace them with their equivalent weighted v-statistics by monotonic transforms. The detailed discussion about equivalent test statistics can be found in [7, 8, 10]. 2.1 C o m p ut a t i o n a l c h a l l e n g e Let us first look at a toy example. Suppose we have a two-group univariate data x = ( x1 , L , xn1 , xn1 +1 ,L , xn1 + n2 ) , where the first n1 elements are in group A and the rest, n2 ,are in group B. For comparison of the two groups, the hypothesis is typically constructed as: H 0 : m A = mB vs. H a : m A ¹ m B , where m A , m B are the population means of the groups A n1 n i =1 i = n1 +1 and B, respectively. Define x A = å xi / n1 and xB = å xi / n2 as the sample means of two groups, where n=n1+n2. We choose the univariate group mean difference statistic, i.e., n T ( x) = x A - xB = å w(i ) xi , i =1 where the index function as the test w(i ) = 1/ n1 , if i Î {1, L , n1} and w(i ) = -1/ n2 , if i Î {n1 + 1, L, n} . Then the total number of all possible permutations of {1, L, n} is n!. To calculate the fourth moment of the permutation distribution, n n n n n 1 1 4 å ( å w(i ) xp (i ) ) = å å å å å w(i1 ) w(i2 ) w(i3 ) w(i4 )xp ( i1 ) xp ( i2 ) xp ( i3 ) xp ( i4 ) , n ! p ÎS n i =1 n ! p ÎS n i1 =1 i2 =1 i3 =1 i4 =1 where is the permutation operator and the symmetric group Sn [11] includes all distinct permutations. The above example shows that the moment calculation can be considered as a summation over all possible permutations and a large index set. 
It is noticeable that the computational challenge here is to go through the factorial level permutations and polynomial level indices. Ep (T 4 ( x ))= 2.2 P a r t i t i o n t h e i n de x s e t In this paper, we assume that the test statistic T can be expressed as a weighted v-statistic of n n i1 =1 id =1 degree d [9], that is, T ( x) = å L å w(i1 , L , id ) h( xi1 ,L , xid ) , where x = ( x1 , x 2 , L , x n ) T is a data with n observations, and w is a symmetric index function. h is a symmetric data function, i.e., invariant under permutation of (i1 ,L , id ) . Though the symmetry property is not required for our method, it helps reduce the computational cost. Here, each observation xk can be either univariate or multivariate. In the above toy example, d=1 and h is the identity function. Therefore, the r-th moment of the test statistic from the permutated data is: Ep (T r ( x)) = Ep ( å w(i1 ,L , id )h( xp (i1 ) ,L , xp ( id ) ))r i1 ,i2 ,L,id = Ep [ å i1(1) ,L, id (1) , r r k =1 k =1 { Õ w(i1(k ) ,L, id ( k ) ) Õ h( xp (i ( k ) ) ,L, xp (i ( k ) ) )}] . L i1( r ) ,L,id ( r ) d 1 Then we can exchange the summation order of permutations and that of indices, Ep (T r ( x)) = å i1(1) ,L, id (1) , L i1( r ) ,L,id ( r ) r r k =1 k =1 {( Õ w(i1( k ) ,L , id ( k ) )) Ep ( Õ h( xp (i ( k ) ) ,L, xp (i ( k ) ) ))}. d 1 Thus any moment of permutation distribution can be considered as a summation of the product of data function term and index function term over a high dimensional index set and all possible permutations. Since all possible permutations map any index value between 1 and n to all possible index r values from 1 to n with equal probability, Ep ( Õ h( xp (i ( k ) ) ,L , xp (i ( k ) ) )) , the summation of k =1 1 d data function over all permutations is only related to the equal/unequal relationship among indices. It is natural to divide the whole index set U = {i1 ,L , id }r = {(i1(1) , L , id (1) ), L (r ) , ( i1 , L , id (r ) r )} into the union of disjoint index subsets, in which Ep ( Õ h( xp (i ( k ) ) ,L , xp (i ( k ) ) )) k =1 1 d is invariant. Definition 1. Since h is a symmetric function, two index elements (i1 ,L , id ) and ( j1 ,L , jd ) are said to be equivalent if they are the same up to the order. For example, for d = 3, (1, 4, 5) = (1,5,4) = (4,1,5) = (4,5,1) = (5,1,4) = (5,4,1). Definition 2. Two indices {(i1(1) , L , id (1) ), L , (i1( r ) , L , id ( r ) )} and {( j1(1) , L , jd (1) ), L , ( j1( r ) , L , jd ( r ) )} are said to be permutation equivalent/ if there exists a permutation p Î Sn such that {(p (i1(1) ), L , p (id (1) )), L , (p (i1( r ) ), L , p (id ( r ) ))} = {( j1(1) , L , jd (1) ), L , ( j1( r ) , L , jd ( r ) )} . Here
3 0.070664823 31 nips-2009-An LP View of the M-best MAP problem
Author: Menachem Fromer, Amir Globerson
Abstract: We consider the problem of finding the M assignments with maximum probability in a probabilistic graphical model. We show how this problem can be formulated as a linear program (LP) on a particular polytope. We prove that, for tree graphs (and junction trees in general), this polytope has a particularly simple form and differs from the marginal polytope in a single inequality constraint. We use this characterization to provide an approximation scheme for non-tree graphs, by using the set of spanning trees over such graphs. The method we present puts the M -best inference problem in the context of LP relaxations, which have recently received considerable attention and have proven useful in solving difficult inference problems. We show empirically that our method often finds the provably exact M best configurations for problems of high tree-width. A common task in probabilistic modeling is finding the assignment with maximum probability given a model. This is often referred to as the MAP (maximum a-posteriori) problem. Of particular interest is the case of MAP in graphical models, i.e., models where the probability factors into a product over small subsets of variables. For general models, this is an NP-hard problem [11], and thus approximation algorithms are required. Of those, the class of LP based relaxations has recently received considerable attention [3, 5, 18]. In fact, it has been shown that some problems (e.g., fixed backbone protein design) can be solved exactly via sequences of increasingly tighter LP relaxations [13]. In many applications, one is interested not only in the MAP assignment but also in the M maximum probability assignments [19]. For example, in a protein design problem, we might be interested in the M amino acid sequences that are most stable on a given backbone structure [2]. In cases where the MAP problem is tractable, one can devise tractable algorithms for the M best problem [8, 19]. Specifically, for low tree-width graphs, this can be done via a variant of max-product [19]. However, when finding MAPs is not tractable, it is much less clear how to approximate the M best case. One possible approach is to use loopy max-product to obtain approximate max-marginals and use those to approximate the M best solutions [19]. However, this is largely a heuristic and does not provide any guarantees in terms of optimality certificates or bounds on the optimal values. LP approximations to MAP do enjoy such guarantees. Specifically, they provide upper bounds on the MAP value and optimality certificates. Furthermore, they often work for graphs with large tree-width [13]. The goal of the current work is to leverage the power of LP relaxations to the M best case. We begin by focusing on the problem of finding the second best solution. We show how it can be formulated as an LP over a polytope we call the “assignment-excluding marginal polytope”. In the general case, this polytope may require an exponential number of inequalities, but we prove that when the graph is a tree it has a very compact representation. We proceed to use this result to obtain approximations to the second best problem, and show how these can be tightened in various ways. Next, we show how M best assignments can be found by relying on algorithms for 1 second best assignments, and thus our results for the second best case can be used to devise an approximation algorithm for the M best problem. 
We conclude by applying our method to several models, showing that it often finds the exact M best assignments. 1 The M-best MAP problem and its LP formulation Consider a function on n variables defined as: f (x1 , . . . , xn ; θ) = θij (xi , xj ) + ij∈E θi (xi ) (1) i∈V where V and E are the vertices and nodes of a graph G with n nodes. We shall be interested in the M assignments with largest f (x; θ) value.1 Denote these by x(1) , . . . , x(M) , so that x(1) is the assignment that maximizes f (x; θ), x(2) is the 2nd best assignment, etc. The MAP problem (i.e., finding x(1) ) can be formulated as an LP as follows [15]. Let µ be a vector of distributions that includes {µij (xi , xj )}ij∈E over edge variables and {µi (xi )}i∈V over nodes. The set of µ that arise from some joint distribution is known as the marginal polytope [15] and is denoted by M(G). Formally: M(G) = {µ | ∃p(x) ∈ ∆ s.t. p(xi , xj ) = µij (xi , xj ) , p(xi ) = µi (xi )} . where ∆ is the set of distributions on x. The MAP problem can then be shown to be equivalent to the following LP:2 max f (x; θ) = max µ · θ , (2) x µ∈M(G) It can be shown that this LP always has a maximizing µ that is a vertex of M(G) and is integral. Furthermore, this µ corresponds to the MAP assignment x(1) . Although the number of variables in this LP is only O(|E| + |V |), the difficulty comes from an exponential number of linear inequalities generally required to describe the marginal polytope M(G). We shall find it useful to define a mapping between assignments x and integral vertices of the polytope. Given an integral vertex v ∈ M(G), define x(v) to be the assignment that maximizes vi (xi ). And, given an assignment z define v(z) to be the integral vertex in M(G) corresponding to the assignment z. Thus the LP in Eq. 2 will be maximized by v(x(1) ). One simple outer bound of the marginal polytope is the local polytope ML (G), which only enforces pairwise constraints between variables: µi (xi ) = 1 (3) µij (xi , xj ) = µj (xj ), µij (xi , xj ) = µi (xi ), ML (G) = µ ≥ 0 x x x i j i The LP relaxation is then to maximize µ · θ where µ ∈ ML (G). For tree structured graphs, ML (G) = M(G) [15] and thus the LP relaxation yields the exact MAP x(1) . An LP Formulation for the 2nd -best MAP 2 Assume we found the MAP assignment x(1) and are now interested in finding x(2) . Is there a simple LP whose solution yields x(2) ? We begin by focusing on the case where G is a tree so that the local LP relaxation is exact. We first treat the case of a connected tree. To construct an LP whose solution is x(2) , a natural approach is to use the LP for x(1) (i.e., the LP in Eq. 2) but somehow eliminate the solution x(1) using additional constraints. This, however, is somewhat trickier than it sounds. The key difficulty is that the new constraints should not generate fractional vertices, so that the resulting LP is still exact. We begin by defining the polytope over which we need to optimize in order to obtain x(2) . 1 2 This is equivalent to finding P maximum probability assignments for a model p(x) ∝ ef (x;θ) . the P P P We use the notation µ · θ = ij∈E xi ,xj µij (xi , xj )θij (xi , xj ) + i xi µi (xi )θi (xi ) 2 Definition 1. The assignment-excluding marginal polytope is defined as: ˆ M(G, z) = {µ | ∃p(x) ∈ ∆ s.t. p(z) = 0, p(xi , xj ) = µij (xi , xj ), p(xi ) = µi (xi )} . ˆ M(G, z) is simply the convex hull of all (integral) vectors v(x) for x = z. 
(4) ˆ The following result shows that optimizing over M(G, x(1) ) will yield the second best soluˆ tion x(2) , so that we refer to M(G, x(1) ) as the second-best marginal polytope. Lemma 1. The 2nd best solution is obtained via the following LP: maxx=x(1) f (x; θ) = maxµ∈M(G,x(1) ) µ · θ. Furthermore, the µ that maximizes the LP on ˆ the right is integral and corresponds to the second-best MAP assignment x(2) . The proof is similar to that of Eq. 2: instead of optimizing over x, we optimize over distributions p(x), while enforcing that p(x(1) ) = 0 so that x(1) is excluded from the maximization. The key question which we now address is how to obtain a simple characterization of ˆ ˆ M(G, z). Intuitively, it would seems that M(G, z) should be “similar” to M(G), such that it can be described as M(G) plus some constraints that “block” the assignment z. To illustrate the difficulty in finding such “blocking” constraints, consider the following constraint, originally suggested by Santos [10]: i µi (zi ) ≤ n − 1. This inequality is not satisfied by µ = v(z) since v(z) attains the value n for the LHS of the above. Furthermore, for any x = z and µ = v(x), the LHS would be n − 1 or less. Thus, this inequality separates ˆ v(z) from all other integral vertices. One might conclude that we can define M(G, z) by adding this inequality to M(G). The difficulty is that the resulting polytope has fractional vertices,3 and maximizing over it won’t generally yield an integral solution. It turns out that there is a different inequality that does yield an exact characterization of ˆ M(G, z) when G is a tree. We now define this inequality and state our main theorem. Definition 2. Consider the functional I(µ, z) (which is linear in µ): (1 − di )µi (zi ) + I(µ, z) = i µij (zi , zj ) (5) ij∈E where di is the degree of node i in the tree graph G. ˆ Theorem 1. Adding the single inequality I(µ, z) ≤ 0 to M(G) yields M(G, z). ˆ M(G, z) = {µ | µ ∈ M(G), I(µ, z) ≤ 0 } (6) The theorem is proved in the appendix. Taken together with Lemma 1, it implies that x(2) may be obtained via an LP that is very similar to the MAP-LP, but has an additional constraint. We note the interesting similarity between I(µ, z) and the Bethe entropy [20]. The only difference is that in Bethe, µi , µij are replaced by H(Xi ), H(Xi , Xj ) respectively.4 The theorem also generalizes to the case where G is not a tree, but we have a junction tree for G. In this case, the theorem still holds if we define a generalized I(µ, z) inequality as: (1 − dS )µS (zS ) + S∈S µC (zC ) ≤ 0 (7) C∈C where C and S are the junction tree cliques and their separators, respectively, and dS is the number of cliques that intersect on separator S. In this case, the marginal polytope should enforce consistency between marginals µC (zC ) and their separators µS (zS ). However, such a characterization requires variables whose cardinality is exponential in the tree-width and is thus tractable only for graphs of low tree-width. In the next section, we address approximations for general graphs. A corresponding result exists for the case when G is a forest. In this case, the inequality in Eq. 6 is modified to: I(µ, z) ≤ |P | − 1, where |P | denotes the number of connected components of G. Interestingly, for a graph without edges, this gives the Santos inequality. 3 Consider the case of a single edge between 2 nodes where the MAP assignment is (0, 0). Adding the inequality µ1 (0) + µ2 (0) ≤ 1 produces the fractional vertex (0.5, 0.5). 
4 The connection to Bethe can be more clearly understood from a duality-based proof of Theorem 1. We will cover this in an extended version of the manuscript. 3 2nd best LPs for general graphs - Spanning tree inequalities 3 When the graph G is not a tree, the marginal polytope M(G) generally requires an exponential number of inequalities. However, as mentioned above, it does have an exact description in terms of marginals over cliques and separators of a junction tree. Given such marginals on ˆ junction tree cliques, we also have an exact characterization of M(G, z) via the constraint in Eq. 7. However, in general, we cannot afford to be exponential in tree-width. Thus a common strategy [15] is to replace M(G) with an outer bound that enforces consistency between marginals on overlapping sets of variables. The simplest example is ML (G) in Eq. 3. ˆ In what follows, we describe an outer-bound approximation scheme for M(G, z). We use ML (G) as the approximation for M(G) (more generally ML (G) can enforce consistency between any set of small regions, e.g., triplets). When G is not a tree, the linear constraint in ˆ Eq. 6 will no longer suffice to derive M(G, z). Moreover, direct application of the inequality will incorrectly remove some integral vertices. An alternative approach is to add inequalities that separate v(z) from the other integral vertices. This will serve to eliminate more and more fractional vertices, and if enough constraints are added, this may result in an integral solution. One obvious family of such constraints are those corresponding to spanning trees in G and have the form of Eq. 5. Definition 3. Consider any T that is a spanning tree of G. Define the functional I T (µ, z): (1 − dT )µi (zi ) + i I T (µ, z) = i µij (zi , zj ) (8) ij∈T where dT is the degree of i in T . We refer to I T (µ, z) ≤ 0 as a spanning tree inequality. i For any sub-tree T of G, the corresponding spanning tree inequality separates the vertex v(z) from the other vertices. This can be shown via similar arguments as in the proof of Theorem 1. Note, however, that the resulting polytope may still have fractional vertices. The above argument shows that any spanning tree provides a separating inequality for ˆ M(G, z). In principle, we would like to use as many such inequalities as possible. Definition 4. The spanning tree assignment-excluding marginal polytope is defined as: ˆ MST (G, z) = µ | µ ∈ ML (G), L ∀ tree T ⊆ E I T (µ, z) ≤ 0 (9) where the ST notation indicates the inclusion of all spanning tree inequalities for G.5 Thus, we would actually like to perform the following optimization problem: max ˆ µ∈MST (G,z) L µ·θ ˆ as an approximation to optimization over M(G, z); i.e., we seek the optimal µ subject to all spanning tree inequalities for G with the ambition that this µ be integral and thus provide the non-z MAP assignment, with a certificate of optimality. Although the number of spanning trees is exponential in n, it turns out that all spanning inequalities can be used in practice. One way to achieve this is via a cutting plane algorithm [12] that finds the most violated spanning tree inequality and adds it to the LP. 
To implement this efficiently, we note that for a particular µ and a spanning tree T , the value of I T (µ, z) can be decomposed into a sum over the edges in T (and a T -independent constant): I T (µ, z) = µi (zi ) µij (zi , zj ) − µi (zi ) − µj (zj ) + (10) i ij∈T The tree maximizing the above is the maximum-weight spanning tree with edge-weights wij = µij (zi , zj ) − µi (zi ) − µj (zj ). It can thus be found efficiently. The cutting plane algorithm proceeds as follows. We start by adding an arbitrary spanning tree. Then, as long as the optimal µ is fractional, we find the spanning tree inequality that µ most violates (where this is implemented via the maximum-weight spanning tree). This constraint will necessarily remove µ from the polytope. If there are no violated inequalities 5 ˆ ˆL Note that M(G, z) ⊆ MST (G, z) ⊂ ML (G). 4 but µ is still fractional, then spanning tree inequalities do not suffice to find an integral solution (but see below on hypertree constraints to add in this case). In practice, we found that only a relatively small number of inequalities are needed to successfully yield an integral solution, or determine that all such inequalities are already satisfied. An alternative approach for solving the all spanning-tree problem is to work via the dual. The dual variables roughly correspond to points in the spanning tree polytope [16], optimization over which can be done in polynomial time, e.g., via the ellipsoid algorithm. We do not pursue this here since the cutting plane algorithm performed well in our experiments. ˆ As mentioned earlier, we can exactly characterize M(G, z) using Eq. 7, albeit at a cost exponential in the tree-width of the graph. A practical compromise would be to use inequalities over clique trees of G, where the cliques are relatively small, e.g., triplets. The corresponding constraint (Eq. 7 with the small cliques and their separators) will necessarily separate v(z) from the other integral vertices. Finding the maximally violated such inequality is an NP-hard problem, equivalent to a prize collecting Steiner tree problem, but recent work has found that such problems are often exactly solvable in practice [7]. It thus might be practical to include all such trees as constraints using a cutting plane algorithm. 4 From 2nd -best to M-best Thus far, we only dealt with the 2nd best case. As we show now, it turns out that the 2nd -best formalism can be used to devise an algorithm for M best. We begin by describing an algorithm for the exact M best and then show how it can be used to approximate those via the approximations for 2nd best described above. Fig. 1 describes our scheme, which we call Partitioning for Enumerating Solutions (or PES) for solving the M best problem. The scheme is general and only assumes that MAP-“like” problems can be solved. It is inspired by several pre-existing M best solution schemes [4, 6, 8, 19] but differs from them in highlighting the role of finding a second best solution within a given subspace. 
for m ← 1 to M do if m = 1 then Run MAP solver to obtain the best assignment: x(1) ≡ arg max f (x; θ) CONSTRAINTS1 ← ∅ else k ←− arg max ′ k′ ∈{1,...,m−1} f (y(k ) ; θ) // sub-space containing mth best assignment x(m) ← y(k) // mth best assignment // A variable choice that distinguishes x(m) from x(k) : (m) (v, a) ← any member of the set {(i, xi (m) ) : xi (k) = xi } CONSTRAINTSm ← CONSTRAINTSk ∪ {xv = a} // Eliminate x(k) (as MAP) from subspace m CONSTRAINTSk ← CONSTRAINTSk ∪ {xv = a} // Eliminate x(m) (as 2nd -best) from subspace k y(k) ← CalcNextBestSolution(CONSTRAINTSk , x(k) ) end y(m) ← CalcNextBestSolution(CONSTRAINTSm , x(m) ) end return {x(m) }M m=1 /* Find next best solution in sub-space defined by CONSTRAINTS */ Function CalcNextBestSolution(CONSTRAINTS, x(∗) ) // x(∗) is the MAP in the sub-space defined by CONSTRAINTS: Run MAP solver to obtain the second-best solution: y ≡ arg max f (x; θ), and return y. x=x(∗) ,CONSTRAINTS end Figure 1: Pseudocode for the PES algorithm. The modus operandi of the PES algorithm is to efficiently partition the search space while systematically excluding all previously determined assignments. Significantly, any MAP 5 Attractive Grids Ranks Run-times 1 50 Mixed Grids Ranks Run-times 1 50 0.5 0 S N B 0 Hard Protein SCP Ranks Run-times 1 50 0.5 S N B 0 0 S+R N+R B+R 0.5 S+R N+R B+R 0 S+R B B+R 0 S+R B B+R Figure 2: Number of best ranks and normalized run-times for the attractive and mixed grids, and the more difficult protein SCP problems. S, N, and B denote the STRIPES, Nilsson, and BMMF algorithms. Algorithms marked with +R denote that regions of variables were added for those runs. solver can be plugged into it, on the condition that it is capable of solving the arg max in the CalcNextBestSolution subroutine. The correctness of PES can be shown by observing that at the M th stage, all previous best solutions are excluded from the optimization and no other assignment is excluded. Of note, this simple partitioning scheme is possible due to the observation that the first-best and second-best MAP assignments must differ in the assignment of at least one variable in the graph. The main computational step of the PES algorithm is to maximize f (x; θ) subject to x = x(∗) and x ∈ CONSTRAINTS (see the CalcNextBestSolution subroutine). The CONSTRAINTS set merely enforces that some of the coordinates of x are either equal to or different from specified values.6 Within the LP, these can be enforced by setting µi (xi = a) = 1 or µi (xi = a) = 0. It can be shown that if one optimizes µ · θ with ˆ these constraints and µ ∈ M(G, x(∗) ), the solution is integral. Thus, the only element ˆ requiring approximation in the general case is the description of M(G, x(∗) ). We choose as ˆ this approximation the polytope MST (G, x(∗) ) in Eq. 9. We call the resulting approximaL tion algorithm Spanning TRee Inequalities and Partitioning for Enumerating Solutions, or STRIPES. In the next section, we evaluate this scheme experimentally. 5 Experiments We compared the performance of STRIPES to the BMMF algorithm [19] and the Lawler/Nilsson algorithm [6, 8]. Nilsson’s algorithm is equivalent to PES where the 2nd best assignment is obtained from maximizations within O(n) partitions, so that its runtime is O(n) times the cost of finding a single MAP. Here we approximated each MAP with its LP relaxation (as in STRIPES), so that both STRIPES and Nilsson come with certificates of optimality when their LP solutions are integral. 
BMMF relies on loopy BP to approximate the M best solutions.7 We used M = 50 in all experiments. To compare the algorithms, we pooled all their solutions, noting the 50 top probabilities, and then counted the fraction of these that any particular algorithm found (its solution rank). For run-time comparisons, we normalized the times by the longest-running algorithm for each example. We begin by considering pairwise MRFs on binary grid graphs of size 10 × 10. In the first experiment, we used an Ising model with attractive (submodular) potentials, a setting in which the pairwise LP relaxation is exact [14]. For each grid edge ij, we randomly chose Jij ∈ [0, 0.5], and local potentials were randomized in the range ±0.5. The results for 25 graphs are shown in Fig. 2. Both the STRIPES and Nilsson algorithms obtained the 50 optimal solutions (as learned from their optimality certificates), while BMMF clearly fared less well for some of the graphs. While the STRIPES algorithm took < 0.5 to 2 minutes to run, the Nilsson algorithm took around 13 minutes. On the other hand, BMMF was quicker, taking around 10 seconds per run, while failing to find a significant portion of the top solutions. Overall, the STRIPES algorithm was required to employ up to 19 spanning tree inequalities per calculation of second-best solution. 6 This is very different from the second best constraint, since setting x1 = 1 blocks all assignments with this value, as opposed to setting x = 1 which blocks only the assignment with all ones. 7 For BMMF, we used the C implementation at http://www.cs.huji.ac.il/~ talyam/ inference.html. The LPs for STRIPES and Nilsson were solved using CPLEX. 6 Next, we studied Ising models with mixed interaction potentials (with Jij and the local potentials randomly chosen in [−0.5, 0.5]). For almost all of the 25 models, all three algorithms were not able to successfully find the top solutions. Thus, we added regions of triplets (two for every grid face) to tighten the LP relaxation (for STRIPES and Nilsson) and to perform GBP instead of BP (for BMMF). This resulted in STRIPES and Nilsson always provably finding the optimal solutions, and BMMF mostly finding these solutions (Fig. 2). For these more difficult grids, however, STRIPES was the fastest of the algorithms, taking 0.5 - 5 minutes. On the other hand, the Nilsson and BMMF algorithms took 18 minutes and 2.5 7 minutes, respectively. STRIPES added up to 23 spanning tree inequalities per iteration. The protein side-chain prediction (SCP) problem is to to predict the placement of amino acid side-chains given a protein backbone [2, 18]. Minimization of a protein energy function corresponds to finding a MAP assignment for a pairwise MRF [19]. We employed the dataset of [18] (up to 45 states per variable, mean approximate tree-width 50), running all algorithms to calculate the optimal side-chain configurations. For 315 of 370 problems in the dataset, the first MAP solution was obtained directly as a result of the LP relaxation having an integral solution (“easy” problems). STRIPES provably found the subsequent top 50 solutions within 4.5 hours for all but one of these cases (up to 8 spanning trees per calculation), and BMMF found the same 50 solutions for each case within 0.5 hours; note that only STRIPES provides a certificate of optimality for these solutions. On the other hand, only for 146 of the 315 problems was the Nilsson method able to complete within five days; thus, we do not compare its performance here. 
For the remaining 55 (“hard”) problems (Fig. 2), we added problem-specific triplet regions using the MPLP algorithm [13]. We then ran the STRIPES algorithm to find the optimal solutions. Surprisingly, it was able to exactly find the 50 top solutions for all cases, using up to 4 standard spanning tree inequalities per second-best calculation. The STRIPES run-times for these problems ranged from 6 minutes to 23 hours. On the other hand, whether running BMMF without these regions (BP) or with the regions (GBP), it did not perform as well as STRIPES in terms of the number of high-ranking solutions or its speed. To summarize, STRIPES provably found the top 50 solutions for 369 of the 370 protein SCP problems. 6 Conclusion ˆ In this work, we present a novel combinatorial object M(G, z) and show its utility in obtaining the M best MAP assignments. We provide a simple characterization of it for tree structured graphs, and show how it can be used for approximations in non-tree graphs. As with the marginal polytope, many interesting questions arise about the properties of ˆ M(G, z). For example, in which non-tree cases can we provide a compact characterization (e.g., as for the cut-polytope for planar graphs [1]). Another compelling question is in which problems the spanning tree inequalities are provably optimal. An interesting generalization of our method is to predict diverse solutions satisfying some local measure of “distance” from each other, e.g., as in [2]. Here we studied the polytope that results from excluding one assignment. An intriguing question is to characterize the polytope that excludes M assignments. We have found that it does not simply correspond to adding M constraints I(µ, z i ) ≤ 0 for i = 1, . . . , M , so its ˆ geometry is apparently more complicated than that of M(G, z). Here we used LP solvers to solve for µ. Such generic solvers could be slow for large-scale problems. However, in recent years, specialized algorithms have been suggested for solving MAP-LP relaxations [3, 5, 9, 17]. These use the special form of the constraints to obtain local-updates and more scalable algorithms. We intend to apply these schemes to our method. Finally, our empirical results show that our method indeed leverages the power of LP relaxations and yields exact M best optimal solutions for problems with large tree-width. Acknowledgements We thank Nati Linial for his helpful discussions and Chen Yanover and Talya Meltzer for their insight and help in running BMMF. We also thank the anonymous reviewers for their useful advice. 7 A Proof of Theorem 1 Recall that for any µ ∈ M(G), there exists a probability density p(x) s.t. µ = x p(x)v(x). Denote pµ (z) as the minimal value of p(z) among all p(x) that give µ. We prove that ˆ pµ (z) = max(0, I(µ, z)), from which the theorem follows (since pµ (z) = 0 iff µ ∈ M(G, z)). The proof is by induction on n. For n = 1, the node has degree 0, so I(µ, z) = µ1 (z1 ). Clearly, pµ (z) = µ1 (z1 ), so pµ (z) = I(µ, z). For n > 1, there must exist a leaf in G ˆ (assume that its index is n and its neighbor’s is n − 1). Denote G as the tree obtained ˆ by removing node n and its edge with n − 1. For any assignment x, denote x as the corresponding sub-assignment for the first n − 1 variables. Also, any µ can be derived by ˆ ˆ adding appropriate coordinates to a unique µ ∈ M(G). For an integral vertex µ = v(x), ˆˆ ˆ ˆ ˆ ˆ x denote its projected µ as v (ˆ ). Denote by I(µ, z ) the functional in Eq. 5 applied to G. 
For any µ and its projection µ̂, it can be seen that

I(µ, z) = Î(µ̂, ẑ) − α,   (11)

where we define α = Σ_{x_n ≠ z_n} µ_{n−1,n}(z_{n−1}, x_n) (so 0 ≤ α ≤ 1). The inductive assumption gives a p̂(x̂) that has marginals µ̂ and also satisfies p̂(ẑ) = max(0, Î(µ̂, ẑ)). We next use p̂(x̂) to construct a p(x) that has marginals µ and the desired minimal p_µ(z). Consider three cases.

I. I(µ, z) ≤ 0 and Î(µ̂, ẑ) ≤ 0. From the inductive assumption, p̂_µ̂(ẑ) = 0, so we define

p(x) = p̂(x̂) µ_{n−1,n}(x_{n−1}, x_n) / µ_{n−1}(x_{n−1}),   (12)

which indeed marginalizes to µ, and p(z) = 0, so that p_µ(z) = 0 as required. (If µ_{n−1}(x_{n−1}) = 0, then p̂(x̂) is necessarily 0, in which case we define p(x) = 0.) Note that this construction is identical to that used in proving that M_L(G) = M(G) for a tree graph G.

II. I(µ, z) > 0. Based on Eq. 11 and α ≥ 0, we have Î(µ̂, ẑ) > 0. Applying the inductive assumption to µ̂, we obtain p̂_µ̂(ẑ) = Î(µ̂, ẑ) > 0. Now define p(x), by cases on whether x_{n−1} = z_{n−1} and x_n = z_n, so that p(z) = I(µ, z): whenever x_{n−1} ≠ z_{n−1}, p(x) is given as in Eq. 12, and for x_{n−1} = z_{n−1} the mass p̂(x̂) and the pair marginal µ_{n−1,n}(z_{n−1}, x_n) are split over the values of x_n so that the assignment z itself receives p(z) = I(µ, z). Simple algebra shows that p(x) is non-negative and has µ as its marginals. We now show that this p(z) is minimal. Based on the inductive assumption and Eq. 11, it can easily be shown that I(v(z), z) = 1 and I(v(x), z) ≤ 0 for x ≠ z. For any p(x) s.t. µ = Σ_x p(x) v(x), linearity gives I(µ, z) = p(z) + Σ_{x ≠ z} p(x) I(v(x), z) ≤ p(z) (since I(v(x), z) ≤ 0 for x ≠ z). Since the p(z) we define achieves this lower bound, it is clearly minimal.

III. I(µ, z) ≤ 0 but Î(µ̂, ẑ) > 0. Applying the inductive assumption to µ̂, we see that p̂_µ̂(ẑ) = Î(µ̂, ẑ) > 0; Eq. 11 then implies α − Î(µ̂, ẑ) ≥ 0. Define β = µ_{n−1}(z_{n−1}) − p̂_µ̂(ẑ), which is non-negative since µ_{n−1}(z_{n−1}) = µ̂_{n−1}(ẑ_{n−1}) and p̂ marginalizes to µ̂. Define p(x) again by cases: whenever x_{n−1} ≠ z_{n−1}, p(x) is given as in Eq. 12, and for x_{n−1} = z_{n−1} the mass p̂(x̂) is redistributed over the values of x_n using weights built from µ_{n−1,n}(z_{n−1}, x_n)/β and (α − Î(µ̂, ẑ))/α, in such a way that no mass is placed on z. The resulting p(x) indeed marginalizes to µ, and p(z) = 0, so that p_µ(z) = 0, as required.

References

[1] F. Barahona. On cuts and matchings in planar graphs. Math. Program., 60(1):53–68, 1993.
[2] M. Fromer and C. Yanover. Accurate prediction for atomic-level protein design and its application in diversifying the near-optimal sequence space. Proteins: Structure, Function, and Bioinformatics, 75:682–705, 2009.
[3] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA, 2007.
[4] E. Kloppmann, G. M. Ullmann, and T. Becker. An extended dead-end elimination algorithm to determine gap-free lists of low energy states. Journal of Comp. Chem., 28:2325–2335, 2007.
[5] N. Komodakis and N. Paragios. Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles. In D. Forsyth, P. Torr, and A. Zisserman, editors, ECCV, pages 806–820, Heidelberg, Germany, 2008. Springer.
[6] E. L. Lawler. A procedure for computing the K best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18(7):401–405, 1972.
[7] I. Ljubic, R. Weiskircher, U. Pferschy, G. W. Klau, P. Mutzel, and M. Fischetti. An algorithmic framework for the exact solution of the prize-collecting Steiner tree problem. Mathematical Programming, 105:427–449, Feb 2006.
[8] D. Nilsson. An efficient algorithm for finding the M most probable configurations in probabilistic expert systems. Statistics and Computing, 8:159–173, Jun 1998.
[9] P. Ravikumar, A. Agarwal, and M. Wainwright. Message-passing for graph-structured linear programs: proximal projections, convergence and rounding schemes. In Proc. of the 25th International Conference on Machine Learning, pages 800–807, New York, NY, USA, 2008. ACM.
[10] E. Santos. On the generation of alternative explanations with implications for belief revision. In Proc. of the 7th Annual Conference on Uncertainty in Artificial Intelligence, 1991.
[11] Y. Shimony. Finding the MAPs for belief networks is NP-hard. Artificial Intelligence, 68(2):399–410, 1994.
[12] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1393–1400. MIT Press, Cambridge, MA, 2007.
[13] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proc. of the 24th Annual Conference on Uncertainty in Artificial Intelligence, pages 503–510, 2008.
[14] B. Taskar, S. Lacoste-Julien, and M. I. Jordan. Structured prediction, dual extragradient and Bregman projections. J. Mach. Learn. Res., 7:1627–1653, 2006.
[15] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, 2008.
[16] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.
[17] T. Werner. A linear programming approach to max-sum problem: A review. IEEE Trans. Pattern Anal. Mach. Intell., 29(7):1165–1179, 2007.
[18] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Journal of Machine Learning Research, 7:1887–1907, 2006.
[19] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[20] J. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.
4 0.061821286 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing
Author: Matthias Seeger
Abstract: We show how to sequentially optimize magnetic resonance imaging measurement designs over stacks of neighbouring image slices, by performing convex variational inference on a large scale non-Gaussian linear dynamical system, tracking dominating directions of posterior covariance without imposing any factorization constraints. Our approach can be scaled up to high-resolution images by reductions to numerical mathematics primitives and parallelization on several levels. In a first study, designs are found that improve significantly on others chosen independently for each slice or drawn at random. 1
5 0.058198988 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank
Author: Wei Chen, Tie-yan Liu, Yanyan Lan, Zhi-ming Ma, Hang Li
Abstract: Learning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking functions by minimizing loss functions, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking functions. In this work, we reveal the relationship between ranking measures and loss functions in learning-to-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that the loss functions of these methods are upper bounds of the measure-based ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and define a so-called essential loss for ranking as the weighted sum of the classification errors of individual tasks in the sequence. We have proved that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. Experimental results on benchmark datasets show that the modifications can lead to better ranking performances, demonstrating the correctness of our theoretical analysis. 1
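Since the abstract above is built around ranking measures such as NDCG, a small self-contained sketch of that measure may help; it is not code from the paper, and the gain and discount conventions used here (gain 2^rel − 1, logarithmic position discount) are one common choice among several.

import math

def dcg(relevances):
    # discounted cumulative gain of a ranked list of graded relevances
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    # normalized DCG: DCG of the predicted ranking divided by the ideal DCG
    r = ranked_relevances[:k] if k else ranked_relevances
    ideal = sorted(ranked_relevances, reverse=True)[:len(r)]
    ideal_dcg = dcg(ideal)
    return dcg(r) / ideal_dcg if ideal_dcg > 0 else 0.0

# relevance grades of documents in the order a ranking function placed them
print(ndcg([3, 2, 3, 0, 1, 2], k=5))

In this notation, the paper's result relates surrogate losses minimized by methods such as RankNet or ListMLE to measure-based ranking errors of the form 1 − NDCG.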
6 0.05507483 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity
7 0.053877342 230 nips-2009-Statistical Consistency of Top-k Ranking
8 0.051213577 53 nips-2009-Complexity of Decentralized Control: Special Cases
9 0.045805838 67 nips-2009-Directed Regression
10 0.045636211 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling
11 0.044101231 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
12 0.043162625 87 nips-2009-Exponential Family Graph Matching and Ranking
13 0.0429409 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
14 0.041919857 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations
15 0.04164885 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization
16 0.04096444 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification
17 0.039967153 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry
18 0.037741851 239 nips-2009-Submodularity Cuts and Applications
19 0.0377094 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
20 0.037366834 226 nips-2009-Spatial Normalized Gamma Processes
topicId topicWeight
[(0, -0.126), (1, 0.048), (2, -0.004), (3, -0.025), (4, -0.022), (5, -0.041), (6, -0.034), (7, -0.11), (8, 0.015), (9, -0.07), (10, -0.01), (11, 0.038), (12, -0.03), (13, 0.004), (14, -0.01), (15, 0.034), (16, 0.031), (17, -0.009), (18, -0.042), (19, 0.048), (20, 0.004), (21, 0.005), (22, -0.055), (23, 0.02), (24, -0.027), (25, 0.066), (26, -0.027), (27, -0.043), (28, -0.169), (29, 0.06), (30, 0.02), (31, 0.028), (32, 0.016), (33, -0.033), (34, 0.03), (35, 0.058), (36, 0.009), (37, 0.101), (38, -0.01), (39, -0.072), (40, 0.047), (41, -0.04), (42, -0.008), (43, -0.01), (44, -0.034), (45, -0.024), (46, 0.106), (47, -0.083), (48, -0.05), (49, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.87992799 7 nips-2009-A Data-Driven Approach to Modeling Choice
Author: Vivek Farias, Srikanth Jagabathula, Devavrat Shah
Abstract: We visit the following fundamental problem: For a ‘generic’ model of consumer choice (namely, distributions over preference lists) and a limited amount of data on how consumers actually make decisions (such as marginal preference information), how may one predict revenues from offering a particular assortment of choices? This problem is central to areas within operations research, marketing and econometrics. We present a framework to answer such questions and design a number of tractable algorithms (from a data and computational standpoint) for the same. 1
2 0.71370965 78 nips-2009-Efficient Moments-based Permutation Tests
Author: Chunxiao Zhou, Huixia J. Wang, Yongmei M. Wang
Abstract: In this paper, we develop an efficient moments-based permutation test approach to improve the test's computational efficiency by approximating the permutation distribution of the test statistic with Pearson distribution series. This approach involves the calculation of the first four moments of the permutation distribution. We propose a novel recursive method to derive these moments theoretically and analytically without any permutation. Experimental results using different test statistics are demonstrated using simulated data and real data. The proposed strategy takes advantage of nonparametric permutation tests and parametric Pearson distribution approximation to achieve both accuracy and efficiency.

1 Introduction

Permutation tests are flexible nonparametric alternatives to parametric tests in small samples, or when the distribution of a test statistic is unknown or mathematically intractable. In permutation tests, except exchangeability, no other statistical assumptions are required. The p-values can be obtained by using the permutation distribution. Permutation tests are appealing in many biomedical studies, which often have limited observations with unknown distribution. They have been used successfully in structural MR image analysis [1, 2, 3], in functional MR image analysis [4], and in 3D face analysis [5]. There are three common approaches to construct the permutation distribution [6, 7, 8]: (1) exact permutation, enumerating all possible arrangements; (2) approximate permutation, based on random sampling from all possible permutations; (3) approximate permutation using the analytical moments of the exact permutation distribution under the null hypothesis. The main disadvantage of the exact permutation is the computational cost, due to the factorial increase in the number of permutations with the increasing number of subjects. The second technique often gives inflated type I errors caused by random sampling. When a large number of repeated tests are needed, the random permutation strategy is also computationally expensive to achieve satisfactory accuracy. Regarding the third approach, the exact permutation distribution may not have moments, or may not have moments with tractable form. In most applications, it is not the existence but the derivation of moments that limits the third approach. To the best of our knowledge, there is no systematic and efficient way to derive the moments of the permutation distribution. Recently, Zhou [3] proposed a solution by converting the permutation of the data to that of the statistic coefficients, which are symmetric under the permutation. Since the test statistic coefficients usually have simple presentations, it is easier to track the permutation of the test statistic coefficients than that of the data. However, this method requires the derivation of the permutation for each specific test statistic, which is not accessible to practical users. In this paper, we propose a novel strategy by employing a general theoretical method to derive the moments of the permutation distribution of any weighted v-statistic, for both univariate and multivariate data. We note that any moment of the permutation distribution for weighted v-statistics [9] can be considered as a summation of the product of a data function term and an index function term over a high-dimensional index set and all possible permutations.
Our key idea is to divide the whole index set into several permutation-equivalent (see Definition 2) index subsets such that the summation of the data/index function term over all permutations is invariant within each subset and can be calculated without conducting any permutation. Then we can obtain the moments by summing up these subtotals. The proposed method can be extended to equivalent weighted v-statistics by replacing them with monotonic weighted v-statistics. This is due to the fact that only the order of the test statistics over all permutations matters for obtaining the p-values, so that a monotonic weighted v-statistic shares the same p-value with the original test statistic. Given the first four moments, the permutation distribution can be well fitted by Pearson distribution series. The p-values are then obtained without conducting any real permutation. For multiple comparison of a two-group difference, given the sample sizes n_1 = 21 and n_2 = 21 and the number of tests m = 2,000, we would need to conduct m × (n_1 + n_2)!/(n_1! n_2!) ≈ 1.1 × 10^15 permutations for the exact permutation test. Even with 20,000 random permutations per test, we still need m × 20,000 ≈ 4 × 10^7 permutations. Alternatively, our moments-based permutation method using Pearson distribution approximation only involves the calculation of the first four analytically derived moments of the exact permutation distributions to achieve high accuracy (see Section 3). Instead of calculating test statistics at factorial scale with exact permutation, our moments-based permutation only requires computation of polynomial order. For example, the computational costs for the univariate mean difference test statistic and the modified multivariate Hotelling's T^2 test statistic [8] are O(n) and O(n^3), respectively, where n = n_1 + n_2.

2 Methodology

In this section, we shall mainly discuss how to calculate the moments of the permutation distribution for weighted v-statistics. For other test statistics, a possible solution is to replace them with their equivalent weighted v-statistics by monotonic transforms. A detailed discussion of equivalent test statistics can be found in [7, 8, 10].

2.1 Computational challenge

Let us first look at a toy example. Suppose we have two-group univariate data x = (x_1, …, x_{n_1}, x_{n_1+1}, …, x_{n_1+n_2}), where the first n_1 elements are in group A and the rest, n_2, are in group B. For comparison of the two groups, the hypothesis is typically constructed as H_0: µ_A = µ_B vs. H_a: µ_A ≠ µ_B, where µ_A, µ_B are the population means of groups A and B, respectively. Define x̄_A = Σ_{i=1}^{n_1} x_i / n_1 and x̄_B = Σ_{i=n_1+1}^{n} x_i / n_2 as the sample means of the two groups, where n = n_1 + n_2. We choose the univariate group mean difference statistic as the test statistic, i.e.,

T(x) = x̄_A − x̄_B = Σ_{i=1}^{n} w(i) x_i,

where the index function is w(i) = 1/n_1 if i ∈ {1, …, n_1} and w(i) = −1/n_2 if i ∈ {n_1 + 1, …, n}. Then the total number of all possible permutations of {1, …, n} is n!. To calculate the fourth moment of the permutation distribution,

E_π(T^4(x)) = (1/n!) Σ_{π ∈ S_n} ( Σ_{i=1}^{n} w(i) x_{π(i)} )^4
            = (1/n!) Σ_{π ∈ S_n} Σ_{i_1=1}^{n} Σ_{i_2=1}^{n} Σ_{i_3=1}^{n} Σ_{i_4=1}^{n} w(i_1) w(i_2) w(i_3) w(i_4) x_{π(i_1)} x_{π(i_2)} x_{π(i_3)} x_{π(i_4)},

where π denotes a permutation and the symmetric group S_n [11] includes all distinct permutations. The above example shows that the moment calculation can be considered as a summation over all possible permutations and a large index set.
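The fourth-moment expression above can be checked by brute force for a very small example, which also illustrates the factorial cost the authors are avoiding; the group sizes, random data, and seed below are arbitrary toy choices of ours, not values from the paper.

import itertools
import numpy as np

n1, n2 = 3, 3
n = n1 + n2
rng = np.random.default_rng(1)
x = rng.normal(size=n)
# index function for the group mean-difference statistic T(x) = x_bar_A - x_bar_B
w = np.array([1.0 / n1] * n1 + [-1.0 / n2] * n2)

def T(perm):
    # test statistic evaluated on the permuted data x_{pi(i)}
    return float(np.dot(w, x[list(perm)]))

# exact fourth moment: average of T^4 over all n! permutations
moment4 = np.mean([T(p) ** 4 for p in itertools.permutations(range(n))])
print(moment4)

For the paper's setting of n_1 = n_2 = 21, the same enumeration would require 42! evaluations, which is exactly the gap the analytic moment derivation is meant to close.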
It is noticeable that the computational challenge here is to go through the factorial-level permutations and the polynomial-level indices.

2.2 Partition the index set

In this paper, we assume that the test statistic T can be expressed as a weighted v-statistic of degree d [9], that is,

T(x) = Σ_{i_1=1}^{n} … Σ_{i_d=1}^{n} w(i_1, …, i_d) h(x_{i_1}, …, x_{i_d}),

where x = (x_1, x_2, …, x_n)^T is a data vector with n observations, w is a symmetric index function, and h is a symmetric data function, i.e., invariant under permutation of (i_1, …, i_d). Though the symmetry property is not required for our method, it helps reduce the computational cost. Here, each observation x_k can be either univariate or multivariate. In the above toy example, d = 1 and h is the identity function. Therefore, the r-th moment of the test statistic over the permuted data is:

E_π(T^r(x)) = E_π( Σ_{i_1,…,i_d} w(i_1, …, i_d) h(x_{π(i_1)}, …, x_{π(i_d)}) )^r
            = E_π[ Σ_{i_1^(1),…,i_d^(1)} … Σ_{i_1^(r),…,i_d^(r)} { Π_{k=1}^{r} w(i_1^(k), …, i_d^(k)) Π_{k=1}^{r} h(x_{π(i_1^(k))}, …, x_{π(i_d^(k))}) } ].

Then we can exchange the order of the summation over indices and that over permutations:

E_π(T^r(x)) = Σ_{i_1^(1),…,i_d^(1)} … Σ_{i_1^(r),…,i_d^(r)} { ( Π_{k=1}^{r} w(i_1^(k), …, i_d^(k)) ) E_π( Π_{k=1}^{r} h(x_{π(i_1^(k))}, …, x_{π(i_d^(k))}) ) }.

Thus any moment of the permutation distribution can be considered as a summation of the product of an index function term and a data function term over a high-dimensional index set and all possible permutations. Since all possible permutations map any index value between 1 and n to all possible index values from 1 to n with equal probability, E_π( Π_{k=1}^{r} h(x_{π(i_1^(k))}, …, x_{π(i_d^(k))}) ), the summation of the data function over all permutations, is only related to the equal/unequal relationships among the indices. It is natural to divide the whole index set U = {i_1, …, i_d}^r = {(i_1^(1), …, i_d^(1)), …, (i_1^(r), …, i_d^(r))} into the union of disjoint index subsets on which E_π( Π_{k=1}^{r} h(x_{π(i_1^(k))}, …, x_{π(i_d^(k))}) ) is invariant.

Definition 1. Since h is a symmetric function, two index elements (i_1, …, i_d) and (j_1, …, j_d) are said to be equivalent if they are the same up to order. For example, for d = 3, (1, 4, 5) = (1, 5, 4) = (4, 1, 5) = (4, 5, 1) = (5, 1, 4) = (5, 4, 1).

Definition 2. Two indices {(i_1^(1), …, i_d^(1)), …, (i_1^(r), …, i_d^(r))} and {(j_1^(1), …, j_d^(1)), …, (j_1^(r), …, j_d^(r))} are said to be permutation equivalent if there exists a permutation π ∈ S_n such that {(π(i_1^(1)), …, π(i_d^(1))), …, (π(i_1^(r)), …, π(i_d^(r)))} = {(j_1^(1), …, j_d^(1)), …, (j_1^(r), …, j_d^(r))}.
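The invariance that motivates Definition 2 can be verified numerically: the permutation average of the data-function term depends only on the equal/unequal pattern of the index tuple. A minimal check for d = 1, r = 3 and h taken to be the product of its arguments follows; the toy data, seed, and helper names are our own.

import itertools
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(2)
n, r = 5, 3
x = rng.normal(size=n)

def pattern(idx):
    # canonical equality pattern of an index tuple, e.g. (2, 4, 2) -> (0, 1, 0)
    seen = {}
    return tuple(seen.setdefault(i, len(seen)) for i in idx)

def perm_average(idx):
    # average over all n! permutations of the product x_{pi(i_1)} * ... * x_{pi(i_r)}
    vals = [np.prod([x[p[i]] for i in idx]) for p in itertools.permutations(range(n))]
    return float(np.mean(vals))

groups = defaultdict(set)
for idx in itertools.product(range(n), repeat=r):
    groups[pattern(idx)].add(round(perm_average(idx), 10))

# each equality pattern collapses to a single average value
print(dict(groups))

Because each pattern yields a single value, the moment computation only needs one representative index tuple per pattern, weighted by the pattern's count, which is the polynomial-order shortcut described above.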
3 0.62312299 206 nips-2009-Riffled Independence for Ranked Data
Author: Jonathan Huang, Carlos Guestrin
Abstract: Representing distributions over permutations can be a daunting task due to the fact that the number of permutations of n objects scales factorially in n. One recent way that has been used to reduce storage complexity has been to exploit probabilistic independence, but as we argue, full independence assumptions impose strong sparsity constraints on distributions and are unsuitable for modeling rankings. We identify a novel class of independence structures, called riffled independence, which encompasses a more expressive family of distributions while retaining many of the properties necessary for performing efficient inference and reducing sample complexity. In riffled independence, one draws two permutations independently, then performs the riffle shuffle, common in card games, to combine the two permutations to form a single permutation. In ranking, riffled independence corresponds to ranking disjoint sets of objects independently, then interleaving those rankings. We provide a formal introduction and present algorithms for using riffled independence within Fourier-theoretic frameworks which have been explored by a number of recent papers. 1
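As a concrete illustration of the generative story described in this abstract, the sketch below draws rankings of two disjoint item sets independently and then riffles them together. For simplicity it uses a uniform distribution over interleavings and over each set's rankings, whereas the model allows general distributions; the item sets and helper names are our own toy choices.

import random

def riffle(ranking_a, ranking_b, rng):
    # choose uniformly which of the len(a)+len(b) positions are occupied by
    # items of ranking_a, then fill both rankings in order (a riffle shuffle)
    n, m = len(ranking_a), len(ranking_b)
    positions_a = set(rng.sample(range(n + m), n))
    it_a, it_b = iter(ranking_a), iter(ranking_b)
    return [next(it_a) if pos in positions_a else next(it_b) for pos in range(n + m)]

rng = random.Random(42)
veggies = ["corn", "peas"]
fruits = ["apple", "banana", "cherry"]
# rank each disjoint set independently (here: uniformly at random)
ranking_veggies = rng.sample(veggies, len(veggies))
ranking_fruits = rng.sample(fruits, len(fruits))
print(riffle(ranking_veggies, ranking_fruits, rng))

Drawing many samples this way gives a distribution over full rankings that is riffle independent with respect to the two item sets.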
4 0.49005482 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
Author: Peter Orbanz
Abstract: We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finitedimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data. 1
5 0.47924647 230 nips-2009-Statistical Consistency of Top-k Ranking
Author: Fen Xia, Tie-yan Liu, Hang Li
Abstract: This paper is concerned with the consistency analysis on listwise ranking methods. Among various ranking methods, the listwise methods have competitive performances on benchmark datasets and are regarded as one of the state-of-the-art approaches. Most listwise ranking methods manage to optimize ranking on the whole list (permutation) of objects, however, in practical applications such as information retrieval, correct ranking at the top k positions is much more important. This paper aims to analyze whether existing listwise ranking methods are statistically consistent in the top-k setting. For this purpose, we define a top-k ranking framework, where the true loss (and thus the risks) are defined on the basis of the top-k subgroup of permutations. This framework can include the permutation-level ranking framework proposed in previous work as a special case. Based on the new framework, we derive sufficient conditions for a listwise ranking method to be consistent with the top-k true loss, and show an effective way of modifying the surrogate loss functions in existing methods to satisfy these conditions. Experimental results show that after the modifications, the methods can work significantly better than their original versions. 1
6 0.46483862 69 nips-2009-Discrete MDL Predicts in Total Variation
7 0.44198251 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification
8 0.43088913 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing
9 0.42094174 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank
10 0.39648104 138 nips-2009-Learning with Compressible Priors
11 0.39598328 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
12 0.39418355 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis
13 0.38369939 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity
14 0.37798056 11 nips-2009-A General Projection Property for Distribution Families
15 0.36057925 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations
16 0.35757801 87 nips-2009-Exponential Family Graph Matching and Ranking
17 0.35641497 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction
18 0.35126209 31 nips-2009-An LP View of the M-best MAP problem
19 0.34980056 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information
20 0.34649968 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
topicId topicWeight
[(21, 0.015), (24, 0.055), (25, 0.063), (35, 0.046), (36, 0.093), (39, 0.029), (43, 0.334), (58, 0.096), (61, 0.016), (71, 0.075), (81, 0.015), (86, 0.059), (91, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.73247319 7 nips-2009-A Data-Driven Approach to Modeling Choice
Author: Vivek Farias, Srikanth Jagabathula, Devavrat Shah
Abstract: We visit the following fundamental problem: For a ‘generic’ model of consumer choice (namely, distributions over preference lists) and a limited amount of data on how consumers actually make decisions (such as marginal preference information), how may one predict revenues from offering a particular assortment of choices? This problem is central to areas within operations research, marketing and econometrics. We present a framework to answer such questions and design a number of tractable algorithms (from a data and computational standpoint) for the same. 1
Author: Kwang I. Kim, Florian Steinke, Matthias Hein
Abstract: Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both these problems. If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We first derive the Hessian energy for smooth manifolds and continue to give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference for "linear" functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find a user-defined embedding function given some labeled points which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task. 1
3 0.48671973 97 nips-2009-Free energy score space
Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic
Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.
4 0.48500353 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
Author: Marcel V. Gerven, Botond Cseke, Robert Oostenveld, Tom Heskes
Abstract: We introduce a novel multivariate Laplace (MVL) distribution as a sparsity promoting prior for Bayesian source localization that allows the specification of constraints between and within sources. We represent the MVL distribution as a scale mixture that induces a coupling between source variances instead of their means. Approximation of the posterior marginals using expectation propagation is shown to be very efficient due to properties of the scale mixture representation. The computational bottleneck amounts to computing the diagonal elements of a sparse matrix inverse. Our approach is illustrated using a mismatch negativity paradigm for which MEG data and a structural MRI have been acquired. We show that spatial coupling leads to sources which are active over larger cortical areas as compared with an uncoupled prior. 1
5 0.48435006 260 nips-2009-Zero-shot Learning with Semantic Output Codes
Author: Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell
Abstract: We consider the problem of zero-shot learning, where the goal is to learn a classifier f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes. We provide a formalism for this type of classifier and study its theoretical properties in a PAC framework, showing conditions under which the classifier can accurately predict novel classes. As a case study, we build a SOC classifier for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words. 1
6 0.48384529 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
7 0.48332813 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
8 0.48309791 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization
9 0.48289186 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
10 0.48279575 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA
11 0.48185003 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
12 0.48145449 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
13 0.48020115 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
14 0.4801169 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
15 0.47931111 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals
16 0.4789744 113 nips-2009-Improving Existing Fault Recovery Policies
17 0.47888681 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
18 0.47854665 71 nips-2009-Distribution-Calibrated Hierarchical Classification
19 0.47832969 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
20 0.47708836 100 nips-2009-Gaussian process regression with Student-t likelihood