nips nips2010 nips2010-164 knowledge-graph by maker-knowledge-mining

164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization

Source: pdf

Author: Akshat Kumar, Shlomo Zilberstein

Abstract: Computing a maximum a posteriori (MAP) assignment in graphical models is a crucial inference problem for many practical applications. Several provably convergent approaches have been successfully developed using linear programming (LP) relaxation of the MAP problem. We present an alternative approach, which transforms the MAP problem into that of inference in a mixture of simple Bayes nets. We then derive the Expectation Maximization (EM) algorithm for this mixture that also monotonically increases a lower bound on the MAP assignment until convergence. The update equations for the EM algorithm are remarkably simple, both conceptually and computationally, and can be implemented using a graph-based message passing paradigm similar to max-product computation. Experiments on the real-world protein design dataset show that EM’s convergence rate is signiﬁcantly higher than the previous LP relaxation based approach MPLP. EM also achieves a solution quality within 95% of optimal for most instances. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Computing a maximum a posteriori (MAP) assignment in graphical models is a crucial inference problem for many practical applications. [sent-5, score-0.117]

2 We then derive the Expectation Maximization (EM) algorithm for this mixture that also monotonically increases a lower bound on the MAP assignment until convergence. [sent-8, score-0.259]

3 The update equations for the EM algorithm are remarkably simple, both conceptually and computationally, and can be implemented using a graph-based message passing paradigm similar to max-product computation. [sent-9, score-0.174]

4 Experiments on the real-world protein design dataset show that EM’s convergence rate is signiﬁcantly higher than the previous LP relaxation based approach MPLP. [sent-10, score-0.312]

5 EM also achieves a solution quality within 95% of optimal for most instances. [sent-11, score-0.202]

6 For many practical problems modeled using MRFs, ﬁnding the maximum a posteriori (MAP) assignment or the most probable assignment to the variables in the graph is a key inference problem. [sent-14, score-0.332]

7 For example, MAP estimation has been applied to image processing in computer vision [17, 11], protein design and protein side-chain prediction problems [17, 11], and natural language processing [7]. [sent-15, score-0.497]

8 Particularly, linear programming (LP) relaxation of the MAP problem has emerged as a popular technique to solve large-scale problems such as protein design and prediction problems [17, 10, 11]. [sent-19, score-0.378]

9 However, for large problems such as protein design, the large size of the LP prohibits the application of standard LP-solvers [17]. [sent-21, score-0.232]

10 To alleviate such scalability issues, convergent message passing algorithms have been introduced, which monotonically decrease the dual objective of the LP relaxation [5, 11, 9]. [sent-22, score-0.304]

11 The main advantage of these approaches lies in their ability to provide an upper bound on the problem and a certiﬁcate of optimality when upper bound is sufﬁciently close to the decoded solution. [sent-24, score-0.205]

12 First, we present an alternate representation of the MAP problem by decomposing the MRF into a ﬁnite-mixture of simple Bayes nets in which maximizing the likelihood of a special variable is equivalent to solving the MAP problem. [sent-26, score-0.184]

13 EM increases the lower bound on the MAP assignment monotonically until convergence and lends itself naturally to a graph-based message passing implementation. [sent-29, score-0.366]

14 In our experiments on some of the largest protein design problems [17, 11], we show that EM increases the lower bound on MAP rapidly. [sent-31, score-0.383]

15 This attribute of EM combined with the Max-Product LP algorithm (MPLP) [5, 11] that decreases the upper bound rapidly (as observed empirically) yields a new hybrid approach that provides quality-bounded solutions signiﬁcantly faster than previous approaches. [sent-32, score-0.145]

16 Although convergence to the global optima is not guaranteed, EM achieves an average solution quality within 95% of optimal for the protein design problems and is signiﬁcantly faster than both MPLP [5, 11] and max-product (MP) [8]. [sent-33, score-0.565]

17 An edge (i, j) between nodes xi and xj speciﬁes a function θij . [sent-43, score-0.258]

18 Z The MAP problem consists of ﬁnding the most probable assignment to all variables under p(x; θ). [sent-45, score-0.147]

19 This is equivalent to ﬁnding the complete assignment x that maximizes the function f (x; θ) = ij∈E θij (xi , xj ). [sent-46, score-0.221]

20 Before describing our formulation of the MAP problem, we ﬁrst describe the marginal polytope associated with the MAP problem and its outer bound based on LP relaxation. [sent-47, score-0.249]

21 Let µ denote a vector of marginal probabilities (also called mean parameters) for each node and edge of the MRF. [sent-50, score-0.125]

22 That is, µ includes µi (xi ) ∀i ∈ V and µij (xi , xj ) ∀(i, j) ∈ E. [sent-51, score-0.123]

23 To remedy this, LP relaxations are proposed that outer bound the polytope M(G). [sent-56, score-0.217]

24 Inner bound on the marginal polytope In the deﬁnition of marginal polytope M(G) (Eq. [sent-61, score-0.316]

25 Consider the following class of probability n distributions that factorize according to the variables of the MRF, p (x) = i=1 pi (xi ). [sent-63, score-0.104]

26 µi (xi ) = pi (xi ) , µij (xi , xj ) = pi (xi )pj (xj )} (5) where p is the distribution that factorizes according the variables in the MRF. [sent-67, score-0.301]

27 The optimization criterion for estimating MAP under this set is: max flb (x; θ) = x max µ·θ = µ∈Mlb (G) 1 max µ∈Mlb (G) 1 θij (xi , xj )µi (xi )µj (xj ) (6) ij∈E xi xj Let flb denote the optimizing value for the above formulation and f for the formulation in Eq. [sent-69, score-0.646]

28 The reason is that there always exists a maximizing µ ∈ M(G) that is integral and thus f corresponds to an integral assignment x which is also the MAP assignment [16]. [sent-73, score-0.316]

29 Since all the integral assignments are also allowed by the deﬁnition of the factored distribution p and Mlb (G), it follows that optimizing over Mlb (G) implies flb = f and yields the MAP estimate. [sent-74, score-0.119]

30 The reduced set of parameters Mlb (G) is non-convex because of the non-linear constraint µij (xi , xj ) = µi (xi )µj (xj ) [16]. [sent-78, score-0.123]

31 Then we present the Expectation Maximization (EM) algorithm [4] for this reformulation that monotonically increases the lower bound on the MAP assignment using likelihood maximization until convergence. [sent-82, score-0.302]

32 The key idea is to decompose the MRF into a mixture of simpler Bayes nets with many hidden variables – all the variables xi of the MRF. [sent-85, score-0.303]

33 To incorporate the potential functions θ’s of the MRF and achieve equivalence between ˆ the likelihood and MAP value, a special binary reward variable θ is introduced with its conditional distribution proportional to potentials θ. [sent-86, score-0.211]

34 It ˆ consists of a binary reward variable θ with its parents being the variables xi and xj . [sent-89, score-0.394]

35 The reason for calling it a reward variable will become clear later. [sent-90, score-0.162]

36 1(b) shows the equivalent mixture of Bayes nets for each of the four edges in this MRF. [sent-94, score-0.169]

37 The mixture random variable l, which is used to identify the Bayes nets, can take values from 1 to |E|, the number of edges in the graph, with uniform probability. [sent-95, score-0.117]

38 The parameters to estimate in this mixture are the marginal probabilities for each node xi . [sent-98, score-0.253]

39 This is done as follows: θl (xl1 , xl2 ) − θmin ˆ P (θ = 1|xl1 , xl2 , l) = (7) θmax − θmin where l indicates a particular Bayes net corresponding to an edge of the MRF, xl1 and xl2 are ˆ the parent variables of θ in this Bayes net and θl the potential function for this edge. [sent-106, score-0.156]

40 For this reason, θ is also called a reward variable. [sent-111, score-0.117]

41 Let us denote the variables (xl1 , xl2 ) by ˆ ˆ xl and let θxl denote the probability P (θ = 1|xl1 , xl2 , l), then the following theorem establishes the link between the likelihood and MAP value. [sent-115, score-0.342]

42 Let the CPT of binary reward variable θ be selected such that θxl ∝ θl (xl ). [sent-119, score-0.143]

43 Then p ˆ = 1; p) of observing the reward variable in the mixture of maximizing the likelihood L = P (θ Bayes nets is equivalent to the MAP estimation of the original MRF. [sent-120, score-0.368]

44 (9) xl xl For the complete mixture, it is given by P (l)Lp = l Lp = l 1 k ˆ θxl pl1 (xl1 ; p)pl2 (xl2 ; p). [sent-123, score-0.542]

45 l (10) xl ˆ Upon substituting the deﬁnition of θxl from Eq. [sent-124, score-0.271]

46 l xl Notice that the LHS of the above equation is the same as the optimization objective in Eq. [sent-126, score-0.271]

47 The goal is to achieve the higher reward θmax for each edge in the MRF. [sent-132, score-0.154]

48 4 EM Algorithm for MAP Estimation We now derive the EM algorithm [4] for maximizing the likelihood of the reward variable in the ˆ mixture of Bayes nets. [sent-134, score-0.29]

49 xi = argmaxxi bi (ˆi ) x ˆ EM : Return complete assignment x s. [sent-137, score-0.231]

50 xi = argmaxxi pi (ˆi ) x ˆ γk→i (xi ) δk→i (xi ) Ci k∈Ne(i) other variables are latent. [sent-139, score-0.237]

51 However, our experiments show that EM achieves an average solution quality within 95% of optimal for the standard MAP benchmark of protein design problems. [sent-141, score-0.467]

52 We also show that the update equations for EM can be implemented efﬁciently using graph-based message passing and are computationally much faster than other message-passing algorithms such as max-product [8] and MPLP [5]. [sent-142, score-0.207]

53 The parameters p to estimate are the marginal probabilities pi for each variable xi . [sent-145, score-0.249]

54 ˆ ˆ P (θ = 1, xl , l; p) log P (θ = 1, xl , l; p ) Q(p, p ) = (11) xl l The full joint is given by: 1ˆ ˆ ˆ P (θ = 1, xl , l; p) = P (θ = 1|xl , l)P (xl |l; p)P (l) = θx pl (xl ; p)pl2 (xl2 ; p) k l 1 1 We will omit the parameter p whenever the expression is unambiguous. [sent-148, score-1.118]

55 Taking the log, we get: ˆ log P (θ = 1, xl , l; p) = Ind. [sent-149, score-0.271]

56 The above expression can be easily maximized by maximizing for variables xi ’s individually. [sent-152, score-0.167]

57 The ﬁnal update equation for the marginals is given by: ˆ pi (xi ) j∈Ne(i) xj θxi xj pj (xj ) pi (xi ) = (15) Ci ˆ where Ci is the normalization constant for variable xi , and θx x is the normalized reward: i j ˆ P (θ = 1|xi , xj , l) = (θij (xi , xj ) − θmin )/(θmax − θmin ). [sent-153, score-0.794]

58 Algorithm 1 shows the graph-based message passing technique for both EM and MPLP. [sent-154, score-0.174]

59 0 90 100 -50 0 100 200 1700 3200 4700 Figure 2: a) Quality achieved by EM for all protein design instances. [sent-159, score-0.285]

60 Complexity analysis and implementation Consider a single message γ sent out by a node in MPLP. [sent-162, score-0.144]

61 EM’s simple message passing scheme also facilitates a very efﬁcient parallel implementation. [sent-169, score-0.198]

62 5 Experiments Our ﬁrst set of experiments are on the protein design problems (total of 97 instances), which are described in [17]. [sent-174, score-0.298]

63 We used the standard setting for MPLP – ﬁrst it is run with edge based clusters for 1000 iterations [5] and then clusters of size 3 are added to tighten the LP relaxation [11]. [sent-179, score-0.13]

64 The main purpose of our experiments is to show that EM achieves high solution quality, much more quickly than MPLP or max-product. [sent-187, score-0.105]

65 As reported in [11], for protein design problems solved exactly, mean running time was 9. [sent-189, score-0.298]

66 Empirically, EM achieves a solution quality within 95% of optimal on average much faster than MPLP. [sent-194, score-0.235]

67 The longest time EM took for any protein design instance was 352 sec. [sent-195, score-0.292]

68 0 4700 -50 0 100 200 1700 3200 4700 Figure 3: Quality comparison with MPLP for six of the largest protein design instances. [sent-210, score-0.292]

69 2(a) shows the solution quality EM achieves for all the instances in 1500 iterations. [sent-214, score-0.227]

70 Since a tight upper bound for all the problems except the instance ‘1fpo’ is known [11], we show the percentage of optimal EM achieves. [sent-215, score-0.153]

71 For the unsolved instance ‘1fpo’, we use the best known upper bound MPLP achieved in 10 hours (≈ 434). [sent-218, score-0.14]

72 As it is clear from this graph, EM achieves near-optimal solution quality for all the instances, within 95% on average. [sent-219, score-0.202]

73 To show empirically that MPLP decreases the upper bound quickly, we also show the percentage of solution quality EM achieves when instead of using the best known upper bound, we use the upper bound provided by MPLP after 1,000 iterations. [sent-220, score-0.446]

74 Even using this bound, EM achieves a quality within 91% on average (legend ‘AVG-UB’). [sent-223, score-0.169]

75 This further suggests that combining EM’s ability to rapidly increase the lower bound and MPLP’s ability to decrease the upper bound quickly, is a good way to create a hybrid approach that can provide provably near-optimal solutions much faster. [sent-224, score-0.17]

76 2(b) shows the quality achieved by EM and MPLP as a function of time for the largest instance ‘1fpo’. [sent-226, score-0.198]

77 This graph also shows that EM provides a much better solution quality, much faster than the MPLP. [sent-228, score-0.101]

78 3 shows the quality comparisons with time for some of the largest protein design instances. [sent-234, score-0.416]

79 Each graph title shows the instance name, N denotes the number of variables in the MRF, C denotes the number of potential functions θ or edges in the graph. [sent-235, score-0.199]

80 For all these problems, EM provided near-optimal solution quality and is signiﬁcantly faster than MPLP. [sent-236, score-0.19]

81 Table 1 shows this comparison for some of the largest protein design instances. [sent-238, score-0.292]

82 In this table, MP Quality denotes the best quality max-product achieved in 1500 iterations, Time/Iteration denotes the time and iteration number when it was achieved for the ﬁrst time. [sent-239, score-0.238]

83 Again, EM outperforms max-product by a signiﬁcant margin, achieving a higher solution quality much faster. [sent-240, score-0.157]

84 For some of the problems, such as ‘1tmy’ and ‘1or7’, the quality achieved by EM was much higher. [sent-241, score-0.144]

85 We also tested EM on the protein prediction problems [17, 11] which are simpler and sparser than the protein design problems. [sent-256, score-0.497]

86 The LP relaxations in this case can be solved even by the standard LP solvers unlike the protein design problems [17]. [sent-257, score-0.338]

87 3 434 Table 1: Solution quality and time (in sec. [sent-299, score-0.124]

88 The reason for this lies in the reward structure of the problem or the values θmin and θmax . [sent-307, score-0.155]

89 As shown earlier, EM works with the normalized rewards assigning 0 to the minimum reward −5770. [sent-311, score-0.117]

90 This dramatic scaling of the reward is particularly problematic for EM as shown below. [sent-314, score-0.136]

91 However the drastic scaling of the reward causes this minor difference to signiﬁcantly affect solution quality. [sent-320, score-0.15]

92 In contrast, for the largest protein design instance ‘1fpo’, the minimum reward is −59. [sent-322, score-0.436]

93 The difference between the maximum and minimum reward in these problems was less than 5, with a typical setting: θmin ≈ −2. [sent-327, score-0.15]

94 In contrast, our proposed formulation seeks to provide good quality solutions quickly by directly maximizing a lower bound on the MAP value over the inner bound on the marginal polytope. [sent-334, score-0.397]

95 The proposed Expectation Maximization (EM) algorithm increases this lower bound monotonically by likelihood maximization and is guaranteed to converge. [sent-335, score-0.194]

96 Furthermore, EM’s update equations can be efﬁciently implemented using a graph-based message passing paradigm. [sent-336, score-0.174]

97 Although EM may get stuck at a local optimum, our empirical results on the protein design dataset show that EM performs very well, producing solutions within 95% of optimal on average. [sent-337, score-0.306]

98 EM achieves such high solution quality signiﬁcantly faster than MPLP or max-product for many large protein design problems. [sent-338, score-0.5]

99 Our ongoing efforts include incorporating some of the advanced clustering techniques based on LP relaxation of the MAP problem with the EM method, and designing heuristics that can help EM avoid getting stuck in local optima for problems with large variations in the reward structure. [sent-342, score-0.269]

100 Fixing Max-Product: Convergent message passing algorithms for MAP LP-relaxations. [sent-373, score-0.174]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('em', 0.519), ('mplp', 0.497), ('xl', 0.271), ('lp', 0.23), ('protein', 0.199), ('mlb', 0.177), ('map', 0.157), ('quality', 0.124), ('xj', 0.123), ('mrf', 0.121), ('reward', 0.117), ('message', 0.107), ('xi', 0.098), ('assignment', 0.098), ('ij', 0.098), ('bayes', 0.092), ('flb', 0.088), ('nets', 0.078), ('polytope', 0.078), ('pi', 0.074), ('mixture', 0.067), ('passing', 0.067), ('design', 0.066), ('bound', 0.058), ('legend', 0.057), ('marginal', 0.051), ('relaxation', 0.047), ('achieves', 0.045), ('mp', 0.043), ('toussaint', 0.043), ('likelihood', 0.041), ('outer', 0.041), ('maximization', 0.041), ('relaxations', 0.04), ('speedup', 0.039), ('maximizing', 0.039), ('sontag', 0.038), ('node', 0.037), ('edge', 0.037), ('monotonically', 0.036), ('akshat', 0.035), ('argmaxxi', 0.035), ('shlomo', 0.035), ('titled', 0.035), ('graph', 0.035), ('upper', 0.035), ('pl', 0.034), ('problems', 0.033), ('solution', 0.033), ('faster', 0.033), ('optima', 0.032), ('net', 0.031), ('mapreduce', 0.031), ('globerson', 0.031), ('integral', 0.031), ('pj', 0.03), ('messages', 0.03), ('variables', 0.03), ('convergent', 0.028), ('max', 0.028), ('denotes', 0.028), ('reformulation', 0.028), ('potential', 0.027), ('quickly', 0.027), ('meltzer', 0.027), ('ub', 0.027), ('largest', 0.027), ('instance', 0.027), ('variable', 0.026), ('min', 0.026), ('planning', 0.026), ('instances', 0.025), ('edges', 0.024), ('amherst', 0.024), ('parallel', 0.024), ('send', 0.023), ('processor', 0.023), ('deg', 0.023), ('empirically', 0.023), ('clusters', 0.023), ('stuck', 0.022), ('formulation', 0.021), ('pairwise', 0.021), ('achieved', 0.02), ('ml', 0.02), ('probable', 0.019), ('mrfs', 0.019), ('lies', 0.019), ('inference', 0.019), ('solutions', 0.019), ('alleviate', 0.019), ('ci', 0.019), ('reason', 0.019), ('particularly', 0.019), ('iteration', 0.018), ('markov', 0.018), ('advanced', 0.018), ('neighbors', 0.018), ('cantly', 0.018), ('guaranteed', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999905 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization

Author: Akshat Kumar, Shlomo Zilberstein

2 0.39229834 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts

Author: Sashank J. Reddi, Sunita Sarawagi, Sundar Vishwanathan

Abstract: We propose a new LP relaxation for obtaining the MAP assignment of a binary MRF with pairwise potentials. Our relaxation is derived from reducing the MAP assignment problem to an instance of a recently proposed Bipartite Multi-cut problem where the LP relaxation is guaranteed to provide an O(log k) approximation where k is the number of vertices adjacent to non-submodular edges in the MRF. We then propose a combinatorial algorithm to efﬁciently solve the LP and also provide a lower bound by concurrently solving its dual to within an approximation. The algorithm is up to an order of magnitude faster and provides better MAP scores and bounds than the state of the art message passing algorithm of [1] that tightens the local marginal polytope with third-order marginal constraints. 1

3 0.1895756 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions

Author: Bo Thiesson, Chong Wang

Abstract: Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-speciﬁc data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is signiﬁcantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data. 1

4 0.15718842 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points

Author: Meritxell Vinyals, Jes\'us Cerquides, Alessandro Farinelli, Juan A. Rodríguez-aguilar

Abstract: We study worst-case bounds on the quality of any ﬁxed point assignment of the max-product algorithm for Markov Random Fields (MRF). We start providing a bound independent of the MRF structure and parameters. Afterwards, we show how this bound can be improved for MRFs with speciﬁc structures such as bipartite graphs or grids. Our results provide interesting insight into the behavior of max-product. For example, we prove that max-product provides very good results (at least 90% optimal) on MRFs with large variable-disjoint cycles1 . 1

5 0.14552335 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

Author: Armand Joulin, Jean Ponce, Francis R. Bach

Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classiﬁcation to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added beneﬁts in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efﬁcient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classiﬁcation show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1

6 0.13248296 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

7 0.098201528 284 nips-2010-Variational bounds for mixed-data factor analysis

8 0.083992168 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

9 0.074725583 235 nips-2010-Self-Paced Learning for Latent Variable Models

10 0.072108492 229 nips-2010-Reward Design via Online Gradient Ascent

11 0.066421106 283 nips-2010-Variational Inference over Combinatorial Spaces

12 0.064930938 101 nips-2010-Gaussian sampling by local perturbations

13 0.063138053 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs

14 0.062206231 108 nips-2010-Graph-Valued Regression

15 0.059106499 118 nips-2010-Implicit Differentiation by Perturbation

16 0.056296401 257 nips-2010-Structured Determinantal Point Processes

17 0.055769201 63 nips-2010-Distributed Dual Averaging In Networks

18 0.054795008 217 nips-2010-Probabilistic Multi-Task Feature Selection

19 0.05437097 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates

20 0.054276727 11 nips-2010-A POMDP Extension with Belief-dependent Rewards

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.183), (1, 0.014), (2, 0.051), (3, 0.026), (4, -0.107), (5, -0.018), (6, 0.004), (7, 0.029), (8, 0.156), (9, 0.058), (10, -0.202), (11, -0.083), (12, 0.175), (13, -0.009), (14, -0.073), (15, -0.102), (16, -0.168), (17, -0.027), (18, 0.13), (19, -0.019), (20, 0.116), (21, -0.09), (22, -0.101), (23, 0.221), (24, 0.078), (25, -0.184), (26, -0.185), (27, 0.146), (28, -0.155), (29, -0.045), (30, -0.093), (31, 0.011), (32, -0.016), (33, 0.089), (34, 0.258), (35, -0.122), (36, 0.088), (37, 0.064), (38, -0.027), (39, -0.034), (40, 0.028), (41, 0.013), (42, 0.024), (43, 0.001), (44, 0.015), (45, -0.045), (46, 0.02), (47, -0.013), (48, 0.052), (49, 0.095)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96507806 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization

Author: Akshat Kumar, Shlomo Zilberstein

2 0.85639358 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts

Author: Sashank J. Reddi, Sunita Sarawagi, Sundar Vishwanathan

3 0.72774774 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points

Author: Meritxell Vinyals, Jes\'us Cerquides, Alessandro Farinelli, Juan A. Rodríguez-aguilar

4 0.69466722 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions

Author: Bo Thiesson, Chong Wang

5 0.43943694 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

Author: David Sontag, Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: The problem of learning to predict structured labels is of key importance in many applications. However, for general graph structure both learning and inference are intractable. Here we show that it is possible to circumvent this difﬁculty when the distribution of training examples is rich enough, via a method similar in spirit to pseudo-likelihood. We show that our new method achieves consistency, and illustrate empirically that it indeed approaches the performance of exact methods when sufﬁciently large training sets are used. Many prediction problems in machine learning applications are structured prediction tasks. For example, in protein folding we are given a protein sequence and the goal is to predict the protein’s native structure [14]. In parsing for natural language processing, we are given a sentence and the goal is to predict the most likely parse tree [2]. In these and many other applications, we can formalize the structured prediction problem as taking an input x (e.g., primary sequence, sentence) and predicting ˆ y (e.g., structure, parse) according to y = arg maxy∈Y θ · φ(x, y ), where φ(x, y) is a function that ˆ maps any input and a candidate assignment to a feature vector, Y denotes the space of all possible assignments to the vector y, and θ is a weight vector to be learned. This paper addresses the problem of learning structured prediction models from data. In particular, given a set of labeled examples {(xm , y m )}M , our goal is to ﬁnd a vector θ such that for each m=1 example m, y m = arg maxy∈Y θ · φ(xm , y), i.e. one which separates the training data. For many structured prediction models, maximization over Y is computationally intractable. This makes it difﬁcult to apply previous algorithms for learning structured prediction models, such as structured perceptron [2], stochastic subgradient [10], and cutting-plane algorithms [5], which require making a prediction at every iteration (equivalent to repeatedly solving an integer linear program). Given training data, we can consider the space of parameters Θ that separate the data. This space can be deﬁned by the intersection of a large number of linear inequalities. A recent approach to getting around the hardness of prediction is to use linear programming (LP) relaxations to approximate the maximization over Y [4, 6, 9]. However, separation with respect to a relaxation places stronger constraints on the parameters. The target solution, an integral vertex in the LP, must now distinguish itself also from possible fractional vertexes that arise due to the relaxation. The relaxations can therefore be understood as optimizing over an inner bound of Θ. This set may be empty even if the training data is separable with exact inference [6]. Another obstacle to using LP relaxations for learning is that solving the LPs can be very slow. In this paper we ask whether it is possible to learn while avoiding inference altogether. We propose a new learning algorithm, inspired by pseudo-likelihood [1], that optimizes over an outer bound of Θ. Learning involves optimizing over only a small number of constraints per data point, and thus can be performed quickly, even for complex structured prediction models. We show that, if the data available for learning is “nice”, this algorithm is consistent, i.e. it will ﬁnd some θ ∈ Θ. This is an example of how having the right data can circumvent the hardness of learning for structured prediction. 1 We also investigate the limitations of the proposed method. We show that the problem of even deciding whether a given data set is separable is NP-hard, and thus learning in a strict sense is no easier than prediction. Thus, we should not expect for our algorithm, or any other polynomial time algorithm, to always succeed at learning from an arbitrary ﬁnite data set. To our knowledge, this is the ﬁrst result characterizing the hardness of exact learning for structured prediction. Finally, we show empirically that our algorithm allows us to successfully learn the parameters for both multi-label prediction and protein side-chain placement. The performance of the algorithm is improved as more data becomes available, as our theoretical results anticipate. 1 Pseudo-Max method We consider the general structured prediction problem. The input space is denoted by X and the set of all possible assignments by Y. Each y ∈ Y corresponds to n variables y1 , . . . , yn , each with k possible states. The classiﬁer uses a (given) function φ(x, y) : X , Y → Rd and (learned) weights θ ∈ Rd , and is deﬁned as y(x; θ) = arg maxy∈Y f (ˆ ; x, θ) where f is the discriminant function y ˆ f (y; x, θ) = θ · φ(x, y). Our analysis will focus on functions φ whose scope is limited to small sets of the yi variables, but for now we keep the discussion general. Given a set of labeled examples {(xm , y m )}M , the goal of the typical learning problem is to ﬁnd m=1 weights θ that correctly classify the training examples. Consider ﬁrst the separable case. Deﬁne the set of separating weight vectors, Θ = θ | ∀m, y ∈ Y, f (y m ; xm , θ) ≥ f (y; xm , θ)+e(y, y m ) . e is a loss function (e.g., zero-one or Hamming) such that e(y m , y m ) = 0 and e(y, y m ) > 0 for y = y m , which serves to rule out the trivial solution θ = 0.1 The space Θ is deﬁned by exponentially many constraints per example, one for each competing assignment. In this work we consider a much simpler set of constraints where, for each example, we only consider the competing assignments obtained by modifying a single label yi , while ﬁxing the other labels to their value at y m . The pseudo-max set, which is an outer bound on Θ, is given by Here ym −i m Θps = θ | ∀m, i, yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) + e(yi , yi ) . −i denotes the label y m (1) without the assignment to yi . When the data is not separable, Θ will be the empty set. Instead, we may choose to minimize the hinge loss, (θ) = m maxy f (y; xm , θ) − f (y m ; xm , θ) + e(y, y m ) , which can be shown to be an upper bound on the training error [13]. When the data is separable, minθ (θ) = 0. Note that regularization may be added to this objective. The corresponding pseudo-max objective replaces the maximization over all of y with maximization over a single variable yi while ﬁxing the other labels to their value at y m :2,3 M ps (θ) n = m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) . −i yi Analogous to before, we have minθ ps (θ) (2) = 0 if and only if θ ∈ Θps . The objective in Eq. 2 is similar in spirit to pseudo-likelihood objectives used for maximum likelihood estimation of parameters of Markov random ﬁelds (MRFs) [1]. The pseudo-likelihood estimate is provably consistent when the data generating distribution is a MRF of the same structure as used in the pseudo-likelihood objective. However, our setting is different since we only get to view the maximizing assignment of the MRF rather than samples from it. Thus, a particular x will always be paired with the same y rather than samples y drawn from the conditional distribution p(y|x; θ). The pseudo-max constraints in Eq. 1 are also related to cutting plane approaches to inference [4, 5]. In the latter, the learning problem is solved by repeatedly looking for assignments that violate the separability constraint (or its hinge version). Our constraints can be viewed as using a very small 1 An alternative formulation, which we use in the next section, is to break the symmetry by having part of the input not be multiplied by any weight. This will also rule out the trivial solution θ = 0. P 2 It is possible to use maxi instead of i , and some of our consistency results will still hold. 3 The pseudo-max approach is markedly different from a learning method which predicts each label yi independently, since the objective considers all i simultaneously (both at learning and test time). 2 x2 0.2 J ∗ + x1 = 0 y = (0, 1) y = (1, 1) g(J12) x2 = 0 x1 J ∗ + x1 + x2 = 0 y = (0, 0) c1=0 c1=1 c1= 1 0.15 0.1 J + x2 = 0 ∗ 0.05 y = (1, 0) x1 = 0 0 1 0.5 0 J 0.5 1 Figure 1: Illustrations for a model with two variables. Left: Partitioning of X induced by conﬁgurations y(x) for some J ∗ > 0. Blue lines carve out the exact regions. Red lines denote the pseudo-max constraints that hold with equality. Pseudo-max does not obtain the diagonal constraint coming from comparing conﬁgurations y = (1, 1) and (0, 0), since these differ by more than one coordinate. Right: One strictly-convex component of the ps (J ) function (see Eq. 9). The function is shown for different values of c1 , the mean of the x1 variable. subset of assignments for the set of candidate constraint violators. We also note that when exact maximization over the discriminant function f (y; x, θ) is hard, the standard cutting plane algorithm cannot be employed since it is infeasible to ﬁnd a violated constraint. For the pseudo-max objective, ﬁnding a constraint violation is simple and linear in the number of variables.4 It is easy to see (as will be elaborated on next) that the pseudo-max method does not in general yield a consistent estimate of θ, even in the separable case. However, as we show, consistency can be shown to be achieved under particular assumptions on the data generating distribution p(x). 2 Consistency of the Pseudo-Max method In this section we show that if the feature generating distribution p(x) satisﬁes particular assumptions, then the pseudo-max approach yields a consistent estimate. In other words, if the training data is of the form {(xm , y(xm ; θ ∗ ))}M for some true parameter vector θ ∗ , then as M → ∞ the m=1 minimum of the pseudo-max objective will converge to θ ∗ (up to equivalence transformations). The section is organized as follows. First, we provide intuition for the consistency results by considering a model with only two variables. Then, in Sec. 2.1, we show that any parameter θ ∗ can be identiﬁed to within arbitrary accuracy by choosing a particular training set (i.e., choice of xm ). This in itself proves consistency, as long as there is a non-zero probability of sampling this set. In Sec. 2.2 we give a more direct proof of consistency by using strict convexity arguments. For ease of presentation, we shall work with a simpliﬁed instance of the structured learning setting. We focus on binary variables, yi ∈ {0, 1}, and consider discriminant functions corresponding to Ising models, a special case of pairwise MRFs (J denotes the vector of “interaction” parameters): f (y; x, J ) = ij∈E Jij yi yj + i yi xi (3) The singleton potential for variable yi is yi xi and is not dependent on the model parameters. We could have instead used Ji yi xi , which would be more standard. However, this would make the parameter vector J invariant to scaling, complicating the identiﬁability analysis. In the consistency analysis we will assume that the data is generated using a true parameter vector J ∗ . We will show that as the data size goes to inﬁnity, minimization of ps (J ) yields J ∗ . We begin with an illustrative analysis of the pseudo-max constraints for a model with only two variables, i.e. f (y; x, J) = Jy1 y2 + y1 x1 + y2 x2 . The purpose of the analysis is to demonstrate general principles for when pseudo-max constraints may succeed or fail. Assume that training samples are generated via y(x) = argmaxy f (y; x, J ∗ ). We can partition the input space X into four regions, ˆ ˆ {x ∈ X : y(x) = y } for each of the four conﬁgurations y , shown in Fig. 1 (left). The blue lines outline the exact decision boundaries of f (y; x, J ∗ ), with the lines being given by the constraints 4 The methods differ substantially in the non-separable setting where we minimize ps (θ), using a slack variable for every node and example, rather than just one slack variable per example as in (θ). 3 in Θ that hold with equality. The red lines denote the pseudo-max constraints in Θps that hold with equality. For x such that y(x) = (1, 0) or (0, 1), the pseudo-max and exact constraints are identical. We can identify J ∗ by obtaining samples x = (x1 , x2 ) that explore both sides of one of the decision boundaries that depends on J ∗ . The pseudo-max constraints will fail to identify J ∗ if the samples do not sufﬁciently explore the transitions between y = (0, 1) and y = (1, 1) or between y = (1, 0) and y = (1, 1). This can happen, for example, when the input samples are dependent, giving only rise to the conﬁgurations y = (0, 0) and y = (1, 1). For points labeled (1, 1) around the decision line J ∗ + x1 + x2 = 0, pseudo-max can only tell that they respect J ∗ + x1 ≥ 0 and J ∗ + x2 ≥ 0 (dashed red lines), or x1 ≤ 0 and x2 ≤ 0 for points labeled (0, 0). Only constraints that depend on the parameter are effective for learning. For pseudo-max to be able to identify J ∗ , the input samples must be continuous, densely populating the two parameter dependent decision lines that pseudo-max can use. The two point sets in the ﬁgure illustrate good and bad input distributions for pseudo-max. The diagonal set would work well with the exact constraints but badly with pseudo-max, and the difference can be arbitrarily large. However, the input distribution on the right, populating the J ∗ + x2 = 0 decision line, would permit pseudo-max to identify J ∗ . 2.1 Identiﬁability of True Parameters In this section, we show that it is possible to approximately identify the true model parameters, up to model equivalence, using the pseudo-max constraints and a carefully chosen linear number of data points. Consider the learning problem for structured prediction deﬁned on a ﬁxed graph G = (V, E) where the parameters to be learned are pairwise potential functions θij (yi , yj ) for ij ∈ E and single node ﬁelds θi (yi ) for i ∈ V . We consider discriminant functions of the form f (y; x, θ) = ij∈E θij (yi , yj ) + i θi (yi ) + i xi (yi ), (4) where the input space X = R|V |k speciﬁes the single node potentials. Without loss of generality, we remove the additional degrees of freedom in θ by restricting it to be in a canonical form: θ ∈ Θcan if for all edges θij (yi , yj ) = 0 whenever yi = 0 or yj = 0, and if for all nodes, θi (yi ) = 0 when yi = 0. As a result, assuming the training set comes from a model in this class, and the input ﬁelds xi (yi ) exercise the discriminant function appropriately, we can hope to identify θ ∗ ∈ Θcan . Indeed, we show that, for some data sets, the pseudo-max constraints are sufﬁcient to identify θ ∗ . Let Θps ({y m , xm }) be the set of parameters that satisfy the pseudo-max classiﬁcation constraints m Θps ({y m , xm }) = θ | ∀m, i, yi = yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) . −i (5) m e(yi , yi ), For simplicity we omit the margin losses since the input ﬁelds xi (yi ) already sufﬁce to rule out the trivial solution θ = 0. Proposition 2.1. For any θ ∗ ∈ Θcan , there is a set of 2|V |(k − 1) + 2|E|(k − 1)2 examples, {xm , y(xm ; θ ∗ )}, such that any pseudo-max consistent θ ∈ Θps ({y m , xm }) ∩ Θcan is arbitrarily close to θ ∗ . The proof is given in the supplementary material. To illustrate the key ideas, we consider the simpler binary discriminant function discussed in Eq. 3. Note that the binary model is already in the canonical form since Jij yi yj = 0 whenever yi = 0 or yj = 0. For any ij ∈ E, we show how to choose two input examples x1 and x2 such that any J consistent with the pseudo-max constraints for these ∗ ∗ two examples will have Jij ∈ [Jij − , Jij + ]. Repeating this for all of the edge parameters then gives the complete set of examples. The input examples we need for this will depend on J ∗ . For the ﬁrst example, we set the input ﬁelds for all neighbors of i (except j) in such a way that ∗ we force the corresponding labels to be zero. More formally, we set x1 < −|N (k)| maxl |Jkl | for k 1 k ∈ N (i)\j, resulting in yk = 0, where y 1 = y(x1 ). In contrast, we set x1 to a large value, e.g. j ∗ 1 ∗ x1 > |N (j)| maxl |Jjl |, so that yj = 1. Finally, for node i, we set x1 = −Jij + so as to obtain a j i 1 slight preference for yi = 1. All other input ﬁelds can be set arbitrarily. As a result, the pseudo-max constraints pertaining to node i are f (y 1 ; x1 , J ) ≥ f (y 1 , yi ; x1 , J ) for yi = 0, 1. By taking into −i 1 account the label assignments for yi and its neighbors, and by removing terms that are the same on both sides of the equation, we get Jij + x1 + x1 ≥ Jij yi + yi x1 + x1 , which, for yi = 0, implies i j i j ∗ that Jij + x1 ≥ 0 or Jij − Jij + ≥ 0. The second example x2 differs only in terms of the input i ∗ 2 ∗ ﬁeld for i. In particular, we set x2 = −Jij − so that yi = 0. This gives Jij ≤ Jij + , as desired. i 4 2.2 Consistency via Strict Convexity In this section we prove the consistency of the pseudo-max approach by showing that it corresponds to minimizing a strictly convex function. Our proof only requires that p(x) be non-zero for all x ∈ Rn (a simple example being a multi-variate Gaussian) and that J ∗ is ﬁnite. We use a discriminant function as in Eq. 3. Now, assume the input points xm are distributed according to p(x) and that y m are obtained via y m = arg maxy f (y; xm , J ∗ ). We can write the ps (J ) objective for ﬁnite data, and its limit when M → ∞, compactly as: 1 m m = max (yi − yi ) xm + Jki yk ps (J ) i M m i yi k∈N (i) p(x) max (yi − yi (x)) xi + → yi i Jki yk (x) dx (6) k∈N (i) ∗ where yi (x) is the label of i for input x when using parameters J . Starting from the above, consider the terms separately for each i. We partition the integral over x ∈ Rn into exclusive regions according to the predicted labels of the neighbors of i (given x). Deﬁne Sij = {x : yj (x) = 1 and yk (x) = 0 for k ∈ N (i)\j}. Eq. 6 can then be written as ps (J ) = gi ({Jik }k∈N (i) ) + ˆ i gik (Jik ) , (7) k∈N (i) where gik (Jik ) = x∈Sik p(x) maxyi [(yi −yi (x))(xi +Jik )]dx and gi ({Jik }k∈N (i) ) contains all of ˆ the remaining terms, i.e. where either zero or more than one neighbor is set to one. The function gi ˆ is convex in J since it is a sum of integrals over convex functions. We proceed to show that gik (Jik ) is strictly convex for all choices of i and k ∈ N (i). This will show that ps (J ) is strictly convex since it is a sum over functions strictly convex in each one of the variables in J . For all values xi ∈ (−∞, ∞) there is some x in Sij . This is because for any ﬁnite xi and ﬁnite J ∗ , the other xj ’s can be chosen so as to give the y conﬁguration corresponding to Sij . Now, since p(x) has full support, we have P (Sij ) > 0 and p(x) > 0 for any x in Sij . As a result, this also holds for the marginal pi (xi |Sij ) over xi within Sij . After some algebra, we obtain: gij (Jij ) = P (Sij ) ∞ p(x)yi (x)(xi + Jij )dx pi (xi |Sij ) max [0, xi + Jij ] dxi − −∞ x∈Sij The integral over the yi (x)(xi + Jij ) expression just adds a linear term to gij (Jij ). The relevant remaining term is (for brevity we drop P (Sij ), a strictly positive constant, and the ij index): h(J) = ∞ pi (xi |Sij ) max [0, xi + J] dxi = −∞ ∞ ˆ pi (xi |Sij )h(xi , J)dxi (8) −∞ ˆ ˆ where we deﬁne h(xi , J) = max [0, xi + J]. Note that h(J) is convex since h(xi , J) is convex in J for all xi . We want to show that h(J) is strictly convex. Consider J < J and α ∈ (0, 1) and deﬁne ˆ ˆ the interval I = [−J, −αJ − (1 − α)J ]. For xi ∈ I it holds that: αh(xi , J) + (1 − α)h(xi , J ) > ˆ i , αJ + (1 − α)J ) (since the ﬁrst term is strictly positive and the rest are zero). For all other x, h(x ˆ this inequality holds but is not necessarily strict (since h is always convex in J). We thus have after integrating over x that αh(J) + (1 − α)h(J ) > h(αJ + (1 − α)J ), implying h is strictly convex, as required. Note that we used the fact that p(x) has full support when integrating over I. The function ps (J ) is thus a sum of strictly convex functions in all its variables (namely g(Jik )) plus other convex functions of J , hence strictly convex. We can now proceed to show consistency. By strict convexity, the pseudo-max objective is minimized at a unique point J . Since we know that ps (J ∗ ) = 0 and zero is a lower bound on the value of ps (J ), it follows that J ∗ is the unique minimizer. Thus we have that as M → ∞, the minimizer of the pseudo-max objective is the true parameter vector, and thus we have consistency. As an example, consider the case of two variables y1 , y2 , with x1 and x2 distributed according to ∗ N (c1 , 1), N (0, 1) respectively. Furthermore assume J12 = 0. Then simple direct calculation yields: 2 2 2 c1 + J12 −c1 1 1 √ (9) e−x /2 dx − √ e−c1 /2 + √ e−(J12 +c1 ) /2 2π 2π 2π −J12 −c1 which is indeed a strictly convex function that is minimized at J = 0 (see Fig. 1 for an illustration). g(J12 ) = 5 3 Hardness of Structured Learning Most structured prediction learning algorithms use some form of inference as a subroutine. However, the corresponding prediction task is generally NP-hard. For example, maximizing the discriminant function deﬁned in Eq. 3 is equivalent to solving Max-Cut, which is known to be NP-hard. This raises the question of whether it is possible to bypass prediction during learning. Although prediction may be intractable for arbitrary MRFs, what does this say about the difﬁculty of learning with a polynomial number of data points? In this section, we show that the problem of deciding whether there exists a parameter vector that separates the training data is NP-hard. Put in the context of the positive results in this paper, these hardness results show that, although in some cases the pseudo-max constraints yield a consistent estimate, we cannot hope for a certiﬁcate of optimality. Put differently, although the pseudo-max constraints in the separable case always give an outer bound on Θ (and may even be a single point), Θ could be the empty set – and we would never know the difference. Theorem 3.1. Given labeled examples {(xm , y m )}M for a ﬁxed but arbitrary graph G, it is m=1 NP-hard to decide whether there exists parameters θ such that ∀m, y m = arg maxy f (y; xm , θ). Proof. Any parameters θ have an equivalent parameterization in canonical form (see section Sec. 2.1, also supplementary). Thus, the examples will be separable if and only if they are separable by some θ ∈ Θcan . We reduce from unweighted Max-Cut. The Max-Cut problem is to decide, given an undirected graph G, whether there exists a cut of at least K edges. Let G be the same graph as G, with k = 3 states per variable. We construct a small set of examples where a parameter vector will exist that separates the data if and only if there is no cut of K or more edges in G. Let θ be parameters in canonical form equivalent to θij (yi , yj ) = 1 if (yi , yj ) ∈ {(1, 2), (2, 1)}, 0 if yi = yj , and −n2 if (yi , yj ) ∈ {(1, 3), (2, 3), (3, 1), (3, 2)}. We ﬁrst construct 4n + 8|E| examples, using the technique described in Sec. 2.1 (also supplementary material), which when restricted to the space Θcan , constrain the parameters to equal θ. We then use one more example (xm , y m ) where y m = 3 (every node is in state 3) and, for all i, xm (3) = K−1 and xm (1) = xm (2) = 0. The ﬁrst i i i n two states encode the original Max-Cut instance, while the third state is used to construct a labeling y m that has value equal to K − 1, and is otherwise not used. Let K ∗ be the value of the maximum cut in G. If in any assignment to the last example there is a variable taking the state 3 and another variable taking the state 1 or 2, then the assignment’s value will be at most K ∗ − n2 , which is less than zero. By construction, the 3 assignment has value K − 1. Thus, the optimal assignment must either be 3 with value K − 1, or some combination of states 1 and 2, which has value at most K ∗ . If K ∗ > K − 1 then 3 is not optimal and the examples are not separable. If K ∗ ≤ K − 1, the examples are separable. This result illustrates the potential difﬁculty of learning in worst-case graphs. Nonetheless, many problems have a more restricted dependence on the input. For example, in computer vision, edge potentials may depend only on the difference in color between two adjacent pixels. Our results do not preclude positive results of learnability in such restricted settings. By establishing hardness of learning, we also close the open problem of relating hardness of inference and learning in structured prediction. If inference problems can be solved in polynomial time, then so can learning (using, e.g., structured perceptron). Thus, when learning is hard, inference must be hard as well. 4 Experiments To evaluate our learning algorithm, we test its performance on both synthetic and real-world datasets. We show that, as the number of training samples grows, the accuracy of the pseudo-max method improves and its speed-up gain over competing algorithms increases. Our learning algorithm corresponds to solving the following, where we add L2 regularization and use a scaled 0-1 loss, m m e(yi , yi ) = 1{yi = yi }/nm (nm is the number of labels in example m): min θ C m nm M nm m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) + θ −i yi 2 . (10) We will compare the pseudo-max method with learning using structural SVMs, both with exact inference and LP relaxations [see, e.g., 4]. We use exact inference for prediction at test time. 6 (a) Synthetic (b) Reuters 0.4 exact LP−relaxation pseudo−max 0.15 Test error Test error 0.2 0.1 0.05 0 1 10 2 10 0.2 0.1 0 1 10 3 10 Train size exact LP−relaxation pseudo−max 0.3 2 10 3 10 4 10 Train size Figure 2: Test error as a function of train size for various algorithms. Subﬁgure (a) shows results for a synthetic setting, while (b) shows performance on the Reuters data. In the synthetic setting we use the discriminant function f (y; x, θ) = ij∈E θij (yi , yj ) + xi θi (yi ), which is similar to Eq. 4. We take a fully connected graph over n = 10 binary labels. i For a weight vector θ ∗ (sampled once, uniformly in the range [−1, 1], and used for all train/test sets) we generate train and test instances by sampling xm uniformly in the range [−5, 5] and then computing the optimal labels y m = arg maxy∈Y f (y; xm , θ ∗ ). We generate train sets of increasing size (M = {10, 50, 100, 500, 1000, 5000}), run the learning algorithms, and measure the test error for the learned weights (with 1000 test samples). For each train size we average the test error over 10 repeats of sampling and training. Fig. 2(a) shows a comparison of the test error for the three learning algorithms. For small numbers of training examples, the test error of pseudo-max is larger than that of the other algorithms. However, as the train size grows, the error converges to that of exact learning, as our consistency results predict. We also test the performance of our algorithm on a multi-label document classiﬁcation task from the Reuters dataset [7]. The data consists of M = 23149 training samples, and we use a reduction of the dataset to the 5 most frequent labels. The 5 label variables form a fully connected pairwise graph structure (see [4] for a similar setting). We use random subsamples of increasing size from the train set to learn the parameters, and then measure the test error using 20000 additional samples. For each sample size and learning algorithm, we optimize the trade-off parameter C using 30% of the training data as a hold-out set. Fig. 2(b) shows that for the large data regime the performance of pseudo-max learning gets close to that of the other methods. However, unlike the synthetic setting there is still a small gap, even after seeing the entire train set. This could be because the full dataset is not yet large enough to be in the consistent regime (note that exact learning has not ﬂattened either), or because the consistency conditions are not fully satisﬁed: the data might be non-separable or the support of the input distribution p(x) may be partial. We next apply our method to the problem of learning the energy function for protein side-chain placement, mirroring the learning setup of [14], where the authors train a conditional random ﬁeld (CRF) using tree-reweighted belief propagation to maximize a lower bound on the likelihood.5 The prediction problem for side-chain placement corresponds to ﬁnding the most likely assignment in a pairwise MRF, and ﬁts naturally into our learning framework. There are only 8 parameters to be learned, corresponding to a reweighting of known energy terms. The dataset consists of 275 proteins, where each MRF has several hundred variables (one per residue of the protein) and each variable has on average 20 states. For prediction we use CPLEX’s ILP solver. Fig. 3 shows a comparison of the pseudo-max method and a cutting-plane algorithm which uses an LP relaxation, solved with CPLEX, for ﬁnding violated constraints.6 We generate training sets of increasing size (M = {10, 50, 100, 274}), and measure the test error for the learned weights on the remaining examples.7 For M = 10, 50, 100 we average the test error over 3 random train/test splits, whereas for M = 274 we do 1-fold cross validation. We use C = 1 for both algorithms. 5 The authors’ data and results are available from: http://cyanover.fhcrc.org/recomb-2007/ We signiﬁcantly optimized the cutting-plane algorithm, e.g. including a large number of initial cuttingplanes and restricting the weight vector to be positive (which we know to hold at optimality). 7 Speciﬁcally, for each protein we compute the fraction of correctly predicted χ1 and χ2 angles for all residues (except when trivial, e.g. just 1 state). Then, we compute the median of this value across all proteins. 6 7 Time to train (minutes) Test error (χ1 and χ2) 0.27 0.265 pseudo−max LP−relaxation Soft rep 0.26 0.255 0.25 0 50 100 150 200 Train size 250 250 200 pseudo−max LP−relaxation 150 100 50 0 0 50 100 150 200 Train size 250 Figure 3: Training time (for one train/test split) and test error as a function of train size for both the pseudomax method and a cutting-plane algorithm which uses a LP relaxation for inference, applied to the problem of learning the energy function for protein side-chain placement. The pseudo-max method obtains better accuracy than both the LP relaxation and HCRF (given roughly ﬁve times more data) for a fraction of the training time. The original weights (“Soft rep” [3]) used for this energy function have 26.7% error across all 275 proteins. The best previously reported parameters, learned in [14] using a Hidden CRF, obtain 25.6% error (their training set included 55 of these 275 proteins, so this is an optimistic estimate). To get a sense of the difﬁculty of this learning task, we also tried a random positive weight vector, uniformly sampled from the range [0, 1], obtaining an error of 34.9% (results would be much worse if we allowed the weights to be negative). Training using pseudo-max with 50 examples, we learn parameters in under a minute that give better accuracy than the HCRF. The speed-up of training with pseudo-max (using CPLEX’s QP solver) versus cutting-plane is striking. For example, for M = 10, pseudo-max takes only 3 seconds, a 1000-fold speedup. Unfortunately the cutting-plane algorithm took a prohibitive amount of time to be able to run on the larger training sets. Since the data used in learning for protein side-chain placement is both highly non-separable and relatively little, these positive results illustrate the potential wide-spread applicability of the pseudo-max method. 5 Discussion The key idea of our method is to ﬁnd parameters that prefer the true assignment y m over assignments that differ from it in only one variable, in contrast to all other assignments. Perhaps surprisingly, this weak requirement is sufﬁcient to achieve consistency given a rich enough input distribution. One extension of our approach is to add constraints for assignments that differ from y m in more than one variable. This would tighten the outer bound on Θ and possibly result in improved performance, but would also increase computational complexity. We could also add such competing assignments via a cutting-plane scheme so that optimization is performed only over a subset of these constraints. Our work raises a number of important open problems: It would be interesting to derive generalization bounds to understand the convergence rate of our method, as well as understanding the effect of the distribution p(x) on these rates. The distribution p(x) needs to have two key properties. On the one hand, it needs to explore the space Y in the sense that a sufﬁcient number of labels need to be obtained as the correct label for the true parameters (this is indeed used in our consistency proofs). On the other hand, p(x) needs to be sufﬁciently sensitive close to the decision boundaries so that the true parameters can be inferred. We expect that generalization analysis will depend on these two properties of p(x). Note that [11] studied active learning schemes for structured data and may be relevant in the current context. How should one apply this learning algorithm to non-separable data sets? We suggested one approach, based on using a hinge loss for each of the pseudo constraints. One question in this context is, how resilient is this learning algorithm to label noise? Recent work has analyzed the sensitivity of pseudo-likelihood methods to model mis-speciﬁcation [8], and it would be interesting to perform a similar analysis here. Also, is it possible to give any guarantees for the empirical and expected risks (with respect to exact inference) obtained by outer bound learning versus exact learning? Finally, our algorithm demonstrates a phenomenon where more data can make computation easier. Such a scenario was recently analyzed in the context of supervised learning [12], and it would be interesting to combine the approaches. Acknowledgments: We thank Chen Yanover for his assistance with the protein data. This work was supported by BSF grant 2008303 and a Google Research Grant. D.S. was supported by a Google PhD Fellowship. 8 References [1] J. Besag. The analysis of non-lattice data. The Statistician, 24:179–195, 1975. [2] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002. [3] G. Dantas, C. Corrent, S. L. Reichow, J. J. Havranek, Z. M. Eletr, N. G. Isern, B. Kuhlman, G. Varani, E. A. Merritt, and D. Baker. High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. Journal of Molecular Biology, 366(4):1209 – 1221, 2007. [4] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning 25, pages 304–311. ACM, 2008. [5] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009. [6] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Advances in Neural Information Processing Systems 20, pages 785–792. 2008. [7] D. Lewis, , Y. Yang, T. Rose, and F. Li. RCV1: a new benchmark collection for text categorization research. JMLR, 5:361–397, 2004. [8] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th international conference on Machine learning, pages 584–591, New York, NY, USA, 2008. ACM Press. [9] A. F. T. Martins, N. A. Smith, and E. P. Xing. Polyhedral outer approximations with application to natural language parsing. In ICML 26, pages 713–720, 2009. [10] N. Ratliff, J. A. D. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007. [11] D. Roth and K. Small. Margin-based active learning for structured output spaces. In Proc. of the European Conference on Machine Learning (ECML). Springer, September 2006. [12] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th international conference on Machine learning, pages 928–935. ACM, 2008. [13] B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. In Advances in Neural Information Processing Systems 16, pages 25–32. 2004. [14] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008. 9

6 0.41257322 283 nips-2010-Variational Inference over Combinatorial Spaces

7 0.39458019 284 nips-2010-Variational bounds for mixed-data factor analysis

8 0.3719427 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

9 0.36572653 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

10 0.34128156 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems

11 0.3380816 215 nips-2010-Probabilistic Deterministic Infinite Automata

12 0.32147151 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs

13 0.31211692 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates

14 0.30852965 102 nips-2010-Generalized roof duality and bisubmodular functions

15 0.29146445 214 nips-2010-Probabilistic Belief Revision with Structural Constraints

16 0.2872116 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures

17 0.28652379 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models

18 0.28320235 234 nips-2010-Segmentation as Maximum-Weight Independent Set

19 0.28284577 11 nips-2010-A POMDP Extension with Belief-dependent Rewards

20 0.25217244 235 nips-2010-Self-Paced Learning for Latent Variable Models

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.083), (13, 0.028), (17, 0.011), (27, 0.098), (30, 0.09), (35, 0.014), (37, 0.01), (45, 0.228), (50, 0.074), (52, 0.023), (60, 0.163), (72, 0.017), (77, 0.029), (78, 0.013), (90, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93487966 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification

Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee

Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classiﬁcation. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from ﬁnite sampling effects, and ﬁnd an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1

2 0.93273479 62 nips-2010-Discriminative Clustering by Regularized Information Maximization

Author: Andreas Krause, Pietro Perona, Ryan G. Gomes

Abstract: Is there a principled way to learn a probabilistic discriminative classiﬁer from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classiﬁer. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classiﬁer complexity. The approach can ﬂexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1

3 0.92267305 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts

Author: Sashank J. Reddi, Sunita Sarawagi, Sundar Vishwanathan

4 0.91490918 223 nips-2010-Rates of convergence for the cluster tree

Author: Kamalika Chaudhuri, Sanjoy Dasgupta

Abstract: For a density f on Rd , a high-density cluster is any connected component of {x : f (x) ≥ λ}, for some λ > 0. The set of all high-density clusters form a hierarchy called the cluster tree of f . We present a procedure for estimating the cluster tree given samples from f . We give ﬁnite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem. 1

5 0.91357166 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification

Author: Tobias Glasmachers

Abstract: Steinwart was the ﬁrst to prove universal consistency of support vector machine classiﬁcation. His proof analyzed the ‘standard’ support vector machine classiﬁer, which is restricted to binary classiﬁcation problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classiﬁcation to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1

same-paper 6 0.90589643 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization

7 0.90295291 229 nips-2010-Reward Design via Online Gradient Ascent

8 0.88591653 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs

9 0.87629735 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

10 0.87438399 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework

11 0.87283427 287 nips-2010-Worst-Case Linear Discriminant Analysis

12 0.86981767 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

13 0.86876124 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates

14 0.86856151 63 nips-2010-Distributed Dual Averaging In Networks

15 0.8680281 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

16 0.86554348 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

17 0.86548829 155 nips-2010-Learning the context of a category

18 0.86516315 280 nips-2010-Unsupervised Kernel Dimension Reduction

19 0.86486346 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

20 0.86439639 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points