nips nips2007 nips2007-187 knowledge-graph by maker-knowledge-mining

187 nips-2007-Structured Learning with Approximate Inference

Source: pdf

Author: Alex Kulesza, Fernando Pereira

Abstract: In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. However, when inference is used in a learning algorithm, a good approximation of the score may not be sufﬁcient. We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. There are two reasons for this. First, approximate methods can effectively reduce the expressivity of an underlying model by making it impossible to choose parameters that reliably give good predictions. Second, approximations can respond to parameter changes in such a way that standard learning algorithms are misled. In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. We argue that without understanding combinations of inference and learning, such as these, that are appropriately compatible, learning performance under approximate inference cannot be guaranteed. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. [sent-3, score-0.542]

2 However, when inference is used in a learning algorithm, a good approximation of the score may not be sufﬁcient. [sent-4, score-0.29]

3 We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. [sent-5, score-0.338]

4 In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. [sent-9, score-0.747]

5 We argue that without understanding combinations of inference and learning, such as these, that are appropriately compatible, learning performance under approximate inference cannot be guaranteed. [sent-10, score-0.626]

6 1 Introduction Structured prediction models commonly involve complex inference problems for which ﬁnding exact solutions is intractable [1]. [sent-11, score-0.328]

7 Directly, models used in practice can be restricted to those for which inference is feasible, such as conditional random ﬁelds on trees [2] or associative Markov networks with binary labels [3]. [sent-13, score-0.381]

8 Since some form of inference is the dominant subroutine for all structured learning algorithms, it is natural to see good approximate inference techniques as solutions to the problem of tractable learning as well. [sent-15, score-0.685]

9 A number of authors have taken this approach, using inference approximations as drop-in replacements during training, often with empirical success [3, 8]. [sent-16, score-0.29]

10 And yet there has been little theoretical analysis of the relationship between approximate inference and reliable learning. [sent-17, score-0.338]

11 We demonstrate with two counterexamples that the characteristics of approximate inference algorithms relevant for learning can be distinct from those, such as approximation guarantees, that make them appropriate for prediction. [sent-18, score-0.389]

12 First, we show that approximations can reduce the expressivity of a model, making previously simple concepts impossible to implement and hence to learn, even though inference meets an approximation guarantee. [sent-19, score-0.391]

13 It is therefore crucial to choose compatible inference and learning procedures. [sent-21, score-0.299]

14 1 With these considerations in mind, we prove that LP-relaxation-based approximate inference procedures are compatible with the structured perceptron [9] as well as empirical risk minimization with a margin criterion using the PAC-Bayes framework [10, 11]. [sent-23, score-0.872]

15 2 Setting Given a scoring model S(y|x) over candidate labelings y for input x, exact Viterbi inference is the computation of the optimal labeling h(x) = arg max S(y|x) . [sent-24, score-0.735]

16 (1) y In a prediction setting, the goal of approximate inference is to compute efﬁciently a prediction with the highest possible score. [sent-25, score-0.338]

17 Instead, we assume a ﬁxed loss function L(y|x) that measures the true cost of predicting y given x, a distribution D over inputs x, and a parameterized scoring model Sθ (y|x) with associated optimal labeling function hθ and inference algorithm Aθ . [sent-27, score-0.464]

18 In this work we consider the impact of approximate inference on both criteria. [sent-31, score-0.338]

19 We model our examples as pairwise Markov random ﬁelds (MRFs) deﬁned over a graph G = (V, E) with probabilistic scoring model P (y|x) ∝ ψi (yi |x) i∈V ψij (yi , yj |x) , (3) ij∈E where ψi (yi |x) and ψij (yi , yj |x) are positive potentials. [sent-32, score-0.486]

20 For learning, we use log-linear potentials ψi (yi |x) = exp(w · f (x, yi )) assuming a feature function f (·) and parameter vector w. [sent-33, score-0.312]

21 Since MRFs are probabilistic, we also refer to Viterbi inference as maximum a posteriori (MAP) inference. [sent-34, score-0.257]

22 3 Algorithmic separability The existence of suitable model parameters θ is captured by the standard notion of separability. [sent-35, score-0.264]

23 However, approximate inference may not be able to match exactly the separating hypothesis hθ . [sent-38, score-0.38]

24 We need a notion of separability that takes into account the (approximate) inference algorithm. [sent-39, score-0.521]

25 A distribution D is algorithmically separable with respect to parameterized inference algorithm Aθ and loss L(y|x) if there exists θ such that Ex∼D [L(Aθ (x), x)] = 0. [sent-41, score-0.587]

26 While separability characterizes data distributions with respect to models, algorithmic separability characterizes data distributions with respect to inference algorithms. [sent-42, score-0.957]

27 Note that algorithmic separability is more general than standard separability for any decidable model, since we can design an (inefﬁcient) algorithm Aθ (x) = hθ (x)2 . [sent-43, score-0.636]

28 However, we show by counterexample that even algorithms with provable approximation guarantees can make separable problems algorithmically inseparable. [sent-44, score-0.289]

29 1 LP-relaxed inference Consider the simple Markov random ﬁeld pictured in Figure 1, a triangle in which each node has as its set of allowed labels a different pair of the three possible labels A, B, and C. [sent-46, score-0.543]

30 Let the node potentials ψi (yi ) be ﬁxed to 1 so that labeling preferences derive only from edge potentials. [sent-47, score-0.35]

31 Note further that algorithmic separability supports inference algorithms that are not based on any abstract model at all; such algorithms can describe arbitrary “black box” functions from parameters to predictions. [sent-49, score-0.629]

32 2 2 constants λij , deﬁne edge potentials ψij (yi , yj ) = exp(λij ) whenever yi = yj and ψij (yi , yj ) = 1 otherwise. [sent-51, score-0.971]

33 Then the joint probability of a conﬁguration y = (y1 , y2 , y3 ) is given by   P (y) ∝ exp(λij ) = exp  ij:yi =yj and the MAP labeling is arg maxy I(yi = yj )λij  (4) i,j i,j I(yi = yj )λij . [sent-52, score-0.587]

34 We can therefore perform approximate inference using a linear programming (LP) relaxation and get a multiplicative approximation guarantee [3]. [sent-54, score-0.338]

35 We begin by writing an integer program for computing the MAP labeling; below, µi (yi ) indicates node i taking label yi (which ranges over the two allowed labels for node i) and µij (yi , yj ) indicates nodes i and j taking labels yi and yj , respectively. [sent-55, score-1.36]

36 yi µij (yi , yj ) ≤ µi (yi ) ∀ij, yi , yj Figure 1: A simple MRF. [sent-58, score-0.794]

37 Letting i∗ j ∗ = arg maxij λij , it is easy to see that the correct MAP conﬁguration assigns matching labels to nodes i∗ and j ∗ and an arbitrary label to the third. [sent-61, score-0.314]

38 The fractional labeling µ = 1/2 is the most uninformative possible; it suggests that all labelings are equally valid. [sent-65, score-0.361]

39 Even so, (λ12 + λ23 + λ31 )/2 ≤ 3λi∗ j ∗ /2 by the deﬁnition of i∗ j ∗ , so LP-relaxed inference for this MRF has a relatively good approximation ratio of 3/2. [sent-66, score-0.257]

40 2 Learning with LP-relaxed inference Suppose now that we wish to learn to predict labelings y from instances of the MRF in Figure 1 with positive features given by x = (x12 , x23 , x31 ). [sent-68, score-0.49]

41 We will parameterize the model using a positive weight vector w = (w12 , w23 , w31 ), letting λij = wij xij . [sent-69, score-0.228]

42 Then assigning matching labels to nodes i∗ and j ∗ and an arbitrary label to the third node yields a 0-loss conﬁguration. [sent-72, score-0.287]

43 It is clear, ﬁrst of all, that this problem is separable; if w = (1, 1, 1), λij = xij and the solution to the integer program above coincides with the labeling rule. [sent-74, score-0.247]

44 In order to correctly label the instance x = (4, 3, 3) we must have, at a minimum, λ12 > λ23 , λ31 (equivalently 4w12 > 3w23 , 3w31 ) since the 0-loss labeling must have higher objective score than any other labeling. [sent-77, score-0.282]

45 3 As a result, LP-relaxed inference predicts µ = 1/2. [sent-81, score-0.257]

46 The data cannot be correctly labeled using an LP-relaxation with any choice of weight vector, and the example is therefore algorithmically inseparable. [sent-82, score-0.308]

47 4 Insufﬁciency of algorithmic separability We cannot expect to learn without algorithmic separability; no amount of training can hope to be successful when there simply do not exist acceptable model parameters. [sent-83, score-0.511]

48 Learning techniques exploit assumptions about the underlying model to search parameter space; the perceptron, for example, assumes that increasing weights for features present in correct labelings but not incorrect labelings will lead to better predictions. [sent-86, score-0.437]

49 While this is formally true with respect to an underlying linear model, inexact inference methods can disturb and even invert such assumptions. [sent-87, score-0.333]

50 1 Loopy inference Loopy belief propagation (LBP) is a common approximate inference procedure in which maxproduct message passing, known to be exact for trees, is applied to arbitrary, cyclic graphical models [5]. [sent-89, score-0.699]

51 Because LBP does not respond to model parameters in the usual way, its predictions can lead a learner away from appropriate parameters even for algorithmically separable problems. [sent-91, score-0.289]

52 Suppose that node potentials are assigned by type, where each node is of type A or B as indicated and α and β are real-valued parameters: ψA (−1) = 1 ψB (−1) = 1 Figure 2: An MRF on which LBP is inexact. [sent-94, score-0.286]

53 ψA (1) = eα ψB (1) = eβ Also let edge potentials ψij (yi , yj ) be equal to the constant λ when yi = yj and 1 otherwise. [sent-95, score-0.754]

54 In general, max-product LBP on pairwise MRFs requires iterating the following rule to update messages mij (yj ) from node i to node j, where yj ranges over the possible labels for node j and N (i) is the neighbor set of node i. [sent-100, score-0.883]

55   mij (yj ) = max ψij (yi , yj )ψi (yi ) yi mki (yi ) (5) k∈N (i)\{j} Since we take λ to be suitably positive in our example, we can eliminate the max, letting yi = yj , and then divide to remove the edge potentials ψij (yj , yj ) = λ. [sent-101, score-1.315]

56 2 Learning with LBP Suppose now that we wish to use the perceptron algorithm with LBP inference to learn the twoinstance data set shown in Figure 3. [sent-119, score-0.547]

57 For each instance the unshaded nodes are annotated with a feature vector xα = (xα1 , xα2 ) and the shaded nodes are annotated with a feature vector xβ = (xβ1 , xβ2 ). [sent-120, score-0.366]

58 Since we can think of the MAP decision problem as computing sign(α + β) = sign (w · (xα + xβ )), we can apply the perceptron algorithm with update w ← w − y (xα + xβ ), ˆ where y is the sign of the proposed labeling. [sent-130, score-0.373]

59 The standard ˆ perceptron mistake bound guarantees that separable problems require only a ﬁnite number of iterations with exact inference to ﬁnd a separating weight vector. [sent-131, score-0.895]

60 Here, however, LBP causes the perceptron to diverge even though the problem is not only separable but also algorithmically separable. [sent-132, score-0.548]

61 During each pass through the data the weight vector is updated twice: once after mislabeling instance (a) (w ← w − (1, 0. [sent-135, score-0.242]

62 3 Discussion To understand why perceptron learning fails with LBP, it is instructive to visualize the feasible regions of weight space. [sent-143, score-0.443]

63 Exact inference correctly labels instance (a) whenever w1 + 0. [sent-144, score-0.438]

64 089339, so a feasible weight vector must satisfy 5 (a) Exact inference (b) LBP Figure 5: The feasible regions of weight space for exact inference and LBP. [sent-149, score-0.991]

65 95γ > 1, these constraints deﬁne a completely different feasible region of weight space, shown in Figure 5(b). [sent-155, score-0.223]

66 It is clear from the ﬁgures why perceptron does not succeed; it assumes that pushing weights into the feasible region of Figure 5(a) will produce correct labelings, while under LBP the exact opposite is required. [sent-156, score-0.452]

67 Instead, care must be taken to ensure that learning and inference are appropriately matched. [sent-159, score-0.288]

68 In particular, it is generally invalid to assume that an arbitrary choice of approximate inference will lead to useful results when the learning method expects exact feedback. [sent-160, score-0.409]

69 5 Learning bounds for approximate inference In contrast to the failure of LBP in Section 4, appropriate pairs of inference and learning algorithms do exist. [sent-161, score-0.627]

70 We give two bounds using LP-relaxed inference for MRFs with log-linear potentials. [sent-162, score-0.289]

71 First, under the assumption of algorithmic separability, we show that the structured perceptron of Collins [9] makes only a ﬁnite number of mistakes. [sent-163, score-0.457]

72 Then replacing integrality constraints in P with box constraints 0 ≤ zi ≤ 1 yields an LP with a feasible polytope having vertices Z ⊇ Z. [sent-172, score-0.334]

73 We can encode the MAP inference problem for MRFs as an integer program over indicators z with objective w · Φ(x, z) for some Φ linear in z (see, for example, [6]). [sent-178, score-0.329]

74 By Claim 1 and the fact that an optimal vertex always exists, LP-relaxed inference given an input x computes LPw (x) = arg max w · Φ(x, z) . [sent-179, score-0.336]

75 (6) z∈Z (x) We can think of this as exact inference over an expanded set of labelings Z (x), some of which may not be valid (i. [sent-180, score-0.53]

76 Given a sequence of input/labeling pairs {(xi , zi )}, suppose that there exists a weight vector w∗ with unit norm and γ > 0 such that, for all i, w∗ · (Φ(xi , zi ) − Φ(xi , z)) ≥ γ for all z ∈ Z (xi ) \ {zi }. [sent-186, score-0.261]

77 Then the structured perceptron makes at most R2 /γ 2 mistakes. [sent-189, score-0.349]

78 Let wk be the weight vector before the kth mistake; w1 = 0. [sent-191, score-0.279]

79 If (xk , zk ) is the instance on which the kth update occurs and zLP(k) = LPwk (xk ), then by the update rule, wk+1 2 = wk 2 + 2wk · (Φ(xk , zk ) − Φ(xk , zLP(k) )) + Φ(xk , zk ) − Φ(xk , zLP(k) ) ≤ wk 2 2 + R2 . [sent-194, score-0.502]

80 (7) The inequality follows from the fact that LP-relaxed inference maximizes w · Φ(xk , z) over all z ∈ Z (xk ), so the middle term is nonpositive. [sent-195, score-0.257]

81 2 PAC-Bayes The perceptron bound applies when data are perfectly algorithmically separable, but we might also hope to use LP-relaxed inference in the presence of noisy or otherwise almost-separable data. [sent-199, score-0.685]

82 The following theorem adapts an empirical risk minimization bound using the PAC-Bayes framework to show that LP-relaxed inference can also be used to learn successfully in these cases. [sent-200, score-0.397]

83 The measure of empirical risk for a weight vector w over a sample S = (x1 , . [sent-201, score-0.242]

84 6 Related work A number of authors have applied inference approximations to a wide range of learning problems, sometimes with theoretical analysis of approximation quality and often with good empirical results [8, 12, 3]. [sent-214, score-0.29]

85 [13] developed a method for using a linear model to make decisions during a search-based approximate inference process. [sent-217, score-0.338]

86 They showed that perceptron updates give rise to a mistake bound under the assumption that parameters leading to correct decisions exist. [sent-218, score-0.343]

87 Wainright [14] proved that when approximate inference is required at test time due to computational constraints, using an inconsistent (approximate) estimator for learning can be beneﬁcial. [sent-220, score-0.338]

88 In contrast, we consider learning algorithms that use identical inference for both training and testing, minimizing a general measure of empirical risk rather than maximizing data likelihood, and argue for compatibility between the learning method and inference process. [sent-222, score-0.623]

89 They show that this method can outperform exact inference learning when algorithmic separability holds precisely because approximation reduces expressivity; i. [sent-225, score-0.7]

90 When the data are not algorithmically separable, exact inference provides better performance if a large enough sample is available. [sent-228, score-0.497]

91 7 Conclusion Effective use of approximate inference for learning depends on two considerations that are irrelevant for prediction. [sent-231, score-0.338]

92 First, the expressivity of approximate inference, and consequently the bias for learning, can vary signiﬁcantly from that of exact inference. [sent-232, score-0.253]

93 Second, learning algorithms can misinterpret feedback received from approximate inference methods, leading to poor results or even divergence. [sent-233, score-0.338]

94 However, when algorithmic separability holds, the use of LP-relaxed inference with standard learning frameworks yields provably good results. [sent-234, score-0.629]

95 Future work includes the investigation of alternate inference methods that, while potentially less suitable for prediction alone, give better feedback for learning. [sent-235, score-0.257]

96 Conversely, learning methods that are tailored speciﬁcally to particular inference algorithms might show improved performance over those that assume exact inference. [sent-236, score-0.328]

97 Finally, the notion of algorithmic separability and the ways in which it might relate (through approximation) to traditional separability deserve further study. [sent-237, score-0.636]

98 The computational complexity of probabilistic inference using Bayesian belief networks (research note). [sent-240, score-0.257]

99 A linear programming formulation for global inference in natural language tasks. [sent-272, score-0.257]

100 Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. [sent-279, score-0.259]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lbp', 0.395), ('separability', 0.264), ('perceptron', 0.259), ('inference', 0.257), ('yj', 0.217), ('labelings', 0.202), ('ij', 0.183), ('yi', 0.18), ('algorithmically', 0.169), ('mab', 0.127), ('mba', 0.127), ('mbb', 0.127), ('separable', 0.12), ('labeling', 0.114), ('wk', 0.112), ('algorithmic', 0.108), ('expressivity', 0.101), ('lpw', 0.101), ('node', 0.096), ('weight', 0.095), ('potentials', 0.094), ('messages', 0.092), ('structured', 0.09), ('mrfs', 0.089), ('feasible', 0.089), ('approximate', 0.081), ('mij', 0.081), ('labels', 0.079), ('nodes', 0.079), ('inexact', 0.076), ('zlp', 0.076), ('risk', 0.076), ('guration', 0.072), ('exact', 0.071), ('xk', 0.071), ('mrf', 0.062), ('zk', 0.062), ('xij', 0.061), ('lp', 0.06), ('instance', 0.058), ('sign', 0.057), ('ex', 0.056), ('map', 0.053), ('roth', 0.053), ('scoring', 0.052), ('loopy', 0.052), ('counterexamples', 0.051), ('maxij', 0.051), ('mislabeling', 0.051), ('mistake', 0.051), ('dim', 0.049), ('suitably', 0.049), ('zi', 0.047), ('edge', 0.046), ('associative', 0.045), ('fractional', 0.045), ('correctly', 0.044), ('kulesza', 0.044), ('integrality', 0.044), ('adapted', 0.043), ('integer', 0.043), ('separating', 0.042), ('compatible', 0.042), ('loss', 0.041), ('hw', 0.04), ('vertex', 0.04), ('constraints', 0.039), ('arg', 0.039), ('box', 0.038), ('vector', 0.038), ('fernando', 0.038), ('polytope', 0.038), ('annotated', 0.037), ('con', 0.036), ('preference', 0.036), ('collins', 0.036), ('suppose', 0.034), ('kth', 0.034), ('margin', 0.034), ('andrew', 0.034), ('letting', 0.034), ('empirical', 0.033), ('xi', 0.033), ('score', 0.033), ('impossible', 0.033), ('message', 0.033), ('correct', 0.033), ('label', 0.033), ('seeks', 0.032), ('characterizes', 0.032), ('pereira', 0.032), ('bounds', 0.032), ('allowed', 0.032), ('learn', 0.031), ('appropriately', 0.031), ('reasoning', 0.03), ('iterating', 0.03), ('conference', 0.029), ('program', 0.029), ('charles', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 187 nips-2007-Structured Learning with Approximate Inference

Author: Alex Kulesza, Fernando Pereira

2 0.15301384 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

Author: Charlie Frogner, Avi Pfeffer

Abstract: Dynamic Bayesian networks are structured representations of stochastic processes. Despite their structure, exact inference in DBNs is generally intractable. One approach to approximate inference involves grouping the variables in the process into smaller factors and keeping independent beliefs over these factors. In this paper we present several techniques for decomposing a dynamic Bayesian network automatically to enable factored inference. We examine a number of features of a DBN that capture different types of dependencies that will cause error in factored inference. An empirical comparison shows that the most useful of these is a heuristic that estimates the mutual information introduced between factors by one step of belief propagation. In addition to features computed over entire factors, for efﬁciency we explored scores computed over pairs of variables. We present search methods that use these features, pairwise and not, to ﬁnd a factorization, and we compare their results on several datasets. Automatic factorization extends the applicability of factored inference to large, complex models that are undesirable to factor by hand. Moreover, tests on real DBNs show that automatic factorization can achieve signiﬁcantly lower error in some cases. 1

3 0.15089348 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to ﬁnd the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches. Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum aposteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph. Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bi-partite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efﬁciency. The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: For tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bi-partite graphs max-product and LP also yield the exact MAP [1]. Finally, Tree-Reweighted max-product (TRMP) algorithms [5, 10] were shown to converge to the LP solution for binary xi variables, as shown in [6]. In this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge, and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP 1 solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP. 1 The Max-Product and MPLP Algorithms The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, or in fact converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product, but have several key advantages: • After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a ﬁxed point of the update rule. • No additional parameters (e.g., tree weights as in [6]) are required. • If the ﬁxed point beliefs have a unique maximizer then they correspond to the exact MAP. • For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will ﬁnd the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable). Pseudo code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization of a set of dual variables. For EMPLP minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties. 2 The MAP Problem and its LP Relaxation We consider functions over n variables x = {x1 , . . . , xn } deﬁned as follows. Given a graph G = (V, E) with n vertices, and potentials θij (xi , xj ) for all edges ij ∈ E, deﬁne the function1 f (x; θ) = θij (xi , xj ) . (1) ij∈E The MAP problem is deﬁned as ﬁnding an assignment xM that maximizes the function f (x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij (xi , xj )}ij∈E distributions over variables corresponding to edges ij ∈ E and {µi (xi )}i∈V distributions corresponding to nodes i ∈ V . We will use µ to denote a given set of distributions over all edges and nodes. The set ML (G) is deﬁned as the set of µ where pairwise and singleton distributions are consistent x ˆ xi µij (ˆi , xj ) = µj (xj ) , ˆ xj µij (xi , xj ) = µi (xi ) ∀ij ∈ E, xi , xj ˆ ML (G) = µ ≥ 0 ∀i ∈ V xi µi (xi ) = 1 Now consider the following linear program: MAPLPR : µL∗ = arg max µ∈ML (G) µ·θ. (2) where µ·θ is shorthand for µ·θ = ij∈E xi ,xj θij (xi , xj )µij (xi , xj ). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. µL∗ ·θ ≥ f (xM ). Furthermore, when the optimal µi (xi ) have only integral values, the assignment that maximizes µi (xi ) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR. 1 P We note that some authors also add a term i∈V θi (xi ) to f (x; θ). However, these terms can be included in the pairwise functions θij (xi , xj ), so we ignore them for simplicity. 2 3 The LP Relaxation Dual Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML (G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition. Proposition 1 The following optimization problem is a convex dual of MAPLPR DMAPLPR : min max xi i s.t. max βki (xk , xi ) (3) k∈N (i) xk βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ) , where the dual variables are βij (xi , xj ) for all ij, ji ∈ E and values of xi and xj . The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star shaped graph Gi consisting of node i and all its neighbors N (i). Assume the potential on edge ki (for k ∈ N (i)) is βki (xk , xi ). The value of the MAP assignment for this model is max max βki (xk , xi ). This is exactly the term in the objective of DMAPLPR. Thus the dual xi k∈N (i) xk corresponds to individually decoding star graphs around all nodes i ∈ V where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR. 4 Block Coordinate Descent in the Dual To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, ﬁx all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider ﬁxing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji ), and minimizing DMAPLPR over the non-ﬁxed variables. Only two terms in the DMAPLPR objective depend on βij and βji . We can write those as f (βij , βji ) = max λ−j (xi ) + max βji (xj , xi ) + max λ−i (xj ) + max βij (xi , xj ) i j xi where we deﬁned λ−j (xi ) = i xj k∈N (i)\j xi xi (4) λki (xi ) and λki (xi ) = maxxk βki (xk , xi ) as in App. A. Note that the function f (βij , βji ) depends on the other β values only through λ−i (xj ) and λ−j (xi ). j i This implies that the optimization can be done solely in terms of λij (xj ) and there is no need to store the β values explicitly. The optimal βij , βji are obtained by minimizing f (βij , βji ) subject to the re-parameterization constraint βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ). The following proposition characterizes the minimum of f (βij , βji ). In fact, as mentioned above, we do not need to characterize the optimal βij (xi , xj ) itself, but only the new λ values. Proposition 2 Maximizing the function f (βij , βji ) yields the following λji (xi ) (and the equivalent expression for λij (xj )) 1 −j 1 λji (xi ) = − λi (xi ) + max λ−i (xj ) + θij (xi , xj ) j 2 2 xj The proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji (xi ) and λij (xj ), both these messages need to be updated simultaneously. We proceed to derive the NMPLP algorithm. For a given node i ∈ V , we consider all its neighbors j ∈ N (i), and wish to optimize over the variables βji (xj , xi ) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are ﬁxed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of 3 Inputs: A graph G = (V, E), potential functions θij (xi , xj ) for each edge ij ∈ E. Initialization: Initialize messages to any value. Algorithm: • Iterate until a stopping criterion is satisﬁed: – Max-product: Iterate over messages and update (cji shifts the max to zero) h i mji (xi )← max m−i (xj ) + θij (xi , xj ) − cji j xj – EMPLP: For each ij ∈ E, update λji (xi ) and λij (xj ) simultaneously (the update for λij (xj ) is the same with i and j exchanged) h i 1 1 λji (xi )← − λ−j (xi ) + max λ−i (xj ) + θij (xi , xj ) j i 2 2 xj – NMPLP: Iterate over nodes i ∈ V and update all γij (xj ) where j ∈ N (i) 2 3 X 2 γij (xj )← max 4θij (xi , xj ) − γji (xi ) + γki (xi )5 xi |N (i)| + 1 k∈N(i) • Calculate node “beliefs”: Set biP i ) to be the sum of incoming messages into node i ∈ V (x (e.g., for NMPLP set bi (xi ) = k∈N(i) γki (xi )). Output: Return assignment x deﬁned as xi = arg maxxi b(ˆi ). x ˆ Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use mesP sages mij , λij and γij respectively. We use the notation m−i (xj ) = k∈N(j)\i mkj (xj ). j this optimization can be found in closed form. The assumption about β being ﬁxed outside the star implies that λ−i (xj ) is ﬁxed. Deﬁne: γji (xi ) = maxxj θij (xi , xj ) + λ−i (xj ) . Simple algebra j j yields the following relation between λ−j (xi ) and γki (xi ) for k ∈ N (i) i 2 λ−j (xi ) = −γji (xi ) + γki (xi ) (5) i |N (i)| + 1 k∈N (i) Plugging this into the deﬁnition of γji (xi ) we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value since it can be shown that after one iteration they will correspond to valid β values. 5 Convergence Properties The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a ﬁxed point of its updates. This in itself does not guarantee convergence to the dual optimum since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difﬁculty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP ﬁxed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained. Proposition 3 If the ﬁxed point of MPLP has bi (xi ) such that for all i the function bi (xi ) has a unique maximizer x∗ , then x∗ is the solution to the MAP problem and the LP relaxation is exact. i Since the dual objective is always greater than or equal to the MAP value, it sufﬁces to show that there exists a dual feasible point whose objective value is f (x∗ ). Denote by β ∗ , λ∗ the value of the corresponding dual parameters at the ﬁxed point of MPLP. Then the dual objective satisﬁes λ∗ (xi ) = ki max i xi k∈N (i) ∗ max βki (xk , x∗ ) = i i k∈N (i) xk ∗ βki (x∗ , x∗ ) = f (x∗ ) k i i 4 k∈N (i) To see why the second equality holds, note that bi (x∗ ) = maxxi ,xj λ−j (xi ) + βji (xj , xi ) and i i bj (x∗ ) = maxxi ,xj λ−i (xj ) + βij (xi , xj ). By the equalization property in Eq. 9 the arguments of j j the two max operations are equal. From the unique maximum assumption it follows that x∗ , x∗ are i j the unique maximizers of the above. It follows that βji , βij are also maximized by x∗ , x∗ . i j In the general case, the MPLP ﬁxed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, ﬁxed points do correspond to primal solutions, as the following proposition states. Proposition 4 When xi are binary, the MPLP ﬁxed point can be used to obtain the primal optimum. The claim can be shown by constructing a primal optimal solution µ∗ . For tied bi , set µ∗ (xi ) to 0.5 i and for untied bi , set µ∗ (x∗ ) to 1. If bi , bj are not tied we set µ∗ (x∗ , x∗ ) = 1. If bi is not tied but bj i i ij i j is, we set µ∗ (x∗ , xj ) = 0.5. If bi , bj are tied then βji , βij can be shown to be maximized at either ij i x∗ , x∗ = (0, 0), (1, 1) or x∗ , x∗ = (0, 1), (1, 0). We then set µ∗ to be 0.5 at one of these assignment i j i j ij ∗ pairs. The resulting µ∗ is clearly primal feasible. Setting δi = b∗ we obtain that the dual variables i (δ ∗ , λ∗ , β ∗ ) and primal µ∗ satisfy complementary slackness for the LP in Eq. 7 and therefore µ∗ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables. 6 Beyond pairwise potentials: Generalized MPLP In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1 , . . . , ck ⊂ {1, . . . , n} (the set of clusters is denoted by C), and a function f (x; θ) = c θc (xc ) deﬁned via potentials over clusters θc (xc ). The MAP problem in this case also has an LP relaxation (see e.g. [11]). To deﬁne the LP we introduce the following deﬁnitions: S = {c ∩ c : c, c ∈ C, c ∩ c = ∅} is the set of intersection between clusters ˆ ˆ ˆ and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c.We now consider marginals over the variables in c ∈ C and s ∈ S and require that cluster marginals agree on their overlap. Denote this set by ML (C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML (C). As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc (xc ) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation we assume all β are ﬁxed except those corresponding to some cluster c. The resulting messages are λc→s (xs ) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be:   1 1 λ−c (xs ) + max  λ−c (xs ) + θc (xc ) λc→s (xs ) = − 1 − ˆ s s ˆ |S(c)| |S(c)| xc\s s∈S(c)\s ˆ where for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ−c (xs ) is s deﬁned as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here. 7 Related Work Weiss et al. [11] recently studied the ﬁxed points of a class of max-product like algorithms. Their analysis focused on properties of ﬁxed points rather than convergence guarantees. Speciﬁcally, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its ﬁxed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP ﬁxed points, and in fact we can show that MPLP satisﬁes the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress however, that [11] does not address convergence of algorithms, but rather properties of their ﬁxed points, if they converge. MPLP is similar in some aspects to Kolmogorov’s TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation and its ﬁxed points also have similar 5 guarantees to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6). Recently, several new dual LP algorithms have been introduced, which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is speciﬁcally tailored for this case, and uses updates that are similar to our edge based updates. Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual, but importantly does not ﬁnd the minimum for each coordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective. 8 Experiments We compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product (TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is deﬁned as follows: At every iteration the beliefs can be used to obtain an assignment x with value f (x). We deﬁne the hit-time as the ﬁrst iteration at which the maximum value of f (x) is achieved.4 We ﬁrst experimented with a 10 × 10 grid graph, with 5 values per state. The function f (x) was 5 a Potts model: f (x) = The values for θij and θi (xi ) ij∈E θij I(xi = xj ) + i∈V θi (xi ). were randomly drawn from [−cI , cI ] and [−cF , cF ] respectively, and we used values of cI and cF in the range range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the the normalized difference between NMPLP and LP objective was at most 10−3 (median 10−7 ), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar f (x) values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f (x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f (x) value was less than 10−5 , suggesting that the exact MAP solution was found. We next applied the algorithms to the real world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of ﬁnding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GNMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2 − 158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f (x) value. The performance of MP was good on the runs where it converged. 2 As expected, NMPLP was faster than EMPLP so only NMPLP results are given. The edge weights for TRMP corresponded to a uniform distribution over all spanning trees. 4 This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence. 5 The potential θi (xi ) may be folded into the pairwise potential to yield a model as in Eq. 1. 6 Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz 3 6 (a) (b) (c) 100 (d) 0.6 2000 0.04 0.4 0.02 −50 0 −0.02 −0.04 ∆(Value) 0 1000 ∆(Hit Time) ∆(Value) ∆(Hit Time) 50 0 MP TRMP GMPLP 0 −0.2 −1000 −0.4 −0.06 −100 0.2 MP TRMP GMPLP MP TRMP MP TRMP Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f (x) for NMPLP and the other algorithms. All ﬁgures are such that better MPLP performance yields positive Y axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. Only convergent max-product runs are shown. 9 Conclusion We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent of the MAP-LP relaxation dual. The algorithm can also be extended to cluster based functions, which result empirically in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can result in signiﬁcant performance improvements. However generalized max-product algorithms [14] are not guaranteed to converge whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersection between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications. A Derivation of the dual Before deriving the dual, we ﬁrst express the constraint set ML (G) in a slightly different way. The deﬁnition of ML (G) in Sec. 2 uses a single distribution µij (xi , xj ) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote µij (xi , xj ) ¯ and µji (xj , xi ), and we add the constraint that these two copies both equal the original µij (xi , xj ). ¯ For this extended set of pairwise marginals, we consider the following set of constraints which is clearly equivalent to ML (G). On the rightmost column we give the dual variables that will correspond to each constraint (we omit non-negativity constraints). µij (xi , xj ) = µij (xi , xj ) ¯ µji (xj , xi ) = µij (xi , xj ) ¯ x xi µij (ˆi , xj ) = µj (xj ) ˆ ¯ µji (ˆj , xi ) = µi (xi ) ¯ x xj ˆ xi µi (xi ) = 1 ∀ij ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀ij ∈ E, xj ∀ji ∈ E, xi ∀i ∈ V βij (xi , xj ) βji (xj , xi ) λij (xj ) λji (xi ) δi (6) ¯ We denote the set of (µ, µ) satisfying these constraints by ML (G). We can now state an LP that ¯ is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent ¯ problem is to maximize µ · θ subject to (µ, µ) ∈ ML (G) (note that the objective uses the original ¯ µ copy). LP duality transformation of the extended problem yields the following LP min i δi s.t. λij (xj ) − βij (xi , xj ) ≥ 0 βij (xi , xj ) + βji (xj , xi ) = θij (xi , xj ) − k∈N (i) λki (xi ) + δi ≥ 0 ∀ij, ji ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀i ∈ V, xi (7) We next simplify the above LP by eliminating some of its constraints and variables. Since each variable δi appears in only one constraint, and the objective minimizes δi it follows that δi = maxxi k∈N (i) λki (xi ) and the constraints with δi can be discarded. Similarly, since λij (xj ) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, xi , xj λij (xj ) = maxxi βij (xi , xj ) and the constraints with λij (xj ), λji (xi ) can also be discarded. Using the eliminated δi and λji (xi ) 7 variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex since it is a sum of point-wise maxima of convex functions. B Proof of Proposition 2 We wish to minimize f in Eq. 4 subject to the constraint that βij + βji = θij . Rewrite f as f (βij , βji ) = max λ−j (xi ) + βji (xj , xi ) + max λ−i (xj ) + βij (xi , xj ) j i xi ,xj xi ,xj (8) The sum of the two arguments in the max is λ−j (xi ) + λ−i (xj ) + θij (xi , xj ) i j (because of the constraints on β). Thus the minimum must be greater than −j −i 1 2 maxxi ,xj λi (xi ) + λj (xj ) + θij (xi , xj ) . One assignment to β that achieves this minimum is obtained by requiring an equalization condition:7 λ−i (xj ) + βij (xi , xj ) = λ−j (xi ) + βji (xj , xi ) = j i 1 θij (xi , xj ) + λ−j (xi ) + λ−i (xj ) i j 2 (9) which implies βij (xi , xj ) = 1 θij (xi , xj ) + λ−j (xi ) − λ−i (xj ) and a similar expression for βji . i j 2 The resulting λij (xj ) = maxxi βij (xi , xj ) are then the ones in Prop. 2. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship. References [1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientiﬁc, Belmont, MA, 1995. [3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI. 2007. [4] J.K. Johnson, D.M. Malioutov, and A.S. Willsky. Lagrangian relaxation for map estimation in graphical models. In Allerton Conf. Communication, Control and Computing, 2007. [5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006. [6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artiﬁcial Intelligence (UAI). 2005. [7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. [8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and bregman projections. Journal of Machine Learning Research, pages 1627–1653, 2006. [9] P.O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] Y. Weiss, C. Yanover, and T. Meltzer. Map estimation, linear programming and belief propagation with convex free energies. In UAI. 2007. [12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007. [13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Jourmal of Machine Learning Research, 7:1887–1907, 2006. [14] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005. 7 Other solutions are possible but may not yield some of the properties of MPLP. 8

4 0.14230573 146 nips-2007-On higher-order perceptron algorithms

Author: Claudio Gentile, Fabio Vitale, Cristian Brotto

Abstract: A new algorithm for on-line learning linear-threshold functions is proposed which efﬁciently combines second-order statistics about the data with the ”logarithmic behavior” of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (ﬁrst-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms. 1 Introduction and preliminaries The problem of on-line learning linear-threshold functions from labeled data is one which have spurred a substantial amount of research in Machine Learning. The relevance of this task from both the theoretical and the practical point of view is widely recognized: On the one hand, linear functions combine ﬂexiblity with analytical and computational tractability, on the other hand, online algorithms provide efﬁcient methods for processing massive amounts of data. Moreover, the widespread use of kernel methods in Machine Learning (e.g., [24]) have greatly improved the scope of this learning technology, thereby increasing even further the general attention towards the speciﬁc task of incremental learning (generalized) linear functions. Many models/algorithms have been proposed in the literature (stochastic, adversarial, noisy, etc.) : Any list of references would not do justice of the existing work on this subject. In this paper, we are interested in the problem of online learning linear-threshold functions from adversarially generated examples. We introduce a new family of algorithms, collectively called the Higher-order Perceptron algorithm (where ”higher” means here ”higher than one”, i.e., ”higher than ﬁrst-order” descent algorithms such as gradientdescent or standard Perceptron-like algorithms”). Contrary to other higher-order algorithms, such as the ridge regression-like algorithms considered in, e.g., [4, 7], Higher-order Perceptron has the ability to put together in a principled and ﬂexible manner second-order statistics about the data with the ”logarithmic behavior” of multiplicative/dual-norm algorithms (e.g., [18, 19, 6, 13, 15, 20]). Our algorithm exploits a simpliﬁed form of the inverse data matrix, lending itself to be easily combined with the dual norms machinery introduced by [13] (see also [12, 23]). As we will see, this has also computational advantages, allowing us to formulate an efﬁcient (subquadratic) implementation. Our contribution is twofold. First, we provide an initial theoretical analysis suggesting that our algorithm might be seen as a standard Perceptron algorithm [21] operating on a transformed sequence of examples with improved margin properties. The same analysis also suggests a simple (but principled) way of switching on the ﬂy between higher-order and ﬁrst-order updates. This is ∗ The authors gratefully acknowledge partial support by the PASCAL Network of Excellence under EC grant n. 506778. This publication only reﬂects the authors’ views. especially convenient when we deal with kernel functions, a major concern being the sparsity of the computed solution. The second contribution of this paper is an experimental investigation of our algorithm on artiﬁcial and real-world datasets from various domains: We compared Higher-order Perceptron to baseline Perceptron algorithms, like the Second-order Perceptron algorithm deﬁned in [7] and the standard (p-norm) Perceptron algorithm, as in [13, 12]. We found in our experiments that Higher-order Perceptron generalizes quite well. Among our experimental ﬁndings are the following: 1) Higher-order Perceptron is always outperforming the corresponding multiplicative (p-norm) baseline (thus the stored data matrix is always beneﬁcial in terms of convergence speed); 2) When dealing with Euclidean norms (p = 2), the comparison to Second-order Perceptron is less clear and depends on the speciﬁc task at hand. Learning protocol and notation. Our algorithm works in the well-known mistake bound model of on-line learning, as introduced in [18, 2], and further investigated by many authors (e.g., [19, 6, 13, 15, 7, 20, 23] and references therein). Prediction proceeds in a sequence of trials. In each trial t = 1, 2, . . . the prediction algorithm is given an instance vector in Rn (for simplicity, all vectors are normalized, i.e., ||xt || = 1, where || · || is the Euclidean norm unless otherwise speciﬁed), and then guesses the binary label yt ∈ {−1, 1} associated with xt . We denote the algorithm’s prediction by yt ∈ {−1, 1}. Then the true label yt is disclosed. In the case when yt = yt we say that the algorithm has made a prediction mistake. We call an example a pair (xt , yt ), and a sequence of examples S any sequence S = (x1 , y1 ), (x2 , y2 ), . . . , (xT , yT ). In this paper, we are competing against the class of linear-threshold predictors, parametrized by normal vectors u ∈ {v ∈ Rn : ||v|| = 1}. In this case, a common way of measuring the (relative) prediction performance of an algorithm A is to compare the total number of mistakes of A on S to some measure of the linear separability of S. One such measure (e.g., [24]) is the cumulative hinge-loss (or soft-margin) Dγ (u; S) of S w.r.t. a T linear classiﬁer u at a given margin value γ > 0: Dγ (u; S) = t=1 max{0, γ − yt u xt } (observe that Dγ (u; S) vanishes if and only if u separates S with margin at least γ. A mistake-driven algorithm A is one which updates its internal state only upon mistakes. One can therefore associate with the run of A on S a subsequence M = M(S, A) ⊆ {1, . . . , T } of mistaken trials. Now, the standard analysis of these algorithms allows us to restrict the behavior of the comparison class to mistaken trials only and, as a consequence, to reﬁne Dγ (u; S) so as to include only trials in M: Dγ (u; S) = t∈M max{0, γ − yt u xt }. This gives bounds on A’s performance relative to the best u over a sequence of examples produced (or, actually, selected) by A during its on-line functioning. Our analysis in Section 3 goes one step further: the number of mistakes of A on S is contrasted to the cumulative hinge loss of the best u on a transformed ˜ sequence S = ((˜ i1 , yi1 ), (˜ i2 , yi2 ), . . . , (˜ im , yim )), where each instance xik gets transformed x x x ˜ into xik through a mapping depending only on the past behavior of the algorithm (i.e., only on ˜ examples up to trial t = ik−1 ). As we will see in Section 3, this new sequence S tends to be ”more separable” than the original sequence, in the sense that if S is linearly separable with some margin, ˜ then the transformed sequence S is likely to be separable with a larger margin. 2 The Higher-order Perceptron algorithm The algorithm (described in Figure 1) takes as input a sequence of nonnegative parameters ρ1 , ρ2 , ..., and maintains a product matrix Bk (initialized to the identity matrix I) and a sum vector v k (initialized to 0). Both of them are indexed by k, a counter storing the current number of mistakes (plus one). Upon receiving the t-th normalized instance vector xt ∈ Rn , the algorithm computes its binary prediction value yt as the sign of the inner product between vector Bk−1 v k−1 and vector Bk−1 xt . If yt = yt then matrix Bk−1 is updates multiplicatively as Bk = Bk−1 (I − ρk xt xt ) while vector v k−1 is updated additively through the standard Perceptron rule v k = v k−1 + yt xt . The new matrix Bk and the new vector v k will be used in the next trial. If yt = yt no update is performed (hence the algorithm is mistake driven). Observe that ρk = 0 for any k makes this algorithm degenerate into the standard Perceptron algorithm [21]. Moreover, one can easily see that, in order to let this algorithm exploit the information collected in the matrix B (and let the algorithm’s ∞ behavior be substantially different from Perceptron’s) we need to ensure k=1 ρk = ∞. In the sequel, our standard choice will be ρk = c/k, with c ∈ (0, 1). See Sections 3 and 4. Implementing Higher-Order Perceptron can be done in many ways. Below, we quickly describe three of them, each one having its own merits. 1) Primal version. We store and update an n×n matrix Ak = Bk Bk and an n-dimensional column Parameters: ρ1 , ρ2 , ... ∈ [0, 1). Initialization: B0 = I; v 0 = 0; k = 1. Repeat for t = 1, 2, . . . , T : 1. Get instance xt ∈ Rn , ||xt || = 1; 2. Predict yt = SGN(wk−1 xt ) ∈ {−1, +1}, where wk−1 = Bk−1 Bk−1 v k−1 ; 3. Get label yt ∈ {−1, +1}; v k = v k−1 + yt xt 4. if yt = yt then: Bk k = Bk−1 (I − ρk xt xt ) ← k + 1. Figure 1: The Higher-order Perceptron algorithm (for p = 2). vector v k . Matrix Ak is updated as Ak = Ak−1 − ρAk−1 xx − ρxx Ak−1 + ρ2 (x Ak−1 x)xx , taking O(n2 ) operations, while v k is updated as in Figure 1. Computing the algorithm’s margin v Ax can then be carried out in time quadratic in the dimension n of the input space. 2) Dual version. This implementation allows us the use of kernel functions (e.g., [24]). Let us denote by Xk the n × k matrix whose columns are the n-dimensional instance vectors x1 , ..., xk where a mistake occurred so far, and y k be the k-dimensional column vector of the corresponding (k) labels. We store and update the k × k matrix Dk = [di,j ]k i,j=1 , the k × k diagonal matrix Hk = DIAG {hk }, (k) (k) hk = (h1 , ..., hk ) = Xk Xk y k , and the k-dimensional column vector g k = y k + Dk Hk 1k , being 1k a vector of k ones. If we interpret the primal matrix Ak above as Ak = (k) k I + i,j=1 di,j xi xj , it is not hard to show that the margin value wk−1 x is equal to g k−1 Xk−1 x, and can be computed through O(k) extra inner products. Now, on the k-th mistake, vector g can be updated with O(k 2 ) extra inner products by updating D and H in the following way. We let D0 and H0 be empty matrices. Then, given Dk−1 and Hk−1 = DIAG{hk−1 }, we have1 Dk = Dk−1 −ρk bk (k) , where bk = Dk−1 Xk−1 xk , and dk,k = ρ2 xk Xk−1 bk − 2ρk + ρ2 . On (k) k k −ρk bk dk,k the other hand, Hk = DIAG {hk−1 (k) (k) + yk Xk−1 xk , hk }, with hk = y k−1 Xk−1 xk + yk . Observe that on trials when ρk = 0 matrix Dk−1 is padded with a zero row and a zero column. (k) k This amounts to say that matrix Ak = I + i,j=1 di,j xi xj , is not updated, i.e., Ak = Ak−1 . A closer look at the above update mechanism allows us to conclude that the overall extra inner products needed to compute g k is actually quadratic only in the number of past mistaken trials having ρk > 0. This turns out to be especially important when using a sparse version of our algorithm which, on a mistaken trial, decides whether to update both B and v or just v (see Section 4). 3) Implicit primal version and the dual norms algorithm. This is based on the simple observation that for any vector z we can compute Bk z by unwrapping Bk as in Bk z = Bk−1 (I − ρxx )z = Bk−1 z , where vector z = (z − ρx x z) can be calculated in time O(n). Thus computing the margin v Bk−1 Bk−1 x actually takes O(nk). Maintaining this implicit representation for the product matrix B can be convenient when an efﬁcient dual version is likely to be unavailable, as is the case for the multiplicative (or, more generally, dual norms) extension of our algorithm. We recall that a multiplicative algorithm is useful when learning sparse target hyperplanes (e.g., [18, 15, 3, 12, 11, 20]). We obtain a dual norms algorithm by introducing a norm parameter p ≥ 2, and the associated gradient mapping2 g : θ ∈ Rn → θ ||θ||2 / 2 ∈ Rn . Then, in Figure 1, we p normalize instance vectors xt w.r.t. the p-norm, we deﬁne wk−1 = Bk−1 g(Bk−1 v k−1 ), and generalize the matrix update as Bk = Bk−1 (I − ρk xt g(xt ) ). As we will see, the resulting algorithm combines the multiplicative behavior of the p-norm algorithms with the ”second-order” information contained in the matrix Bk . One can easily see that the above-mentioned argument for computing the margin g(Bk−1 v k−1 ) Bk−1 x in time O(nk) still holds. 1 Observe that, by construction, Dk is a symmetric matrix. This mapping has also been used in [12, 11]. Recall that setting p = O(log n) yields an algorithm similar to Winnow [18]. Also, notice that p = 2 yields g = identity. 2 3 Analysis We express the performance of the Higher-order Perceptron algorithm in terms of the hinge-loss behavior of the best linear classiﬁer over the transformed sequence ˜ S = (B0 xt(1) , yt(1) ), (B1 xt(2) , yt(2) ), (B2 xt(3) , yt(3) ), . . . , (1) being t(k) the trial where the k-th mistake occurs, and Bk the k-th matrix produced by the algorithm. Observe that each feature vector xt(k) gets transformed by a matrix Bk depending on past examples ˜ only. This is relevant to the argument that S tends to have a larger margin than the original sequence (see the discussion at the end of this section). This neat ”on-line structure” does not seem to be shared by other competing higher-order algorithms, such as the ”ridge regression-like” algorithms considered, e.g., in [25, 4, 7, 23]. For the sake of simplicity, we state the theorem below only in the case p = 2. A more general statement holds when p ≥ 2. Theorem 1 Let the Higher-order Perceptron algorithm in Figure 1 be run on a sequence of examples S = (x1 , y1 ), (x2 , y2 ), . . . , (xT , yT ). Let the sequence of parameters ρk satisfy 0 ≤ ρk ≤ 1−c , where xt is the k-th mistaken instance vector, and c ∈ (0, 1]. Then the total number m 1+|v k−1 xt | of mistakes satisﬁes3 ˜ ˜ Dγ (u; Sc )) α2 α Dγ (u; Sc )) α2 m≤α + 2+ α + 2, (2) γ 2γ γ γ 4γ holding for any γ > 0 and any unit norm vector u ∈ Rn , where α = α(c) = (2 − c)/c. Proof. The analysis deliberately mimics the standard Perceptron convergence analysis [21]. We ﬁx an arbitrary sequence S = (x1 , y1 ), (x2 , y2 ), . . . , (xT , yT ) and let M ⊆ {1, 2, . . . , T } be the set of trials where the algorithm in Figure 1 made a mistake. Let t = t(k) be the trial where the k-th mistake occurred. We study the evolution of ||Bk v k ||2 over mistaken trials. Notice that the matrix Bk Bk is positive semideﬁnite for any k. We can write ||Bk v k ||2 = ||Bk−1 (I − ρk xt xt ) (v k−1 + yt xt ) ||2 (from the update rule v k = v k−1 + yt xt and Bk = Bk−1 (I − ρk xt xt ) ) = ||Bk−1 v k−1 + yt (1 − ρk yt v k−1 xt − ρk )Bk−1 xt ||2 2 = ||Bk−1 v k−1 || + 2 yt rk v k−1 Bk−1 Bk−1 xt + (using ||xt || = 1) 2 rk ||Bk−1 xt ||2 , where we set for brevity rk = 1 − ρk yt v k−1 xt − ρk . We proceed by upper and lower bounding the above chain of equalities. To this end, we need to ensure rk ≥ 0. Observe that yt v k−1 xt ≥ 0 implies rk ≥ 0 if and only if ρk ≤ 1/(1 + yt v k−1 xt ). On the other hand, if yt v k−1 xt < 0 then, in order for rk to be nonnegative, it sufﬁces to pick ρk ≤ 1. In both cases ρk ≤ (1 − c)/(1 + |v k−1 xt |) implies 2 rk ≥ c > 0, and also rk ≤ (1+ρk |v k−1 xt |−ρk )2 ≤ (2−c)2 . Now, using yt v k−1 Bk−1 Bk−1 xt ≤ 0 (combined with rk ≥ 0), we conclude that ||Bk v k ||2 − ||Bk−1 v k−1 ||2 ≤ (2 − c)2 ||Bk−1 xt ||2 = (2 − c)2 xt Ak−1 xt , where we set Ak = Bk Bk . A simple4 (and crude) upper bound on the last term follows by observing that ||xt || = 1 implies xt Ak−1 xt ≤ ||Ak−1 ||, the spectral norm (largest eigenvalue) of Ak−1 . Since a factor matrix of the form (I − ρ xx ) with ρ ≤ 1 and ||x|| = 1 has k−1 spectral norm one, we have xt Ak−1 xt ≤ ||Ak−1 || ≤ i=1 ||I − ρi xt(i) xt(i) ||2 ≤ 1. Therefore, summing over k = 1, . . . , m = |M| (or, equivalently, over t ∈ M) and using v 0 = 0 yields the upper bound ||Bm v m ||2 ≤ (2 − c)2 m. (3) To ﬁnd a lower bound of the left-hand side of (3), we ﬁrst pick any unit norm vector u ∈ Rn , and apply the standard Cauchy-Schwartz inequality: ||Bm v m || ≥ u Bm v m . Then, we observe that for a generic trial t = t(k) the update rule of our algorithm allows us to write u Bk v k − u Bk−1 v k−1 = rk yt u Bk−1 xt ≥ rk (γ − max{0, γ − yt u Bk−1 xt }), where the last inequality follows from rk ≥ 0 and holds for any margin value γ > 0. We sum 3 ˜ The subscript c in Sc emphasizes the dependence of the transformed sequence on the choice of c. Note that in the special case c = 1 we have ρk = 0 for any k and α = 1, thereby recovering the standard Perceptron bound for nonseparable sequences (see, e.g., [12]). 4 A slightly more reﬁned bound can be derived which depends on the trace of matrices I − Ak . Details will be given in the full version of this paper. the above over k = 1, . . . , m and exploit c ≤ rk ≤ 2 − c after rearranging terms. This gets ˜ ||Bm v m || ≥ u Bm v m ≥ c γ m − (2 − c)Dγ (u; Sc ). Combining with (3) and solving for m gives the claimed bound. From the above result one can see that our algorithm might be viewed as a standard Perceptron ˜ algorithm operating on the transformed sequence Sc in (1). We now give a qualitative argument, ˜ which is suggestive of the improved margin properties of Sc . Assume for simplicity that all examples (xt , yt ) in the original sequence are correctly classiﬁed by hyperplane u with the same margin γ = yt u xt > 0, where t = t(k). According to Theorem 1, the parameters ρ1 , ρ2 , . . . should be small positive numbers. Assume, again for simplicity, that all ρk are set to the same small enough k value ρ > 0. Then, up to ﬁrst order, matrix Bk = i=1 (I − ρ xt(i) xt(i) ) can be approximated as Bk I −ρ k i=1 xt(i) xt(i) . Then, to the extent that the above approximation holds, we can write:5 yt u Bk−1 xt = yt u I −ρ = yt u xt − ρ yt k−1 i=1 xt(i) xt(i) xt = yt u k−1 i=1 I −ρ k−1 i=1 yt(i) xt(i) yt(i) xt(i) xt yt(i) u xt(i) yt(i) xt(i) xt = γ − ρ γ yt v k−1 xt . Now, yt v k−1 xt is the margin of the (ﬁrst-order) Perceptron vector v k−1 over a mistaken trial for the Higher-order Perceptron vector wk−1 . Since the two vectors v k−1 and wk−1 are correlated (recall that v k−1 wk−1 = v k−1 Bk−1 Bk−1 v k−1 = ||Bk−1 v k−1 ||2 ≥ 0) the mistaken condition yt wk−1 xt ≤ 0 is more likely to imply yt v k−1 xt ≤ 0 than the opposite. This tends to yield a margin larger than the original margin γ. As we mentioned in Section 2, this is also advantageous from a computational standpoint, since in those cases the matrix update Bk−1 → Bk might be skipped (this is equivalent to setting ρk = 0), still Theorem 1 would hold. Though the above might be the starting point of a more thorough theoretical understanding of the margin properties of our algorithm, in this paper we prefer to stop early and leave any further investigation to collecting experimental evidence. 4 Experiments We tested the empirical performance of our algorithm by conducting a number of experiments on a collection of datasets, both artiﬁcial and real-world from diverse domains (Optical Character Recognition, text categorization, DNA microarrays). The main goal of these experiments was to compare Higher-order Perceptron (with both p = 2 and p > 2) to known Perceptron-like algorithms, such as ﬁrst-order [21] and second-order Perceptron [7], in terms of training accuracy (i.e., convergence speed) and test set accuracy. The results are contained in Tables 1, 2, 3, and in Figure 2. Task 1: DNA microarrays and artiﬁcial data. The goal here was to test the convergence properties of our algorithms on sparse target learning tasks. We ﬁrst tested on a couple of well-known DNA microarray datasets. For each dataset, we ﬁrst generated a number of random training/test splits (our random splits also included random permutations of the training set). The reported results are averaged over these random splits. The two DNA datasets are: i. The ER+/ER− dataset from [14]. Here the task is to analyze expression proﬁles of breast cancer and classify breast tumors according to ER (Estrogen Receptor) status. This dataset (which we call the “Breast” dataset) contains 58 expression proﬁles concerning 3389 genes. We randomly split 1000 times into a training set of size 47 and a test set of size 11. ii. The “Lymphoma” dataset [1]. Here the goal is to separate cancerous and normal tissues in a large B-Cell lymphoma problem. The dataset contains 96 expression proﬁles concerning 4026 genes. We randomly split the dataset into a training set of size 60 and a test set of size 36. Again, the random split was performed 1000 times. On both datasets, the tested algorithms have been run by cycling 5 times over the current training set. No kernel functions have been used. We also artiﬁcially generated two (moderately) sparse learning problems with margin γ ≥ 0.005 at labeling noise levels η = 0.0 (linearly separable) and η = 0.1, respectively. The datasets have been generated at random by ﬁrst generating two (normalized) target vectors u ∈ {−1, 0, +1}500 , where the ﬁrst 50 components are selected independently at random in {−1, +1} and the remaining 450 5 Again, a similar argument holds in the more general setting p ≥ 2. The reader should notice how important the dependence of Bk on the past is to this argument. components are 0. Then we set η = 0.0 for the ﬁrst target and η = 0.1 for the second one and, corresponding to each of the two settings, we randomly generated 1000 training examples and 1000 test examples. The instance vectors are chosen at random from [−1, +1]500 and then normalized. If u · xt ≥ γ then a +1 label is associated with xt . If u · xt ≤ −γ then a −1 label is associated with xt . The labels so obtained are ﬂipped with probability η. If |u · xt | < γ then xt is rejected and a new vector xt is drawn. We call the two datasets ”Artiﬁcial 0.0 ” and ”Artiﬁcial 0.1 ”. We tested our algorithms by training over an increasing number of epochs and checking the evolution of the corresponding test set accuracy. Again, no kernel functions have been used. Task 2: Text categorization. The text categorization datasets are derived from the ﬁrst 20,000 newswire stories in the Reuters Corpus Volume 1 (RCV1, [22]). A standard TF - IDF bag-of-words encoding was used to transform each news story into a normalized vector of real attributes. We built four binary classiﬁcation problems by “binarizing” consecutive news stories against the four target categories 70, 101, 4, and 59. These are the 2nd, 3rd, 4th, and 5th most frequent6 categories, respectively, within the ﬁrst 20,000 news stories of RCV1. We call these datasets RCV1x , where x = 70, 101, 4, 59. Each dataset was split into a training set of size 10,000 and a test set of the same size. All algorithms have been trained for a single epoch. We initially tried polynomial kernels, then realized that kernel functions did not signiﬁcantly alter our conclusions on this task. Thus the reported results refer to algorithms with no kernel functions. Task 3: Optical character recognition (OCR). We used two well-known OCR benchmarks: the USPS dataset and the MNIST dataset [16] and followed standard experimental setups, such as the one in [9], including the one-versus-rest scheme for reducing a multiclass problem to a set of binary tasks. We used for each algorithm the standard Gaussian and polynomial kernels, with parameters chosen via 5-fold cross validation on the training set across standard ranges. Again, all algorithms have been trained for a single epoch over the training set. The results in Table 3 only refer to the best parameter settings for each kernel. Algorithms. We implemented the standard Perceptron algorithm (with and without kernels), the Second-order Perceptron algorithm, as described in [7] (with and without kernels), and our Higherorder Perceptron algorithm. The implementation of the latter algorithm (for both p = 2 and p > 2) was ”implicit primal” when tested on the sparse learning tasks, and in dual variables for the other two tasks. When using Second-order Perceptron, we set its parameter a (see [7] for details) by testing on a generous range of values. For brevity, only the settings achieving the best results are reported. On the sparse learning tasks we tried Higher-order Perceptron with norm p = 2, 4, 7, 10, while on the other two tasks we set p = 2. In any case, for each value of p, we set7 ρk = c/k, with c = 0, 0.2, 0.4, 0.6, 0.8. Since c = 0 corresponds to a standard p-norm Perceptron algorithm [13, 12] we tried to emphasize the comparison c = 0 vs. c > 0. Finally, when using kernels on the OCR tasks, we also compared to a sparse dual version of Higher-order Perceptron. On a mistaken round t = t(k), this algorithm sets ρk = c/k if yt v k−1 xt ≥ 0, and ρk = 0 otherwise (thus, when yt v k−1 xt < 0 the matrix Bk−1 is not updated). For the sake of brevity, the standard Perceptron algorithm is called FO (”First Order”), the Second-order algorithm is denoted by SO (”Second Order”), while the Higher-order algorithm with norm parameter p and ρk = c/k is abbreviated as HOp (c). Thus, for instance, FO = HO2 (0). Results and conclusions. Our Higher-order Perceptron algorithm seems to deliver interesting results. In all our experiments HOp (c) with c > 0 outperforms HOp (0). On the other hand, the comparison HOp (c) vs. SO depends on the speciﬁc task. On the DNA datasets, HOp (c) with c > 0 is clearly superior in Breast. On Lymphoma, HOp (c) gets worse as p increases. This is a good indication that, in general, a multiplicative algorithm is not suitable for this dataset. In any case, HO2 turns out to be only slightly worse than SO. On the artiﬁcial datasets HOp (c) with c > 0 is always better than the corresponding p-norm Perceptron algorithm. On the text categorization tasks, HO2 tends to perform better than SO. On USPS, HO2 is superior to the other competitors, while on MNIST it performs similarly when combined with Gaussian kernels (though it turns out to be relatively sparser), while it is slightly inferior to SO when using polynomial kernels. The sparse version of HO2 cuts the matrix updates roughly by half, still maintaining a good performance. In all cases HO2 (either sparse or not) signiﬁcantly outperforms FO. In conclusion, the Higher-order Perceptron algorithm is an interesting tool for on-line binary clas6 7 We did not use the most frequent category because of its signiﬁcant overlap with the other ones. Notice that this setting fulﬁlls the condition on ρk stated in Theorem 1. Table 1: Training and test error on the two datasets ”Breast” and ”Lymphoma”. Training error is the average total number of updates over 5 training epochs, while test error is the average fraction of misclassiﬁed patterns in the test set, The results refer to the same training/test splits. For each algorithm, only the best setting is shown (best training and best test setting coincided in these experiments). Thus, for instance, HO2 differs from FO because of the c parameter. We emphasized the comparison HO7 (0) vs. HO7 (c) with best c among the tested values. According to Wilcoxon signed rank test, an error difference of 0.5% or larger might be considered signiﬁcant. In bold are the smallest ﬁgures achieved on each row of the table. FO TRAIN TEST TRAIN TEST LYMPHOMA HO 2 HO 4 HO 7 (0) HO 7 HO 10 SO 45.2 23.4% 22.1 11.8% 21.7 16.4% 19.6 10.0% 24.5 13.3% 18.9 10.0% 47.4 15.7% 23.0 11.5% 24.5 12.0% 20.0 11.5% 32.4 13.5 23.1 11.9% 29.6 15.0% 19.3 9.6% FO = HO 2(0.0) Training updates vs training epochs on Artificial 0.0 SO # of training updates 800 * HO 4(0.4) 600 HO 7(0.0) * 400 300 * * * * * SO 2400 HO 2(0.4) 700 500 HO 7 (0.4) * 2000 * 1200 400 2 3 5 10 15 20 * 1 * * 2 3 * Test error rates * * * (a = 0.2) HO 2(0.4) HO 4(0.4) * * * * HO 7(0.0) HO 7 (0.4) 14% Test error rates (minus 10%) FO = HO 2(0.0) SO 18% * HO 7(0.0) HO 7(0.4) 5 10 15 20 # of training epochs Test error rates vs training epochs on Artificial 0.0 22% * * # of training epochs 26% (a = 0.2) HO 2(0.4) HO 4(0.4) 1600 800 * 1 FO = HO 2(0.0) Training updates vs training epochs on Artificial 0.1 (a = 0.2) # of training updates B REAST FO = HO 2(0.0) Test error rates vs training epochs on Artificial 0.1 SO 26% 22% * * * * * * * * (a = 0.2) HO 2(0.4) HO 4(0.4) 18% HO 7(0.0) 14% HO 7 (0.4) 10% 10% 6% 6% 1 2 3 5 10 # of training epochs 15 20 1 2 3 5 10 15 20 # of training epochs Figure 2: Experiments on the two artiﬁcial datasets (Artiﬁcial0.0 , on the left, and Artiﬁcial0.1 , on the right). The plots give training and test behavior as a function of the number of training epochs. Notice that the test set in Artiﬁcial0.1 is affected by labelling noise of rate 10%. Hence, a visual comparison between the two plots at the bottom can only be made once we shift down the y-axis of the noisy plot by 10%. On the other hand, the two training plots (top) are not readily comparable. The reader might have difﬁculty telling apart the two kinds of algorithms HOp (0.0) and HOp (c) with c > 0. In practice, the latter turned out to be always slightly superior in performance to the former. siﬁcation, having the ability to combine multiplicative (or nonadditive) and second-order behavior into a single inference procedure. Like other algorithms, HOp can be extended (details omitted due to space limitations) in several ways through known worst-case learning technologies, such as large margin (e.g., [17, 11]), label-efﬁcient/active learning (e.g., [5, 8]), and bounded memory (e.g., [10]). References [1] A. Alizadeh, et al. (2000). Distinct types of diffuse large b-cell lymphoma identiﬁed by gene expression proﬁling. Nature, 403, 503–511. [2] D. Angluin (1988). Queries and concept learning. Machine Learning, 2(4), 319–342. [3] P. Auer & M.K. Warmuth (1998). Tracking the best disjunction. Machine Learning, 32(2), 127–150. [4] K.S. Azoury & M.K. Warmuth (2001). Relative loss bounds for on-line density estimation with the exponential familiy of distributions. Machine Learning, 43(3), 211–246. [5] A. Bordes, S. Ertekin, J. Weston, & L. Bottou (2005). Fast kernel classiﬁers with on-line and active learning. JMLR, 6, 1579–1619. [6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, & M.K. Warmuth (1997). How to use expert advice. J. ACM, 44(3), 427–485. Table 2: Experimental results on the four binary classiﬁcation tasks derived from RCV1. ”Train” denotes the number of training corrections, while ”Test” gives the fraction of misclassiﬁed patterns in the test set. Only the results corresponding to the best test set accuracy are shown. In bold are the smallest ﬁgures achieved for each of the 8 combinations of dataset (RCV1x , x = 70, 101, 4, 59) and phase (training or test). FO TRAIN TEST 993 673 803 767 7.20% 6.39% 6.14% 6.45% RCV170 RCV1101 RCV14 RCV159 HO 2 TRAIN TEST 941 665 783 762 6.83% 5.81% 5.94% 6.04% SO TRAIN TEST 880 677 819 760 6.95% 5.48% 6.05% 6.84% Table 3: Experimental results on the OCR tasks. ”Train” denotes the total number of training corrections, summed over the 10 categories, while ”Test” denotes the fraction of misclassiﬁed patterns in the test set. Only the results corresponding to the best test set accuracy are shown. For the sparse version of HO2 we also reported (in parentheses) the number of matrix updates during training. In bold are the smallest ﬁgures achieved for each of the 8 combinations of dataset (USPS or MNIST), kernel type (Gaussian or Polynomial), and phase (training or test). FO TRAIN U SPS M NIST G AUSS P OLY G AUSS P OLY TEST 1385 1609 5834 8148 6.53% 7.37% 2.10% 3.04% HO 2 TRAIN TEST 945 1090 5351 6404 4.76% 5.71% 1.79% 2.27% Sparse HO2 SO TRAIN TEST TRAIN TEST 965 (440) 1081 (551) 5363 (2596) 6476 (3311) 5.13% 5.52% 1.81% 2.28% 1003 1054 5684 6440 5.05% 5.53% 1.82% 2.03% [7] N. Cesa-Bianchi, A. Conconi & C. Gentile (2005). A second-order perceptron algorithm. SIAM Journal of Computing, 34(3), 640–668. [8] N. Cesa-Bianchi, C. Gentile, & L. Zaniboni (2006). Worst-case analysis of selective sampling for linearthreshold algorithms. JMLR, 7, 1205–1230. [9] C. Cortes & V. Vapnik (1995). Support-vector networks. Machine Learning, 20(3), 273–297. [10] O. Dekel, S. Shalev-Shwartz, & Y. Singer (2006). The Forgetron: a kernel-based Perceptron on a ﬁxed budget. NIPS 18, MIT Press, pp. 259–266. [11] C. Gentile (2001). A new approximate maximal margin classiﬁcation algorithm. JMLR, 2, 213–242. [12] C. Gentile (2003). The Robustness of the p-norm Algorithms. Machine Learning, 53(3), pp. 265–299. [13] A.J. Grove, N. Littlestone & D. Schuurmans (2001). General convergence results for linear discriminant updates. Machine Learning Journal, 43(3), 173–210. [14] S. Gruvberger, et al. (2001). Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res., 61, 5979–5984. [15] J. Kivinen, M.K. Warmuth, & P. Auer (1997). The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artiﬁcial Intelligence, 97, 325–343. [16] Y. Le Cun, et al. (1995). Comparison of learning algorithms for handwritten digit recognition. ICANN 1995, pp. 53–60. [17] Y. Li & P. Long (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1-3), 361–387. [18] N. Littlestone (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4), 285–318. [19] N. Littlestone & M.K. Warmuth (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261. [20] P. Long & X. Wu (2004). Mistake bounds for maximum entropy discrimination. NIPS 2004. [21] A.B.J. Novikov (1962). On convergence proofs on perceptrons. Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII, pp. 615–622. [22] Reuters: 2000. http://about.reuters.com/researchandstandards/corpus/. [23] S. Shalev-Shwartz & Y. Singer (2006). Online Learning Meets Optimization in the Dual. COLT 2006, pp. 423–437. [24] B. Schoelkopf & A. Smola (2002). Learning with kernels. MIT Press. [25] Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69, 213-248.

5 0.12331227 141 nips-2007-New Outer Bounds on the Marginal Polytope

Author: David Sontag, Tommi S. Jaakkola

Abstract: We give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efﬁciently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints on the marginal polytope are derived through a series of projections onto the cut polytope. As a result, we obtain tighter upper bounds on the log-partition function. We also show empirically that the approximations of the marginals are signiﬁcantly more accurate when using the tighter outer bounds. Finally, we demonstrate the advantage of the new constraints for ﬁnding the MAP assignment in protein structure prediction. 1

6 0.12249064 128 nips-2007-Message Passing for Max-weight Independent Set

7 0.10937785 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

8 0.10336293 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching

9 0.098604344 97 nips-2007-Hidden Common Cause Relations in Relational Learning

10 0.093708336 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

11 0.089801297 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

12 0.089760706 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs

13 0.077692762 32 nips-2007-Bayesian Co-Training

14 0.077031821 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

15 0.076031774 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling

16 0.075798713 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

17 0.073599711 62 nips-2007-Convex Learning with Invariances

18 0.072858989 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

19 0.072652273 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

20 0.072254159 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.244), (1, -0.002), (2, -0.114), (3, 0.065), (4, -0.055), (5, -0.18), (6, 0.07), (7, 0.07), (8, 0.086), (9, -0.116), (10, 0.021), (11, -0.017), (12, -0.102), (13, 0.114), (14, 0.068), (15, 0.037), (16, -0.074), (17, 0.011), (18, -0.015), (19, -0.02), (20, 0.136), (21, -0.067), (22, 0.071), (23, 0.045), (24, -0.052), (25, 0.025), (26, 0.003), (27, -0.018), (28, 0.042), (29, -0.136), (30, 0.024), (31, 0.034), (32, -0.075), (33, -0.023), (34, -0.025), (35, -0.025), (36, -0.065), (37, 0.036), (38, 0.138), (39, -0.015), (40, 0.099), (41, -0.051), (42, -0.035), (43, -0.027), (44, -0.094), (45, -0.083), (46, -0.093), (47, -0.054), (48, 0.024), (49, -0.061)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95570654 187 nips-2007-Structured Learning with Approximate Inference

Author: Alex Kulesza, Fernando Pereira

2 0.66297507 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

Author: Charlie Frogner, Avi Pfeffer

3 0.62204331 141 nips-2007-New Outer Bounds on the Marginal Polytope

Author: David Sontag, Tommi S. Jaakkola

4 0.61809051 128 nips-2007-Message Passing for Max-weight Independent Set

Author: Sujay Sanghavi, Devavrat Shah, Alan S. Willsky

Abstract: We investigate the use of message-passing algorithms for the problem of ﬁnding the max-weight independent set (MWIS) in a graph. First, we study the performance of loopy max-product belief propagation. We show that, if it converges, the quality of the estimate is closely related to the tightness of an LP relaxation of the MWIS problem. We use this relationship to obtain sufﬁcient conditions for correctness of the estimate. We then develop a modiﬁcation of max-product – one that converges to an optimal solution of the dual of the MWIS problem. We also develop a simple iterative algorithm for estimating the max-weight independent set from this dual solution. We show that the MWIS estimate obtained using these two algorithms in conjunction is correct when the graph is bipartite and the MWIS is unique. Finally, we show that any problem of MAP estimation for probability distributions over ﬁnite domains can be reduced to an MWIS problem. We believe this reduction will yield new insights and algorithms for MAP estimation. 1

5 0.60458243 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs

Author: Kyomin Jung, Devavrat Shah

Abstract: We present a new local approximation algorithm for computing MAP and logpartition function for arbitrary exponential family distribution represented by a ﬁnite-valued pair-wise Markov random ﬁeld (MRF), say G. Our algorithm is based on decomposing G into appropriately chosen small components; computing estimates locally in each of these components and then producing a good global solution. We prove that the algorithm can provide approximate solution within arbitrary accuracy when G excludes some ﬁnite sized graph as its minor and G has bounded degree: all Planar graphs with bounded degree are examples of such graphs. The running time of the algorithm is Θ(n) (n is the number of nodes in G), with constant dependent on accuracy, degree of graph and size of the graph that is excluded as a minor (constant for Planar graphs). Our algorithm for minor-excluded graphs uses the decomposition scheme of Klein, Plotkin and Rao (1993). In general, our algorithm works with any decomposition scheme and provides quantiﬁable approximation guarantee that depends on the decomposition scheme.

6 0.57879192 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

7 0.50623572 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

8 0.50426871 97 nips-2007-Hidden Common Cause Relations in Relational Learning

9 0.48548892 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling

10 0.47825953 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

11 0.46920887 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching

12 0.45406777 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

13 0.44656712 146 nips-2007-On higher-order perceptron algorithms

14 0.44494164 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

15 0.4290002 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

16 0.42376548 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes

17 0.41882241 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

18 0.4047991 62 nips-2007-Convex Learning with Invariances

19 0.39804062 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs

20 0.39434263 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.078), (13, 0.044), (16, 0.022), (18, 0.013), (21, 0.083), (31, 0.021), (34, 0.049), (35, 0.02), (47, 0.085), (49, 0.011), (70, 0.201), (83, 0.179), (85, 0.03), (87, 0.011), (90, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85988599 187 nips-2007-Structured Learning with Approximate Inference

Author: Alex Kulesza, Fernando Pereira

2 0.75819546 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

Author: Matthias Bethge, Philipp Berens

Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high dimensional measurements of more than thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1

3 0.75781131 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semideﬁnite relaxation that yields global training by eliminating local minima. 1

4 0.75158298 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efﬁcient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

5 0.74990815 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

Author: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-yeung Shum

Abstract: We present a simple new criterion for classiﬁcation, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classiﬁers. Theoretical results provide new insights into relationships among popular classiﬁers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classiﬁcation criterion and its kernel and local versions perform competitively against existing classiﬁers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-speciﬁc information. 1

6 0.74945527 128 nips-2007-Message Passing for Max-weight Independent Set

7 0.74656522 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

8 0.74640614 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

9 0.74523723 115 nips-2007-Learning the 2-D Topology of Images

10 0.74357128 86 nips-2007-Exponential Family Predictive Representations of State

11 0.74355376 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

12 0.74314874 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

13 0.74253249 7 nips-2007-A Kernel Statistical Test of Independence

14 0.74217331 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

15 0.74173838 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

16 0.741274 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

17 0.74038845 156 nips-2007-Predictive Matrix-Variate t Models

18 0.73969042 141 nips-2007-New Outer Bounds on the Marginal Polytope

19 0.73946029 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

20 0.73895669 49 nips-2007-Colored Maximum Variance Unfolding