nips nips2010 nips2010-64 knowledge-graph by maker-knowledge-mining

64 nips-2010-Distributionally Robust Markov Decision Processes


Source: pdf

Author: Huan Xu, Shie Mannor

Abstract: We consider Markov decision processes where the values of the parameters are uncertain. This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level so that a set of admissible probability distributions of the unknown parameters is specified. This formulation models the case where the decision maker is aware of and wants to exploit some (yet imprecise) a-priori information of the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. We propose a decision criterion based on distributional robustness: the optimal policy maximizes the expected total reward under the most adversarial probability distribution over realizations of the uncertain parameters that is admissible (i.e., it agrees with the a-priori information). We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set, hence it can be computed in polynomial time under mild technical conditions.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level so that a set of admissible probability distributions of the unknown parameters is specified. [sent-8, score-0.26]

2 This formulation models the case where the decision maker is aware of and wants to exploit some (yet imprecise) a-priori information of the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. [sent-9, score-0.202]

3 We propose a decision criterion based on distributional robustness: the optimal policy maximizes the expected total reward under the most adversarial probability distribution over realizations of the uncertain parameters that is admissible (i.e., it agrees with the a-priori information). [sent-10, score-0.599]

4 We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set, hence it can be computed in polynomial time under mild technical conditions. [sent-13, score-1.408]

5 In practice, parameter uncertainty – the deviation of the model parameters from the true ones (reward r and transition probability p) – often causes the performance of “optimal” policies to degrade significantly [4]. [sent-15, score-0.231]

6 Many efforts have been made to reduce such performance variation under the robust MDP framework (e.g., …). [sent-16, score-0.286]

7 In this paper we extend the robust MDP framework to deal with probabilistic information on uncertain parameters. [sent-20, score-0.452]

8 If the passing time to area A is uncertain and can be very large, then the solution to the robust MDP would tend to take a detour and avoid A. [sent-23, score-0.46]

9 Indeed, it was observed that since the robust MDP framework ignores probabilistic information, it can provide conservative solutions [11, 12]. [sent-26, score-0.308]

10 For example, in the path planning problem above, the decision maker may not know how to assign probabilities to the model dynamics when a storm occurs. [sent-29, score-0.285]

11 Our approach offers a middle ground between the fully Bayesian approach and the robust approach: we want the decision maker to be able to use prior information but we do not require a complete Bayesian interpretation. [sent-30, score-0.422]

12 We adapt the distributionally robust approach to MDPs under parameter uncertainty. [sent-31, score-0.777]

13 The distributionally robust formulation has been extensively studied and broadly applied in single-stage optimization problems to effectively incorporate a-priori probabilistic information about the unknown parameters (e.g., …). [sent-32, score-0.901]

14 In this framework, the uncertain parameters are regarded as stochastic, with a distribution µ that is not precisely observed, yet assumed to belong to an a-priori known set C. [sent-35, score-0.207]

15 That is, given a utility function u(x, ξ) where x ∈ X is the optimizing variable and ξ is the unknown parameter, distributionally robust optimization solves $\max_{x \in X} \inf_{\mu \in C} E_{\xi \sim \mu}[u(x, \xi)]$. [sent-37, score-0.883]
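
To make the max-min concrete, here is a minimal sketch (not from the paper) with a finite decision set, a finite support for ξ, and an admissible set C given by finitely many candidate distributions; all names and numbers are hypothetical.

```python
import numpy as np

X = [0, 1, 2]                       # candidate decisions (hypothetical)
xi_support = [0, 1]                 # realizations of the uncertain parameter

def u(x, xi):
    # hypothetical utility: larger x pays off unless xi = 1 penalizes it
    return 2.0 * x - 3.0 * x * xi

# admissible distributions over xi (each row is one candidate mu in C)
C = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.3, 0.7]])

def worst_case_value(x):
    expected = C @ np.array([u(x, xi) for xi in xi_support])
    return expected.min()           # most adversarial admissible distribution

x_star = max(X, key=worst_case_value)
print(x_star, worst_case_value(x_star))
```

With these numbers the robust criterion picks x = 0, whereas optimizing against the single nominal distribution [0.9, 0.1] would favor x = 2; this is the normalcy-bias versus zero-risk-bias trade-off discussed below.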

16 Here the goal is to optimize a so-called coherent risk measure, which is shown to be equivalent to a distributionally robust formulation. [sent-39, score-0.777]

17 It is easy to see that the nominal approach and the robust approach suffer from normalcy bias and zero-risk bias, respectively. [sent-45, score-0.413]

18 We formulate and solve the distributionally robust MDP with respect to the nested uncertainty set. [sent-46, score-0.912]

19 The main contribution of this paper is showing that for both the finite horizon case and the discounted reward infinite horizon case, the optimal policy satisfies a Bellman-type equation and can be computed via backward iteration. [sent-53, score-0.41]

20 For example, if the uncertainty set of a robust MDP is not precisely known, then one can instead solve a distributionally robust MDP with a 2-set formulation where the inner and the outer sets represent, respectively, an “optimistic” and a “conservative” estimate. [sent-60, score-1.209]

21 Following Puterman [1], we denote the set of all history-dependent randomized policies by ΠHR , and the set of all Markovian randomized policies by ΠMR . [sent-67, score-0.17]

22 Here rs denotes the vector form of rewards associated with state s, and πs is the (randomized) action chosen at state s under policy π. [sent-70, score-0.456]

23 The elements in vector ps are listed in the following way: the transition probabilities of the same action are arranged in the same block, and inside each block they are listed according to the order of the next state. [sent-71, score-0.622]

24 For a policy π, we denote the expected (discounted) total reward under parameters p, r by u(π, p, r), that is, $u(\pi, p, r) \triangleq E_p^{\pi}\big\{\sum_{i=1}^{T} \gamma^{i-1} r(s_i, a_i)\big\}$. [sent-76, score-0.228]
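
As a reference point for the robust variants below, the following sketch (with made-up numbers, not from the paper) evaluates u(π, p, r) exactly by backward recursion for fixed parameters and a fixed Markovian randomized policy.

```python
import numpy as np

T, gamma, n_states, n_actions = 5, 0.9, 3, 2
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p[s, a, s']
r = rng.uniform(0, 1, size=(n_states, n_actions))                 # r[s, a]
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]

u = np.zeros(n_states)                       # value-to-go after the last stage
for _ in range(T):                           # backward over the T stages
    q = r + gamma * p @ u                    # q[s, a] = r(s,a) + gamma * E[u(s')]
    u = np.einsum("sa,sa->s", pi, q)         # average over the randomized action
print("u(pi, p, r) per initial state:", np.round(u, 3))
```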

25 In this paper we propose and solve the distributionally robust policy under parameter uncertainty, which incorporates a-priori information on how the parameters are distributed. [sent-77, score-0.967]

26 We evaluate each policy by its expected performance under the (respective) most adversarial distribution of the uncertain parameters, and a distributionally robust policy is the optimal policy according to this measure. [sent-79, score-1.502]

27 A policy $\pi^* \in \Pi^{HR}$ is distributionally robust with respect to $C_S$ if, for all $\pi \in \Pi^{HR}$, $\inf_{\mu \in C_S} \int u(\pi, p, r)\, d\mu(p, r) \le \inf_{\mu' \in C_S} \int u(\pi^*, p, r)\, d\mu'(p, r)$. [sent-81, score-0.998]

28 Next we specify the set $C_S$ of admissible distributions of the uncertain parameters investigated in this paper. [sent-82, score-0.223]

29 For a state s, the condition $\mu_s(\mathcal{P}_s^n) = 1$ means that the unknown parameters $(p_s, r_s)$ are restricted to the outermost uncertainty set, and the condition $\mu_s(\mathcal{P}_s^i) \ge \lambda_i$ means that with probability at least $\lambda_i$, $(p_s, r_s) \in \mathcal{P}_s^i$. [sent-87, score-0.445]

30 Thus, $\mathcal{P}_s^1, \cdots, \mathcal{P}_s^n$ provide probabilistic guarantees on $(p_s, r_s)$ for n different uncertainty sets (or, equivalently, confidence levels). [sent-88, score-0.25]

31 In this section we show how to compute distributionally robust policies for MDPs having finitely many decision stages. [sent-92, score-0.912]

32 Thus, we can partition S according to the stage each state belongs to, and let St be the set of states belonging to the t-th stage. [sent-95, score-0.179]

33 (i) Each state belongs to only one stage; (ii) the terminal reward equals zero; and (iii) the first stage only contains one state sini . [sent-98, score-0.312]

34 We next define sequentially robust policies through backward induction, as policies that are robust at every step of a standard robust MDP. [sent-99, score-1.367]

35 We will later show that sequentially robust policies are also distributionally robust when the uncertainty set of the robust MDP is chosen carefully. [sent-100, score-1.741]

36 For s ∈ ST, the sequentially robust value is $\tilde v_T(s) \triangleq 0$. [sent-104, score-0.505]

37 For s ∈ St where t < T, the sequentially robust value $\tilde v_t(s)$ and sequentially robust action $\tilde\pi_s$ are defined as $\tilde v_t(s) \triangleq \max_{\pi_s \in \Delta(s)} \min_{(p_s, r_s) \in \mathcal{P}_s} E_{p_s}^{\pi_s}\big[r(s, a) + \gamma \tilde v_{t+1}(s')\big]$ and $\tilde\pi_s \in \arg\max_{\pi_s \in \Delta(s)} \min_{(p_s, r_s) \in \mathcal{P}_s} E_{p_s}^{\pi_s}\big[r(s, a) + \gamma \tilde v_{t+1}(s')\big]$. [sent-106, score-1.155]
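
The definition above is a backward induction, and a small sketch may help. The assumptions below are not from the paper: each uncertainty set is represented by a finite list of (transition, reward) scenarios per action, and the max is restricted to deterministic actions, whereas the paper optimizes over randomized actions in Delta(s); all numbers are illustrative.

```python
import numpy as np

gamma = 0.95
# toy layered MDP: stage t -> list of states; the terminal stage has value 0
stages = [["s0"], ["s1", "s2"], ["terminal"]]

# uncertainty[s][action] = list of scenarios, each scenario = (probs, reward),
# where probs is a distribution over the states of the next stage (made up)
uncertainty = {
    "s0": {"a": [(np.array([0.8, 0.2]), 1.0), (np.array([0.5, 0.5]), 0.5)],
           "b": [(np.array([0.3, 0.7]), 2.0), (np.array([0.6, 0.4]), 0.0)]},
    "s1": {"a": [(np.array([1.0]), 4.0), (np.array([1.0]), 3.0)]},
    "s2": {"a": [(np.array([1.0]), 1.0), (np.array([1.0]), 2.0)]},
}

v = {"terminal": 0.0}                      # sequentially robust value at stage T
policy = {}
for t in reversed(range(len(stages) - 1)): # backward induction over stages
    next_vals = np.array([v[s] for s in stages[t + 1]])
    for s in stages[t]:
        q = {a: min(r + gamma * p @ next_vals for p, r in scen)
             for a, scen in uncertainty[s].items()}   # worst case per action
        policy[s] = max(q, key=q.get)                 # sequentially robust action
        v[s] = q[policy[s]]                           # sequentially robust value
print(v, policy)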

38 A policy $\tilde\pi^*$ is a sequentially robust policy w.r.t. $\bigotimes_s \mathcal{P}_s$ if ∀s ∈ S, $\tilde\pi^*_s$ is a sequentially robust action. [sent-112, score-0.505]

39 A standard game theoretic argument implies that sequentially robust actions, and hence sequentially robust policies, exist. [sent-113, score-1.01]

40 Indeed, from the robust MDP literature (cf. [5, 7, 8]) it is easy to see that a sequentially robust policy is the solution to the robust MDP whose uncertainty set is $\bigotimes_s \mathcal{P}_s$. [sent-114, score-1.354]

41 The following theorem, which is the main result of this paper, shows that any sequentially robust policy (w.r.t. [sent-115, score-0.678]

42 a specific uncertainty set) π ∗ is distributionally robust. [sent-118, score-0.595]

43 Let Assumption 1 hold, and suppose that π∗ is a sequentially robust policy w.r.t. [sent-121, score-0.678]

44 $\bigotimes_s \hat{\mathcal{P}}_s$, where $\hat{\mathcal{P}}_s = \big\{\sum_{i=1}^{n} (\lambda_i - \lambda_{i-1})\,(r_s(i), p_s(i)) \,\big|\, (p_s(i), r_s(i)) \in \mathcal{P}_s^i\big\}$. [sent-124, score-0.614]
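
A small sketch of how the aggregated set mixes one element from each nested level with weight $\lambda_i - \lambda_{i-1}$, assuming (purely for illustration) that each $\mathcal{P}_s^i$ is given by finitely many candidate (p, r) pairs; the nestedness of the levels is not enforced here and all numbers are made up.

```python
import itertools
import numpy as np

lambdas = [0.0, 0.6, 1.0]           # lambda_0 = 0 by convention, lambda_n = 1
weights = np.diff(lambdas)          # (lambda_i - lambda_{i-1})

# candidate (p, r) pairs for P_s^1 (inner, high confidence) and P_s^2 (outer)
P = [
    [(np.array([0.7, 0.3]), 1.0), (np.array([0.6, 0.4]), 0.9)],   # P_s^1
    [(np.array([0.9, 0.1]), 1.5), (np.array([0.2, 0.8]), 0.2)],   # P_s^2
]

hat_P = []
for choice in itertools.product(*P):               # one element from each P_s^i
    p_mix = sum(w * p for w, (p, r) in zip(weights, choice))
    r_mix = sum(w * r for w, (p, r) in zip(weights, choice))
    hat_P.append((p_mix, r_mix))

for p, r in hat_P:
    print(np.round(p, 3), round(r, 3))
```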

45 π ∗ is a distributionally robust policy with respect to Cs ; and 2. [sent-126, score-0.95]

46 Therefore, to find the sequentially robust policy, we only need to compute the sequentially robust action at each state. [sent-129, score-1.01]

47 For s ∈ St where t < T, the sequentially robust action is given by $q^* = \arg\max_{q \in \Delta(s)} \sum_{i=1}^{n} (\lambda_i - \lambda_{i-1}) \min_{(p_s^i, r_s^i) \in \mathcal{P}_s^i} \big[(r_s^i)^\top q + (p_s^i)^\top \tilde V_s q\big]$, (2) where $m = |A_s|$, $\tilde v_{t+1}$ is the vector form of $\tilde v_{t+1}(s')$ for all $s' \in S_{t+1}$, and $\tilde V_s \triangleq \begin{pmatrix} \tilde v_{t+1} e_1(m)^\top \\ \vdots \\ \tilde v_{t+1} e_m(m)^\top \end{pmatrix}$. [sent-132, score-0.56]

48 Theorem 2 implies that the computation of the sequentially robust action at a state s critically depends on the structure of the sets $\mathcal{P}_s^i$. [sent-133, score-0.612]
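
When each level of the nested uncertainty set is a finite list of scenarios, the max-min defining the sequentially robust action becomes a linear program over the simplex: introduce one epigraph variable per level and maximize their weighted sum. The sketch below follows that idea; the per-action worst-case value is taken as r(s, a) + gamma * p(a, .) . v_next as in sentence 37, and the scenario data, problem sizes, and use of scipy are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.optimize import linprog

gamma, v_next = 0.95, np.array([1.0, 0.0])     # values of next-stage states
weights = np.diff([0.0, 0.6, 1.0])             # lambda_i - lambda_{i-1}

# P[i] = list of scenarios for confidence level i; each scenario is
#   (transitions[a, s'], rewards[a]) with 2 actions and 2 next states (made up)
P = [
    [(np.array([[0.8, 0.2], [0.4, 0.6]]), np.array([1.0, 1.2]))],
    [(np.array([[0.6, 0.4], [0.1, 0.9]]), np.array([0.8, 1.5])),
     (np.array([[0.3, 0.7], [0.5, 0.5]]), np.array([0.5, 0.7]))],
]
m, n = 2, len(P)                               # |A_s|, number of nested levels

# LP variables z = [q_1..q_m, t_1..t_n]; maximize sum_i weights[i] * t_i
c = np.concatenate([np.zeros(m), -weights])    # linprog minimizes, so negate
A_ub, b_ub = [], []
for i, scenarios in enumerate(P):
    for trans, rew in scenarios:
        Q = rew + gamma * trans @ v_next       # per-action value under scenario
        row = np.zeros(m + n)
        row[:m], row[m + i] = -Q, 1.0          # constraint t_i <= q . Q
        A_ub.append(row); b_ub.append(0.0)
A_eq = [np.concatenate([np.ones(m), np.zeros(n)])]   # q sums to one
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * m + [(None, None)] * n)
q_star = res.x[:m]
print("sequentially robust action:", np.round(q_star, 3))
```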

49 In fact, it can be shown that for “good” uncertainty sets, computing the sequentially robust action is tractable. [sent-134, score-0.664]

50 The sequentially robust action for state s can be found in polynomial time if, for each i = 1, · · · , n, $\mathcal{P}_s^i$ has a polynomial-time separation oracle. [sent-138, score-0.633]

51 Thus the distributionally robust MDP reduces to the robust MDP with $\bigotimes_{s \in S} \hat{\mathcal{P}}_s$ being the uncertainty set. [sent-144, score-1.167]

52 Finally, by applying results from robust MDPs we prove the theorem. [sent-145, score-0.286]

53 Let ht denote a history up to stage t and s(ht ) denote the last state of history ht . [sent-147, score-0.535]

54 We use πht (a) to represent the probability of choosing an action a at state s(ht ), following a policy π and under a history ht . [sent-148, score-0.504]

55 A t + 1 stage history, with ht followed by action a and state s′ is written as (ht , a, s′ ). [sent-149, score-0.334]

56 With an abuse of notation, we denote the expected reward-to-go under a history as $u(\pi, p, r, h_t) \triangleq E_p^{\pi}\big\{\sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i) \,\big|\, (s_1, a_1, \cdots, s_t) = h_t\big\}$. [sent-150, score-0.488]

57 For $\pi \in \Pi^{HR}$ and $\mu \in C_S(\lambda)$, define $w(\pi, \mu, h_t) \triangleq E_{(p,r)\sim\mu}\big[u(\pi, p, r, h_t)\big] = \int u(\pi, p, r, h_t)\, d\mu(p, r)$. [sent-151, score-0.192]

58 Fix $\pi \in \Pi^{HR}$, $\mu \in C_S$ and a history $h_t$ where t < T, and denote $\bar r = E_\mu(r)$, $\bar p = E_\mu(p)$; then we have: $w(\pi, \mu, h_t) = \sum_{a \in A_{s(h_t)}} \pi_{h_t}(a)\big[\bar r(s(h_t), a) + \gamma \sum_{s' \in S} \bar p(s' \mid s(h_t), a)\, w(\pi, \mu, (h_t, a, s'))\big] = \int \sum_{a \in A_{s(h_t)}} \pi_{h_t}(a)\big[r(s(h_t), a) + \gamma \sum_{s' \in S} p(s' \mid s(h_t), a)\, w(\pi, \mu, (h_t, a, s'))\big]\, d\mu_{s(h_t)}(p_{s(h_t)}, r_{s(h_t)})$. [sent-155, score-0.416]

59 From Lemma 1, by backward induction, one can show that the following lemma holds, which essentially means that for any policy, the expected performance under an admissible distribution µ only depends on the expected value of the parameters under µ. [sent-156, score-0.213]

60 Thus, the distributionally robust MDP reduces to a robust MDP. [sent-157, score-1.063]

61 We complete the proof of Theorem 1 using the equivalence of distributionally robust MDPs and robust MDPs where the uncertainty set is $\bigotimes_{s \in S} \hat{\mathcal{P}}_s$. [sent-165, score-1.063]

62 It is well known that for robust MDPs, a saddle point of the minimax objective exists (cf [5, 8]). [sent-167, score-0.343]

63 Moreover, π∗ and (p∗, r∗) can be constructed state-wise: $\pi^* = \bigotimes_{s \in S} \pi_s^*$ and $(p^*, r^*) = \bigotimes_{s \in S} (p_s^*, r_s^*)$, and for each s ∈ St, $\pi_s^*, (p_s^*, r_s^*)$ solve the following zero-sum game: $\max_{\pi_s} \min_{(p_s, r_s) \in \hat{\mathcal{P}}_s} E_{p_s}^{\pi_s}\big[r(s, a) + \gamma \tilde v_{t+1}(s')\big]$. [sent-169, score-0.248]

64 It follows that $\pi_s^*$ can be any sequentially robust action, and hence π∗ can be any sequentially robust policy. [sent-170, score-1.01]

65 4 Distributionally robust MDP: The discounted reward infinite horizon case. [sent-180, score-0.42]

66 In this section we show how to compute a distributionally robust policy for infinite horizon MDPs. [sent-181, score-1.024]

67 Specifically, we generalize the notion of sequentially robust policies to discounted-reward infinite-horizon MDPs, and show that they are distributionally robust in an appropriate sense. [sent-182, score-1.351]

68 A policy $\tilde\pi^*$ is a sequentially robust policy w.r.t. $\bigotimes_s \mathcal{P}_s$ if ∀s ∈ S, $\tilde\pi^*_s$ is a sequentially robust action. [sent-200, score-0.505]

69 The sequentially robust policy is well defined, since the following operator $L : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ is a γ-contraction in the $\|\cdot\|_\infty$ norm. [sent-201, score-0.678]

70 Hence, by applying L repeatedly to any initial $v_0 \in \mathbb{R}^{|S|}$, the resulting value vector converges to the sequentially robust value $\tilde v$ exponentially fast. [sent-204, score-0.505]
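
A minimal sketch of this fixed-point iteration, again assuming (not from the paper) finite scenario sets per state-action pair and deterministic actions; the numbers are illustrative.

```python
import numpy as np

gamma = 0.9
# scenarios[s][a] = list of (transition probs over states, reward), made up
scenarios = [
    {0: [(np.array([0.9, 0.1]), 1.0), (np.array([0.7, 0.3]), 0.8)],
     1: [(np.array([0.2, 0.8]), 0.0)]},
    {0: [(np.array([0.5, 0.5]), 2.0), (np.array([0.4, 0.6]), 1.5)]},
]

def L(v):
    """Robust Bellman operator: a gamma-contraction in the sup norm."""
    return np.array([max(min(r + gamma * p @ v for p, r in scen)
                         for scen in s_a.values())
                     for s_a in scenarios])

v = np.zeros(len(scenarios))
for _ in range(200):                           # value iteration
    v_new = L(v)
    if np.max(np.abs(v_new - v)) < 1e-10:      # sup-norm convergence check
        break
    v = v_new
print(np.round(v, 4))
```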

71 Therefore, we consider an equivalent MDP with an augmented state space, where each augmented state is a pair (s, t) with s ∈ S and stage index t, meaning state s at the t-th decision epoch. [sent-208, score-0.224]

72 That is, if a state s is visited multiple times, then each time the distribution (of uncertain parameters) µs is the same. [sent-212, score-0.233]

73 s∈S,t=1,2,··· The next theorem is the main result of this section; it shows that a sequentially robust policy is distributionally robust to both stationary and non-stationary models. [sent-215, score-1.509]

74 Given T = ∞ and γ < 1, any sequentially robust policy w.r.t. [sent-217, score-0.678]

75 $\bigotimes_s \hat{\mathcal{P}}_s$, where $\hat{\mathcal{P}}_s = \big\{\sum_{i=1}^{n} (\lambda_i - \lambda_{i-1})\,(r_s(i), p_s(i)) \,\big|\, (p_s(i), r_s(i)) \in \mathcal{P}_s^i\big\}$, is distributionally robust with respect to both $C_S^\infty$ and $\bar C_S^\infty$. [sent-220, score-1.391]

76 We consider a $\hat T$-stage problem, i.e., a finite horizon problem that stops at stage $\hat T$ with a termination reward $\tilde v_\infty(\cdot)$, and show that the optimal strategy for this problem, which is a sequentially robust strategy, coincides with that of the infinite horizon one. [sent-224, score-0.506]

77 Indeed, given any sequentially robust strategy π∗, one can construct a stationary distribution $\bar\mu^*$ such that $(\pi^*, \bar\mu^*)$ is a saddle point of $\sup_{\pi \in \Pi^{HR}} \inf_{\bar\mu \in \bar C_S^\infty} \int u(\pi, p, r)\, d\bar\mu(p, r)$. [sent-225, score-0.394]

78 We remark that the decision maker is allowed to use non-stationary strategies, although the distributionally robust solution is proven to be stationary. [sent-227, score-0.913]

79 These two formulations model different setups: if the system, more specifically the distribution of uncertain parameters, evolves with time, then the non-stationary model is more appropriate; while if the system is static, then the stationary model is preferable. [sent-229, score-0.198]

80 We consider a path planning problem: an agent wants to exit a 4 × 21 maze (shown in Figure 1) using the least possible time. [sent-236, score-0.226]

81 The first one is the uncertain cost case, where the true (yet unknown to the planning agent) time for the agent to pass through a “shaky” place equals $\tilde x = 1 + e(\lambda)$, where $e(\lambda)$ is an exponentially distributed random variable with parameter λ. [sent-241, score-0.296]

82 …, 1) as the parameter; the robust approach takes [1, 1 + 3/λ] as the uncertainty set; and the distributionally robust approach takes into account the additional information that Pr(x ∈ [1, 1 + log 2/λ]) ≥ 0.5. [sent-244, score-0.202]
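
As a quick check of the 0.5 confidence level above (the original sentence was truncated at "0."; assuming, as stated, that e(λ) is exponential with parameter λ, so that log 2/λ is its median):
\[
\Pr\bigl(x \in [1,\ 1 + \tfrac{\log 2}{\lambda}]\bigr)
  = \Pr\bigl(e(\lambda) \le \tfrac{\log 2}{\lambda}\bigr)
  = 1 - e^{-\lambda \cdot \log 2 / \lambda}
  = 1 - \tfrac{1}{2}
  = 0.5 .
\]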

83 The second case is the uncertain transition case: if an agent reaches a “shaky” place, then the transition becomes unpredictable – in the next step with probability 20% it will make an (unknown) jump. [sent-250, score-0.288]

84 The distributionally robust approach takes into account the additional information that if a jump happens, the probability that it jumps to a spot to the left of the current place is no more than γ. [sent-255, score-0.839]

85 Each policy is tested over 300 runs, while the true jump is set as with probability 0. [sent-256, score-0.197]

86-87 [Figure residue: plots of time to exit for the nominal, robust, and distributionally robust policies under (a) the uncertain cost case and (b) the uncertain transition case.]

88 In both the uncertain cost and the uncertain transition probability setups, the distributionally robust approach outperforms the other two approaches over virtually the whole range of parameters. [sent-274, score-1.106]

89 This is well expected, since additional probabilistic information is available to and incorporated by the distributionally robust approach. [sent-275, score-0.799]

90 6 Concluding remarks In this paper we proposed a distributionally robust approach to mitigate the conservatism of the robust MDP framework and incorporate additional a-priori probabilistic information regarding the unknown parameters. [sent-276, score-1.109]

91 We proposed to find a policy that achieves maximum expected utility under the worst admissible distribution of the parameters. [sent-278, score-0.307]

92 Such a formulation leads to a policy that is obtained through a Bellman-type backward induction and can be computed in polynomial time under mild technical conditions. [sent-279, score-0.249]

93 A different perspective on our work is that we develop a principled approach to the problem of uncertainty set design in multi-stage decision problems. [sent-280, score-0.17]

94 We provide a principled approach to the problem of uncertainty set selection: the distributionally robust policy is a robust policy w.r.t. [sent-282, score-1.513]

95 A natural question is: how can we take advantage of the distributionally robust approach and solve (exactly) a full-blown Bayesian generative model MDP? [sent-286, score-0.777]

96 The difficulty in doing so (i.e., increasing n) is that of representation: the equivalent robust MDP uncertainty set may become too complicated to represent efficiently. [sent-289, score-0.39]

97 Nevertheless, if it is possible to offer upper and lower bounds on the probability of each nested set (based on the generative model), the corresponding distributionally robust policies provide performance bounds on the optimal policies in the (often intractable) Bayesian model. [sent-290, score-0.946]

98 Robust control of Markov decision processes with uncertain transition matrices. [sent-325, score-0.269]

99 Robustness in Markov decision problems with uncertain transition matrices. [sent-354, score-0.251]

100 Distributional robust optimization under moment uncertainty with applications to data-driven problems. [sent-396, score-0.39]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('distributionally', 0.491), ('ps', 0.49), ('cs', 0.292), ('robust', 0.286), ('sequentially', 0.219), ('ht', 0.192), ('policy', 0.173), ('mdp', 0.149), ('uncertain', 0.144), ('rs', 0.124), ('hr', 0.123), ('sini', 0.115), ('uncertainty', 0.104), ('nominal', 0.076), ('storm', 0.076), ('horizon', 0.074), ('maker', 0.07), ('policies', 0.069), ('decision', 0.066), ('eps', 0.063), ('agent', 0.062), ('admissible', 0.062), ('mdps', 0.061), ('action', 0.055), ('state', 0.052), ('planning', 0.049), ('inf', 0.048), ('exit', 0.047), ('vt', 0.045), ('nilim', 0.043), ('shaky', 0.043), ('transition', 0.041), ('imprecise', 0.038), ('maxmin', 0.038), ('distributional', 0.038), ('expected', 0.038), ('reward', 0.037), ('visited', 0.037), ('cf', 0.037), ('stage', 0.035), ('stationary', 0.035), ('st', 0.034), ('utility', 0.034), ('history', 0.032), ('minimax', 0.032), ('nested', 0.031), ('passing', 0.03), ('belong', 0.03), ('sup', 0.03), ('mannor', 0.03), ('backward', 0.029), ('lemma', 0.029), ('el', 0.029), ('normalcy', 0.029), ('markov', 0.028), ('operations', 0.027), ('formulation', 0.026), ('saddle', 0.025), ('dence', 0.025), ('augmented', 0.025), ('adversarial', 0.024), ('path', 0.024), ('unknown', 0.024), ('jump', 0.024), ('states', 0.023), ('delage', 0.023), ('shie', 0.023), ('wants', 0.023), ('termed', 0.023), ('discounted', 0.023), ('bias', 0.022), ('probabilistic', 0.022), ('nitely', 0.021), ('polynomial', 0.021), ('belongs', 0.021), ('spot', 0.021), ('maze', 0.021), ('nance', 0.021), ('fix', 0.02), ('induction', 0.019), ('formulations', 0.019), ('inventory', 0.019), ('theorem', 0.019), ('concluding', 0.018), ('etc', 0.018), ('tth', 0.018), ('setups', 0.018), ('stands', 0.018), ('xu', 0.018), ('stochastic', 0.018), ('listed', 0.018), ('processes', 0.018), ('place', 0.017), ('visits', 0.017), ('parameters', 0.017), ('mathematical', 0.017), ('nite', 0.017), ('randomized', 0.016), ('precisely', 0.016), ('ranked', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 64 nips-2010-Distributionally Robust Markov Decision Processes

Author: Huan Xu, Shie Mannor

Abstract: We consider Markov decision processes where the values of the parameters are uncertain. This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level so that a set of admissible probability distributions of the unknown parameters is specified. This formulation models the case where the decision maker is aware of and wants to exploit some (yet imprecise) a-priori information of the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. We propose a decision criterion based on distributional robustness: the optimal policy maximizes the expected total reward under the most adversarial probability distribution over realizations of the uncertain parameters that is admissible (i.e., it agrees with the a-priori information). We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set, hence it can be computed in polynomial time under mild technical conditions.

2 0.18774842 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Author: Tang Jie, Pieter Abbeel

Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1

3 0.15387532 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári

Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1

4 0.14326638 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

Author: Surya Ganguli, Haim Sompolinsky

Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1

5 0.13824965 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

Author: David Sontag, Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: The problem of learning to predict structured labels is of key importance in many applications. However, for general graph structure both learning and inference are intractable. Here we show that it is possible to circumvent this difficulty when the distribution of training examples is rich enough, via a method similar in spirit to pseudo-likelihood. We show that our new method achieves consistency, and illustrate empirically that it indeed approaches the performance of exact methods when sufficiently large training sets are used. Many prediction problems in machine learning applications are structured prediction tasks. For example, in protein folding we are given a protein sequence and the goal is to predict the protein’s native structure [14]. In parsing for natural language processing, we are given a sentence and the goal is to predict the most likely parse tree [2]. In these and many other applications, we can formalize the structured prediction problem as taking an input x (e.g., primary sequence, sentence) and predicting ˆ y (e.g., structure, parse) according to y = arg maxy∈Y θ · φ(x, y ), where φ(x, y) is a function that ˆ maps any input and a candidate assignment to a feature vector, Y denotes the space of all possible assignments to the vector y, and θ is a weight vector to be learned. This paper addresses the problem of learning structured prediction models from data. In particular, given a set of labeled examples {(xm , y m )}M , our goal is to find a vector θ such that for each m=1 example m, y m = arg maxy∈Y θ · φ(xm , y), i.e. one which separates the training data. For many structured prediction models, maximization over Y is computationally intractable. This makes it difficult to apply previous algorithms for learning structured prediction models, such as structured perceptron [2], stochastic subgradient [10], and cutting-plane algorithms [5], which require making a prediction at every iteration (equivalent to repeatedly solving an integer linear program). Given training data, we can consider the space of parameters Θ that separate the data. This space can be defined by the intersection of a large number of linear inequalities. A recent approach to getting around the hardness of prediction is to use linear programming (LP) relaxations to approximate the maximization over Y [4, 6, 9]. However, separation with respect to a relaxation places stronger constraints on the parameters. The target solution, an integral vertex in the LP, must now distinguish itself also from possible fractional vertexes that arise due to the relaxation. The relaxations can therefore be understood as optimizing over an inner bound of Θ. This set may be empty even if the training data is separable with exact inference [6]. Another obstacle to using LP relaxations for learning is that solving the LPs can be very slow. In this paper we ask whether it is possible to learn while avoiding inference altogether. We propose a new learning algorithm, inspired by pseudo-likelihood [1], that optimizes over an outer bound of Θ. Learning involves optimizing over only a small number of constraints per data point, and thus can be performed quickly, even for complex structured prediction models. We show that, if the data available for learning is “nice”, this algorithm is consistent, i.e. it will find some θ ∈ Θ. This is an example of how having the right data can circumvent the hardness of learning for structured prediction. 1 We also investigate the limitations of the proposed method. 
We show that the problem of even deciding whether a given data set is separable is NP-hard, and thus learning in a strict sense is no easier than prediction. Thus, we should not expect for our algorithm, or any other polynomial time algorithm, to always succeed at learning from an arbitrary finite data set. To our knowledge, this is the first result characterizing the hardness of exact learning for structured prediction. Finally, we show empirically that our algorithm allows us to successfully learn the parameters for both multi-label prediction and protein side-chain placement. The performance of the algorithm is improved as more data becomes available, as our theoretical results anticipate. 1 Pseudo-Max method We consider the general structured prediction problem. The input space is denoted by X and the set of all possible assignments by Y. Each y ∈ Y corresponds to n variables y1 , . . . , yn , each with k possible states. The classifier uses a (given) function φ(x, y) : X , Y → Rd and (learned) weights θ ∈ Rd , and is defined as y(x; θ) = arg maxy∈Y f (ˆ ; x, θ) where f is the discriminant function y ˆ f (y; x, θ) = θ · φ(x, y). Our analysis will focus on functions φ whose scope is limited to small sets of the yi variables, but for now we keep the discussion general. Given a set of labeled examples {(xm , y m )}M , the goal of the typical learning problem is to find m=1 weights θ that correctly classify the training examples. Consider first the separable case. Define the set of separating weight vectors, Θ = θ | ∀m, y ∈ Y, f (y m ; xm , θ) ≥ f (y; xm , θ)+e(y, y m ) . e is a loss function (e.g., zero-one or Hamming) such that e(y m , y m ) = 0 and e(y, y m ) > 0 for y = y m , which serves to rule out the trivial solution θ = 0.1 The space Θ is defined by exponentially many constraints per example, one for each competing assignment. In this work we consider a much simpler set of constraints where, for each example, we only consider the competing assignments obtained by modifying a single label yi , while fixing the other labels to their value at y m . The pseudo-max set, which is an outer bound on Θ, is given by Here ym −i m Θps = θ | ∀m, i, yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) + e(yi , yi ) . −i denotes the label y m (1) without the assignment to yi . When the data is not separable, Θ will be the empty set. Instead, we may choose to minimize the hinge loss, (θ) = m maxy f (y; xm , θ) − f (y m ; xm , θ) + e(y, y m ) , which can be shown to be an upper bound on the training error [13]. When the data is separable, minθ (θ) = 0. Note that regularization may be added to this objective. The corresponding pseudo-max objective replaces the maximization over all of y with maximization over a single variable yi while fixing the other labels to their value at y m :2,3 M ps (θ) n = m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) . −i yi Analogous to before, we have minθ ps (θ) (2) = 0 if and only if θ ∈ Θps . The objective in Eq. 2 is similar in spirit to pseudo-likelihood objectives used for maximum likelihood estimation of parameters of Markov random fields (MRFs) [1]. The pseudo-likelihood estimate is provably consistent when the data generating distribution is a MRF of the same structure as used in the pseudo-likelihood objective. However, our setting is different since we only get to view the maximizing assignment of the MRF rather than samples from it. 
Thus, a particular x will always be paired with the same y rather than samples y drawn from the conditional distribution p(y|x; θ). The pseudo-max constraints in Eq. 1 are also related to cutting plane approaches to inference [4, 5]. In the latter, the learning problem is solved by repeatedly looking for assignments that violate the separability constraint (or its hinge version). Our constraints can be viewed as using a very small 1 An alternative formulation, which we use in the next section, is to break the symmetry by having part of the input not be multiplied by any weight. This will also rule out the trivial solution θ = 0. P 2 It is possible to use maxi instead of i , and some of our consistency results will still hold. 3 The pseudo-max approach is markedly different from a learning method which predicts each label yi independently, since the objective considers all i simultaneously (both at learning and test time). 2 x2 0.2 J ∗ + x1 = 0 y = (0, 1) y = (1, 1) g(J12) x2 = 0 x1 J ∗ + x1 + x2 = 0 y = (0, 0) c1=0 c1=1 c1= 1 0.15 0.1 J + x2 = 0 ∗ 0.05 y = (1, 0) x1 = 0 0 1 0.5 0 J 0.5 1 Figure 1: Illustrations for a model with two variables. Left: Partitioning of X induced by configurations y(x) for some J ∗ > 0. Blue lines carve out the exact regions. Red lines denote the pseudo-max constraints that hold with equality. Pseudo-max does not obtain the diagonal constraint coming from comparing configurations y = (1, 1) and (0, 0), since these differ by more than one coordinate. Right: One strictly-convex component of the ps (J ) function (see Eq. 9). The function is shown for different values of c1 , the mean of the x1 variable. subset of assignments for the set of candidate constraint violators. We also note that when exact maximization over the discriminant function f (y; x, θ) is hard, the standard cutting plane algorithm cannot be employed since it is infeasible to find a violated constraint. For the pseudo-max objective, finding a constraint violation is simple and linear in the number of variables.4 It is easy to see (as will be elaborated on next) that the pseudo-max method does not in general yield a consistent estimate of θ, even in the separable case. However, as we show, consistency can be shown to be achieved under particular assumptions on the data generating distribution p(x). 2 Consistency of the Pseudo-Max method In this section we show that if the feature generating distribution p(x) satisfies particular assumptions, then the pseudo-max approach yields a consistent estimate. In other words, if the training data is of the form {(xm , y(xm ; θ ∗ ))}M for some true parameter vector θ ∗ , then as M → ∞ the m=1 minimum of the pseudo-max objective will converge to θ ∗ (up to equivalence transformations). The section is organized as follows. First, we provide intuition for the consistency results by considering a model with only two variables. Then, in Sec. 2.1, we show that any parameter θ ∗ can be identified to within arbitrary accuracy by choosing a particular training set (i.e., choice of xm ). This in itself proves consistency, as long as there is a non-zero probability of sampling this set. In Sec. 2.2 we give a more direct proof of consistency by using strict convexity arguments. For ease of presentation, we shall work with a simplified instance of the structured learning setting. 
We focus on binary variables, yi ∈ {0, 1}, and consider discriminant functions corresponding to Ising models, a special case of pairwise MRFs (J denotes the vector of “interaction” parameters): f (y; x, J ) = ij∈E Jij yi yj + i yi xi (3) The singleton potential for variable yi is yi xi and is not dependent on the model parameters. We could have instead used Ji yi xi , which would be more standard. However, this would make the parameter vector J invariant to scaling, complicating the identifiability analysis. In the consistency analysis we will assume that the data is generated using a true parameter vector J ∗ . We will show that as the data size goes to infinity, minimization of ps (J ) yields J ∗ . We begin with an illustrative analysis of the pseudo-max constraints for a model with only two variables, i.e. f (y; x, J) = Jy1 y2 + y1 x1 + y2 x2 . The purpose of the analysis is to demonstrate general principles for when pseudo-max constraints may succeed or fail. Assume that training samples are generated via y(x) = argmaxy f (y; x, J ∗ ). We can partition the input space X into four regions, ˆ ˆ {x ∈ X : y(x) = y } for each of the four configurations y , shown in Fig. 1 (left). The blue lines outline the exact decision boundaries of f (y; x, J ∗ ), with the lines being given by the constraints 4 The methods differ substantially in the non-separable setting where we minimize ps (θ), using a slack variable for every node and example, rather than just one slack variable per example as in (θ). 3 in Θ that hold with equality. The red lines denote the pseudo-max constraints in Θps that hold with equality. For x such that y(x) = (1, 0) or (0, 1), the pseudo-max and exact constraints are identical. We can identify J ∗ by obtaining samples x = (x1 , x2 ) that explore both sides of one of the decision boundaries that depends on J ∗ . The pseudo-max constraints will fail to identify J ∗ if the samples do not sufficiently explore the transitions between y = (0, 1) and y = (1, 1) or between y = (1, 0) and y = (1, 1). This can happen, for example, when the input samples are dependent, giving only rise to the configurations y = (0, 0) and y = (1, 1). For points labeled (1, 1) around the decision line J ∗ + x1 + x2 = 0, pseudo-max can only tell that they respect J ∗ + x1 ≥ 0 and J ∗ + x2 ≥ 0 (dashed red lines), or x1 ≤ 0 and x2 ≤ 0 for points labeled (0, 0). Only constraints that depend on the parameter are effective for learning. For pseudo-max to be able to identify J ∗ , the input samples must be continuous, densely populating the two parameter dependent decision lines that pseudo-max can use. The two point sets in the figure illustrate good and bad input distributions for pseudo-max. The diagonal set would work well with the exact constraints but badly with pseudo-max, and the difference can be arbitrarily large. However, the input distribution on the right, populating the J ∗ + x2 = 0 decision line, would permit pseudo-max to identify J ∗ . 2.1 Identifiability of True Parameters In this section, we show that it is possible to approximately identify the true model parameters, up to model equivalence, using the pseudo-max constraints and a carefully chosen linear number of data points. Consider the learning problem for structured prediction defined on a fixed graph G = (V, E) where the parameters to be learned are pairwise potential functions θij (yi , yj ) for ij ∈ E and single node fields θi (yi ) for i ∈ V . 
We consider discriminant functions of the form f (y; x, θ) = ij∈E θij (yi , yj ) + i θi (yi ) + i xi (yi ), (4) where the input space X = R|V |k specifies the single node potentials. Without loss of generality, we remove the additional degrees of freedom in θ by restricting it to be in a canonical form: θ ∈ Θcan if for all edges θij (yi , yj ) = 0 whenever yi = 0 or yj = 0, and if for all nodes, θi (yi ) = 0 when yi = 0. As a result, assuming the training set comes from a model in this class, and the input fields xi (yi ) exercise the discriminant function appropriately, we can hope to identify θ ∗ ∈ Θcan . Indeed, we show that, for some data sets, the pseudo-max constraints are sufficient to identify θ ∗ . Let Θps ({y m , xm }) be the set of parameters that satisfy the pseudo-max classification constraints m Θps ({y m , xm }) = θ | ∀m, i, yi = yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) . −i (5) m e(yi , yi ), For simplicity we omit the margin losses since the input fields xi (yi ) already suffice to rule out the trivial solution θ = 0. Proposition 2.1. For any θ ∗ ∈ Θcan , there is a set of 2|V |(k − 1) + 2|E|(k − 1)2 examples, {xm , y(xm ; θ ∗ )}, such that any pseudo-max consistent θ ∈ Θps ({y m , xm }) ∩ Θcan is arbitrarily close to θ ∗ . The proof is given in the supplementary material. To illustrate the key ideas, we consider the simpler binary discriminant function discussed in Eq. 3. Note that the binary model is already in the canonical form since Jij yi yj = 0 whenever yi = 0 or yj = 0. For any ij ∈ E, we show how to choose two input examples x1 and x2 such that any J consistent with the pseudo-max constraints for these ∗ ∗ two examples will have Jij ∈ [Jij − , Jij + ]. Repeating this for all of the edge parameters then gives the complete set of examples. The input examples we need for this will depend on J ∗ . For the first example, we set the input fields for all neighbors of i (except j) in such a way that ∗ we force the corresponding labels to be zero. More formally, we set x1 < −|N (k)| maxl |Jkl | for k 1 k ∈ N (i)\j, resulting in yk = 0, where y 1 = y(x1 ). In contrast, we set x1 to a large value, e.g. j ∗ 1 ∗ x1 > |N (j)| maxl |Jjl |, so that yj = 1. Finally, for node i, we set x1 = −Jij + so as to obtain a j i 1 slight preference for yi = 1. All other input fields can be set arbitrarily. As a result, the pseudo-max constraints pertaining to node i are f (y 1 ; x1 , J ) ≥ f (y 1 , yi ; x1 , J ) for yi = 0, 1. By taking into −i 1 account the label assignments for yi and its neighbors, and by removing terms that are the same on both sides of the equation, we get Jij + x1 + x1 ≥ Jij yi + yi x1 + x1 , which, for yi = 0, implies i j i j ∗ that Jij + x1 ≥ 0 or Jij − Jij + ≥ 0. The second example x2 differs only in terms of the input i ∗ 2 ∗ field for i. In particular, we set x2 = −Jij − so that yi = 0. This gives Jij ≤ Jij + , as desired. i 4 2.2 Consistency via Strict Convexity In this section we prove the consistency of the pseudo-max approach by showing that it corresponds to minimizing a strictly convex function. Our proof only requires that p(x) be non-zero for all x ∈ Rn (a simple example being a multi-variate Gaussian) and that J ∗ is finite. We use a discriminant function as in Eq. 3. Now, assume the input points xm are distributed according to p(x) and that y m are obtained via y m = arg maxy f (y; xm , J ∗ ). 
We can write the ps (J ) objective for finite data, and its limit when M → ∞, compactly as: 1 m m = max (yi − yi ) xm + Jki yk ps (J ) i M m i yi k∈N (i) p(x) max (yi − yi (x)) xi + → yi i Jki yk (x) dx (6) k∈N (i) ∗ where yi (x) is the label of i for input x when using parameters J . Starting from the above, consider the terms separately for each i. We partition the integral over x ∈ Rn into exclusive regions according to the predicted labels of the neighbors of i (given x). Define Sij = {x : yj (x) = 1 and yk (x) = 0 for k ∈ N (i)\j}. Eq. 6 can then be written as ps (J ) = gi ({Jik }k∈N (i) ) + ˆ i gik (Jik ) , (7) k∈N (i) where gik (Jik ) = x∈Sik p(x) maxyi [(yi −yi (x))(xi +Jik )]dx and gi ({Jik }k∈N (i) ) contains all of ˆ the remaining terms, i.e. where either zero or more than one neighbor is set to one. The function gi ˆ is convex in J since it is a sum of integrals over convex functions. We proceed to show that gik (Jik ) is strictly convex for all choices of i and k ∈ N (i). This will show that ps (J ) is strictly convex since it is a sum over functions strictly convex in each one of the variables in J . For all values xi ∈ (−∞, ∞) there is some x in Sij . This is because for any finite xi and finite J ∗ , the other xj ’s can be chosen so as to give the y configuration corresponding to Sij . Now, since p(x) has full support, we have P (Sij ) > 0 and p(x) > 0 for any x in Sij . As a result, this also holds for the marginal pi (xi |Sij ) over xi within Sij . After some algebra, we obtain: gij (Jij ) = P (Sij ) ∞ p(x)yi (x)(xi + Jij )dx pi (xi |Sij ) max [0, xi + Jij ] dxi − −∞ x∈Sij The integral over the yi (x)(xi + Jij ) expression just adds a linear term to gij (Jij ). The relevant remaining term is (for brevity we drop P (Sij ), a strictly positive constant, and the ij index): h(J) = ∞ pi (xi |Sij ) max [0, xi + J] dxi = −∞ ∞ ˆ pi (xi |Sij )h(xi , J)dxi (8) −∞ ˆ ˆ where we define h(xi , J) = max [0, xi + J]. Note that h(J) is convex since h(xi , J) is convex in J for all xi . We want to show that h(J) is strictly convex. Consider J < J and α ∈ (0, 1) and define ˆ ˆ the interval I = [−J, −αJ − (1 − α)J ]. For xi ∈ I it holds that: αh(xi , J) + (1 − α)h(xi , J ) > ˆ i , αJ + (1 − α)J ) (since the first term is strictly positive and the rest are zero). For all other x, h(x ˆ this inequality holds but is not necessarily strict (since h is always convex in J). We thus have after integrating over x that αh(J) + (1 − α)h(J ) > h(αJ + (1 − α)J ), implying h is strictly convex, as required. Note that we used the fact that p(x) has full support when integrating over I. The function ps (J ) is thus a sum of strictly convex functions in all its variables (namely g(Jik )) plus other convex functions of J , hence strictly convex. We can now proceed to show consistency. By strict convexity, the pseudo-max objective is minimized at a unique point J . Since we know that ps (J ∗ ) = 0 and zero is a lower bound on the value of ps (J ), it follows that J ∗ is the unique minimizer. Thus we have that as M → ∞, the minimizer of the pseudo-max objective is the true parameter vector, and thus we have consistency. As an example, consider the case of two variables y1 , y2 , with x1 and x2 distributed according to ∗ N (c1 , 1), N (0, 1) respectively. Furthermore assume J12 = 0. Then simple direct calculation yields: 2 2 2 c1 + J12 −c1 1 1 √ (9) e−x /2 dx − √ e−c1 /2 + √ e−(J12 +c1 ) /2 2π 2π 2π −J12 −c1 which is indeed a strictly convex function that is minimized at J = 0 (see Fig. 1 for an illustration). 
g(J12 ) = 5 3 Hardness of Structured Learning Most structured prediction learning algorithms use some form of inference as a subroutine. However, the corresponding prediction task is generally NP-hard. For example, maximizing the discriminant function defined in Eq. 3 is equivalent to solving Max-Cut, which is known to be NP-hard. This raises the question of whether it is possible to bypass prediction during learning. Although prediction may be intractable for arbitrary MRFs, what does this say about the difficulty of learning with a polynomial number of data points? In this section, we show that the problem of deciding whether there exists a parameter vector that separates the training data is NP-hard. Put in the context of the positive results in this paper, these hardness results show that, although in some cases the pseudo-max constraints yield a consistent estimate, we cannot hope for a certificate of optimality. Put differently, although the pseudo-max constraints in the separable case always give an outer bound on Θ (and may even be a single point), Θ could be the empty set – and we would never know the difference. Theorem 3.1. Given labeled examples {(xm , y m )}M for a fixed but arbitrary graph G, it is m=1 NP-hard to decide whether there exists parameters θ such that ∀m, y m = arg maxy f (y; xm , θ). Proof. Any parameters θ have an equivalent parameterization in canonical form (see section Sec. 2.1, also supplementary). Thus, the examples will be separable if and only if they are separable by some θ ∈ Θcan . We reduce from unweighted Max-Cut. The Max-Cut problem is to decide, given an undirected graph G, whether there exists a cut of at least K edges. Let G be the same graph as G, with k = 3 states per variable. We construct a small set of examples where a parameter vector will exist that separates the data if and only if there is no cut of K or more edges in G. Let θ be parameters in canonical form equivalent to θij (yi , yj ) = 1 if (yi , yj ) ∈ {(1, 2), (2, 1)}, 0 if yi = yj , and −n2 if (yi , yj ) ∈ {(1, 3), (2, 3), (3, 1), (3, 2)}. We first construct 4n + 8|E| examples, using the technique described in Sec. 2.1 (also supplementary material), which when restricted to the space Θcan , constrain the parameters to equal θ. We then use one more example (xm , y m ) where y m = 3 (every node is in state 3) and, for all i, xm (3) = K−1 and xm (1) = xm (2) = 0. The first i i i n two states encode the original Max-Cut instance, while the third state is used to construct a labeling y m that has value equal to K − 1, and is otherwise not used. Let K ∗ be the value of the maximum cut in G. If in any assignment to the last example there is a variable taking the state 3 and another variable taking the state 1 or 2, then the assignment’s value will be at most K ∗ − n2 , which is less than zero. By construction, the 3 assignment has value K − 1. Thus, the optimal assignment must either be 3 with value K − 1, or some combination of states 1 and 2, which has value at most K ∗ . If K ∗ > K − 1 then 3 is not optimal and the examples are not separable. If K ∗ ≤ K − 1, the examples are separable. This result illustrates the potential difficulty of learning in worst-case graphs. Nonetheless, many problems have a more restricted dependence on the input. For example, in computer vision, edge potentials may depend only on the difference in color between two adjacent pixels. Our results do not preclude positive results of learnability in such restricted settings. 
By establishing hardness of learning, we also close the open problem of relating hardness of inference and learning in structured prediction. If inference problems can be solved in polynomial time, then so can learning (using, e.g., structured perceptron). Thus, when learning is hard, inference must be hard as well. 4 Experiments To evaluate our learning algorithm, we test its performance on both synthetic and real-world datasets. We show that, as the number of training samples grows, the accuracy of the pseudo-max method improves and its speed-up gain over competing algorithms increases. Our learning algorithm corresponds to solving the following, where we add L2 regularization and use a scaled 0-1 loss, m m e(yi , yi ) = 1{yi = yi }/nm (nm is the number of labels in example m): min θ C m nm M nm m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) + θ −i yi 2 . (10) We will compare the pseudo-max method with learning using structural SVMs, both with exact inference and LP relaxations [see, e.g., 4]. We use exact inference for prediction at test time. 6 (a) Synthetic (b) Reuters 0.4 exact LP−relaxation pseudo−max 0.15 Test error Test error 0.2 0.1 0.05 0 1 10 2 10 0.2 0.1 0 1 10 3 10 Train size exact LP−relaxation pseudo−max 0.3 2 10 3 10 4 10 Train size Figure 2: Test error as a function of train size for various algorithms. Subfigure (a) shows results for a synthetic setting, while (b) shows performance on the Reuters data. In the synthetic setting we use the discriminant function f (y; x, θ) = ij∈E θij (yi , yj ) + xi θi (yi ), which is similar to Eq. 4. We take a fully connected graph over n = 10 binary labels. i For a weight vector θ ∗ (sampled once, uniformly in the range [−1, 1], and used for all train/test sets) we generate train and test instances by sampling xm uniformly in the range [−5, 5] and then computing the optimal labels y m = arg maxy∈Y f (y; xm , θ ∗ ). We generate train sets of increasing size (M = {10, 50, 100, 500, 1000, 5000}), run the learning algorithms, and measure the test error for the learned weights (with 1000 test samples). For each train size we average the test error over 10 repeats of sampling and training. Fig. 2(a) shows a comparison of the test error for the three learning algorithms. For small numbers of training examples, the test error of pseudo-max is larger than that of the other algorithms. However, as the train size grows, the error converges to that of exact learning, as our consistency results predict. We also test the performance of our algorithm on a multi-label document classification task from the Reuters dataset [7]. The data consists of M = 23149 training samples, and we use a reduction of the dataset to the 5 most frequent labels. The 5 label variables form a fully connected pairwise graph structure (see [4] for a similar setting). We use random subsamples of increasing size from the train set to learn the parameters, and then measure the test error using 20000 additional samples. For each sample size and learning algorithm, we optimize the trade-off parameter C using 30% of the training data as a hold-out set. Fig. 2(b) shows that for the large data regime the performance of pseudo-max learning gets close to that of the other methods. However, unlike the synthetic setting there is still a small gap, even after seeing the entire train set. 
This could be because the full dataset is not yet large enough to be in the consistent regime (note that exact learning has not flattened either), or because the consistency conditions are not fully satisfied: the data might be non-separable or the support of the input distribution p(x) may be partial. We next apply our method to the problem of learning the energy function for protein side-chain placement, mirroring the learning setup of [14], where the authors train a conditional random field (CRF) using tree-reweighted belief propagation to maximize a lower bound on the likelihood.5 The prediction problem for side-chain placement corresponds to finding the most likely assignment in a pairwise MRF, and fits naturally into our learning framework. There are only 8 parameters to be learned, corresponding to a reweighting of known energy terms. The dataset consists of 275 proteins, where each MRF has several hundred variables (one per residue of the protein) and each variable has on average 20 states. For prediction we use CPLEX’s ILP solver. Fig. 3 shows a comparison of the pseudo-max method and a cutting-plane algorithm which uses an LP relaxation, solved with CPLEX, for finding violated constraints.6 We generate training sets of increasing size (M = {10, 50, 100, 274}), and measure the test error for the learned weights on the remaining examples.7 For M = 10, 50, 100 we average the test error over 3 random train/test splits, whereas for M = 274 we do 1-fold cross validation. We use C = 1 for both algorithms. 5 The authors’ data and results are available from: http://cyanover.fhcrc.org/recomb-2007/ We significantly optimized the cutting-plane algorithm, e.g. including a large number of initial cuttingplanes and restricting the weight vector to be positive (which we know to hold at optimality). 7 Specifically, for each protein we compute the fraction of correctly predicted χ1 and χ2 angles for all residues (except when trivial, e.g. just 1 state). Then, we compute the median of this value across all proteins. 6 7 Time to train (minutes) Test error (χ1 and χ2) 0.27 0.265 pseudo−max LP−relaxation Soft rep 0.26 0.255 0.25 0 50 100 150 200 Train size 250 250 200 pseudo−max LP−relaxation 150 100 50 0 0 50 100 150 200 Train size 250 Figure 3: Training time (for one train/test split) and test error as a function of train size for both the pseudomax method and a cutting-plane algorithm which uses a LP relaxation for inference, applied to the problem of learning the energy function for protein side-chain placement. The pseudo-max method obtains better accuracy than both the LP relaxation and HCRF (given roughly five times more data) for a fraction of the training time. The original weights (“Soft rep” [3]) used for this energy function have 26.7% error across all 275 proteins. The best previously reported parameters, learned in [14] using a Hidden CRF, obtain 25.6% error (their training set included 55 of these 275 proteins, so this is an optimistic estimate). To get a sense of the difficulty of this learning task, we also tried a random positive weight vector, uniformly sampled from the range [0, 1], obtaining an error of 34.9% (results would be much worse if we allowed the weights to be negative). Training using pseudo-max with 50 examples, we learn parameters in under a minute that give better accuracy than the HCRF. The speed-up of training with pseudo-max (using CPLEX’s QP solver) versus cutting-plane is striking. For example, for M = 10, pseudo-max takes only 3 seconds, a 1000-fold speedup. 
Since the data used in learning for protein side-chain placement are both highly non-separable and relatively scarce, these positive results illustrate the potential widespread applicability of the pseudo-max method.

5 Discussion

The key idea of our method is to find parameters that prefer the true assignment y^m over assignments that differ from it in only one variable, rather than over all other assignments. Perhaps surprisingly, this weak requirement is sufficient to achieve consistency given a rich enough input distribution. One extension of our approach is to add constraints for assignments that differ from y^m in more than one variable. This would tighten the outer bound on Θ and possibly result in improved performance, but it would also increase computational complexity. We could also add such competing assignments via a cutting-plane scheme, so that optimization is performed only over a subset of these constraints.

Our work raises a number of important open problems. It would be interesting to derive generalization bounds to understand the convergence rate of our method, as well as to understand the effect of the distribution p(x) on these rates. The distribution p(x) needs to have two key properties. On the one hand, it needs to explore the space Y, in the sense that a sufficient number of labels need to be obtained as the correct label for the true parameters (this is indeed used in our consistency proofs). On the other hand, p(x) needs to be sufficiently sensitive close to the decision boundaries so that the true parameters can be inferred. We expect that generalization analysis will depend on these two properties of p(x). Note that [11] studied active learning schemes for structured data and may be relevant in the current context.

How should one apply this learning algorithm to non-separable datasets? We suggested one approach, based on using a hinge loss for each of the pseudo constraints. One question in this context is: how resilient is this learning algorithm to label noise? Recent work has analyzed the sensitivity of pseudo-likelihood methods to model mis-specification [8], and it would be interesting to perform a similar analysis here. Also, is it possible to give any guarantees for the empirical and expected risks (with respect to exact inference) obtained by outer-bound learning versus exact learning? Finally, our algorithm demonstrates a phenomenon where more data can make computation easier. Such a scenario was recently analyzed in the context of supervised learning [12], and it would be interesting to combine the approaches.

Acknowledgments: We thank Chen Yanover for his assistance with the protein data. This work was supported by BSF grant 2008303 and a Google Research Grant. D.S. was supported by a Google PhD Fellowship.

References

[1] J. Besag. The analysis of non-lattice data. The Statistician, 24:179–195, 1975.
[2] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
[3] G. Dantas, C. Corrent, S. L. Reichow, J. J. Havranek, Z. M. Eletr, N. G. Isern, B. Kuhlman, G. Varani, E. A. Merritt, and D. Baker. High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. Journal of Molecular Biology, 366(4):1209–1221, 2007.
[4] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, pages 304–311. ACM, 2008.
[5] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
[6] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Advances in Neural Information Processing Systems 20, pages 785–792, 2008.
[7] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.
[8] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th International Conference on Machine Learning, pages 584–591. ACM, 2008.
[9] A. F. T. Martins, N. A. Smith, and E. P. Xing. Polyhedral outer approximations with application to natural language parsing. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 713–720, 2009.
[10] N. Ratliff, J. A. D. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007.
[11] D. Roth and K. Small. Margin-based active learning for structured output spaces. In Proceedings of the European Conference on Machine Learning (ECML). Springer, September 2006.
[12] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928–935. ACM, 2008.
[13] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems 16, pages 25–32, 2004.
[14] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008.

6 0.1379499 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

7 0.13722037 152 nips-2010-Learning from Logged Implicit Exploration Data

8 0.13654737 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

9 0.12824807 208 nips-2010-Policy gradients in linearly-solvable MDPs

10 0.11485062 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

11 0.11093578 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

12 0.10743821 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration

13 0.10282134 168 nips-2010-Monte-Carlo Planning in Large POMDPs

14 0.099480502 134 nips-2010-LSTD with Random Projections

15 0.094590552 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

16 0.089141652 43 nips-2010-Bootstrapping Apprenticeship Learning

17 0.08635594 4 nips-2010-A Computational Decision Theory for Interactive Assistants

18 0.08572264 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

19 0.083403639 229 nips-2010-Reward Design via Online Gradient Ascent

20 0.081540778 11 nips-2010-A POMDP Extension with Belief-dependent Rewards


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.163), (1, -0.236), (2, -0.021), (3, 0.0), (4, -0.004), (5, -0.001), (6, -0.021), (7, 0.005), (8, 0.009), (9, -0.002), (10, -0.026), (11, 0.007), (12, 0.011), (13, 0.015), (14, -0.023), (15, -0.01), (16, 0.021), (17, 0.009), (18, 0.012), (19, 0.006), (20, 0.042), (21, 0.015), (22, 0.055), (23, -0.006), (24, 0.037), (25, 0.018), (26, 0.001), (27, 0.014), (28, 0.09), (29, -0.043), (30, 0.054), (31, 0.156), (32, -0.056), (33, -0.032), (34, 0.068), (35, -0.016), (36, -0.084), (37, 0.136), (38, -0.071), (39, -0.122), (40, 0.088), (41, -0.117), (42, -0.166), (43, -0.07), (44, 0.002), (45, -0.072), (46, -0.106), (47, 0.127), (48, -0.033), (49, -0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95938432 64 nips-2010-Distributionally Robust Markov Decision Processes

Author: Huan Xu, Shie Mannor

Abstract: We consider Markov decision processes where the values of the parameters are uncertain. This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level so that a set of admissible probability distributions of the unknown parameters is specified. This formulation models the case where the decision maker is aware of and wants to exploit some (yet imprecise) a-priori information of the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. We propose a decision criterion based on distributional robustness: the optimal policy maximizes the expected total reward under the most adversarial probability distribution over realizations of the uncertain parameters that is admissible (i.e., it agrees with the a-priori information). We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set, hence it can be computed in polynomial time under mild technical conditions.

2 0.63370115 168 nips-2010-Monte-Carlo Planning in Large POMDPs

Author: David Silver, Joel Veness

Abstract: This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 10^18 and 10^56 states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs. 1

3 0.55041218 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

Author: Surya Ganguli, Haim Sompolinsky

Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1

4 0.52407104 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum

Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1

5 0.50952971 152 nips-2010-Learning from Logged Implicit Exploration Data

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

6 0.50687492 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

7 0.50578219 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

8 0.49852252 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

9 0.47004104 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration

10 0.46832576 11 nips-2010-A POMDP Extension with Belief-dependent Rewards

11 0.46380618 4 nips-2010-A Computational Decision Theory for Interactive Assistants

12 0.45659393 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories

13 0.44625941 212 nips-2010-Predictive State Temporal Difference Learning

14 0.44598052 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

15 0.41617253 208 nips-2010-Policy gradients in linearly-solvable MDPs

16 0.40101284 43 nips-2010-Bootstrapping Apprenticeship Learning

17 0.4006803 269 nips-2010-Throttling Poisson Processes

18 0.39808476 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

19 0.39248806 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

20 0.38965535 229 nips-2010-Reward Design via Online Gradient Ascent


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.035), (27, 0.061), (30, 0.05), (35, 0.01), (45, 0.197), (50, 0.048), (52, 0.083), (53, 0.028), (60, 0.04), (77, 0.043), (78, 0.019), (82, 0.254), (90, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81814808 64 nips-2010-Distributionally Robust Markov Decision Processes

Author: Huan Xu, Shie Mannor

Abstract: We consider Markov decision processes where the values of the parameters are uncertain. This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level so that a set of admissible probability distributions of the unknown parameters is specified. This formulation models the case where the decision maker is aware of and wants to exploit some (yet imprecise) a-priori information of the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. We propose a decision criterion based on distributional robustness: the optimal policy maximizes the expected total reward under the most adversarial probability distribution over realizations of the uncertain parameters that is admissible (i.e., it agrees with the a-priori information). We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set, hence it can be computed in polynomial time under mild technical conditions.

2 0.73782706 27 nips-2010-Agnostic Active Learning Without Constraints

Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu

Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification. 1

3 0.67069036 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: Many statistical M-estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension d to grow with (and possibly exceed) the sample size n. This high-dimensional structure precludes the usual global assumptions—namely, strong convexity and smoothness conditions—that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov’s first-order method [12] has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter θ∗ and the optimal solution θ̂. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods that yielded only sublinear rates. Our analysis applies to a wide range of M-estimators and statistical models, including sparse linear regression using the Lasso (ℓ1-regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation. 1

4 0.66812581 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

Author: Surya Ganguli, Haim Sompolinsky

Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1

5 0.66546226 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes

Author: Dahua Lin, Eric Grimson, John W. Fisher

Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1

6 0.66522205 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes

7 0.66452277 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

8 0.66334975 148 nips-2010-Learning Networks of Stochastic Differential Equations

9 0.66272879 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels

10 0.66258597 280 nips-2010-Unsupervised Kernel Dimension Reduction

11 0.66252977 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

12 0.66221815 63 nips-2010-Distributed Dual Averaging In Networks

13 0.66129184 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data

14 0.66120791 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

15 0.66119492 265 nips-2010-The LASSO risk: asymptotic results and real world examples

16 0.6610958 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

17 0.6604079 117 nips-2010-Identifying graph-structured activation patterns in networks

18 0.66005552 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

19 0.65952456 287 nips-2010-Worst-Case Linear Discriminant Analysis

20 0.65797114 158 nips-2010-Learning via Gaussian Herding