nips nips2010 nips2010-197 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Paolo Viappiani, Craig Boutilier
Abstract: Bayesian approaches to utility elicitation typically adopt (myopic) expected value of information (EVOI) as a natural criterion for selecting queries. However, EVOI-optimization is usually computationally prohibitive. In this paper, we examine EVOI optimization using choice queries, queries in which a user is asked to select her most preferred product from a set. We show that, under very general assumptions, the optimal choice query w.r.t. EVOI coincides with the optimal recommendation set, that is, a set maximizing the expected utility of the user selection. Since recommendation set optimization is a simpler, submodular problem, this can greatly reduce the complexity of both exact and approximate (greedy) computation of optimal choice queries. We also examine the case where user responses to choice queries are error-prone (using both constant and mixed multinomial logit noise models) and provide worst-case guarantees. Finally, we present a local search technique for query optimization that works extremely well with large outcome spaces.
Reference: text
sentIndex sentText sentNum sentScore
1 Bayesian approaches to utility elicitation typically adopt (myopic) expected value of information (EVOI) as a natural criterion for selecting queries. [sent-5, score-0.311]
2 In this paper, we examine EVOI optimization using choice queries, queries in which a user is asked to select her most preferred product from a set. [sent-7, score-0.313]
3 We show that, under very general assumptions, the optimal choice query w.r.t. [sent-8, score-0.187]
4 EVOI coincides with the optimal recommendation set, that is, a set maximizing the expected utility of the user selection. [sent-11, score-0.564]
5 Since recommendation set optimization is a simpler, submodular problem, this can greatly reduce the complexity of both exact and approximate (greedy) computation of optimal choice queries. [sent-12, score-0.28]
6 We also examine the case where user responses to choice queries are error-prone (using both constant and mixed multinomial logit noise models) and provide worst-case guarantees. [sent-13, score-0.288]
7 Finally we present a local search technique for query optimization that works extremely well with large outcome spaces. [sent-14, score-0.16]
8 1 Introduction Utility elicitation is a key component in many decision support applications and recommender systems, since appropriate decisions or recommendations depend critically on the preferences of the user on whose behalf decisions are being made. [sent-15, score-0.354]
9 Since full elicitation of user utility is prohibitively expensive in most cases (w. [sent-16, score-0.435]
10 A number of these focus directly on (myopically or heuristically) reducing uncertainty regarding utility parameters as quickly as possible, including max-margin [10], volumetric [12], polyhedral [22] and entropy-based [1] methods. [sent-23, score-0.212]
11 A different class of approaches does not attempt to reduce utility uncertainty for its own sake, but rather focuses on discovering utility information that improves the quality of the recommendation. [sent-24, score-0.408]
12 We focus on Bayesian models in this work, assuming some prior distribution over user utility parameters and conditioning this distribution on information acquired from the user (e. [sent-26, score-0.488]
13 Such queries are commonly used in conjoint analysis and product design [15], requiring a user to indicate which choice/product is most preferred from a set of k options. [sent-33, score-0.289]
14 We show that, under very general assumptions, optimization of choice queries reduces to the simpler problem of choosing the optimal recommendation set, i.e. [sent-34, score-0.319]
15 the set such that a user's choice from it maximizes utility of that choice (in expectation). [sent-37, score-0.226]
16 Not only is the optimal recommendation set problem somewhat easier computationally, it is submodular, admitting a greedy algorithm with approximation guarantees. [sent-38, score-0.279]
17 We develop this connection under several different (noisy) user response models. [sent-40, score-0.209]
18 Finally, we describe query iteration, a local search technique that, though it has no formal guarantees, finds near-optimal recommendation sets and queries much faster than either exact or greedy optimization. [sent-41, score-0.472]
19 2 Background: Bayesian Recommendation and Elicitation We assume a system is charged with the task of recommending an option to a user in some multiattribute space, for instance, the space of possible product configurations from some domain (e. [sent-42, score-0.261]
20 The optimal product x∗ for a user with utility parameters w is the x ∈ X that maximizes u(x; w). [sent-63, score-0.366]
21 Generally, a user’s utility function w will not be known with certainty. [sent-64, score-0.196]
22 Following recent models of Bayesian elicitation, the system’s uncertainty is reflected in a distribution, or beliefs, P (w; θ) over the space W of possible utility functions [7, 6, 2]. [sent-65, score-0.212]
23 Given P (·; θ), we define the expected utility of an option x to be EU (x; θ) = ∫W u(x; w)P (w; θ)dw. [sent-67, score-0.307]
24 If required to make a recommendation given belief θ, the optimal option x∗ (θ) is that with greatest expected utility EU ∗ (θ) = maxx∈X EU (x; θ), with x∗ (θ) = arg maxx∈X EU (x; θ). [sent-68, score-0.542]
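To make these definitions concrete, here is a minimal particle-based sketch (our illustration, not the authors' implementation): the belief P (w; θ) is approximated by weighted particles wj, and utilities are precomputed in a matrix U with U[x, j] = u(x; wj). The names U and p are assumptions for illustration only.

```python
import numpy as np

def expected_utility(U, p):
    """EU(x; theta) for every option x, approximating the integral over W
    with weighted particles: U[x, j] = u(x; w_j), p[j] ~ P(w_j; theta)."""
    return U @ p

def optimal_option(U, p):
    """x*(theta) = arg max_x EU(x; theta), together with EU*(theta)."""
    eu = expected_utility(U, p)
    return int(np.argmax(eu)), float(np.max(eu))
```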
25 In some settings, we are able to make set-based recommendations: rather than recommending a single option, a small set of k options can be presented, from which the user selects her most preferred option [15, 20, 23]. [sent-69, score-0.386]
26 We discuss the problem of constructing an optimal recommendation set S further below. [sent-70, score-0.2]
27 Given recommendation set S with x ∈ S, let S ⊲ x denote that x has the greatest utility among those items in S (for a given utility function w). [sent-71, score-0.612]
28 Given feasible utility space W, we define W ∩ S ⊲ x ≡ {w ∈ W : u(x; w) ≥ u(y; w), ∀y ≠ x, y ∈ S} to be those utility functions satisfying S ⊲ x. [sent-72, score-0.406]
29 Ignoring “ties” over full-dimensional subsets of W (which are easily dealt with, but complicate the presentation), the regions W ∩ S ⊲ xi , xi ∈ S, partition utility space. [sent-73, score-0.242]
30 A recommender system can refine its belief state θ by learning more about the user’s utility function w. [sent-74, score-0.251]
31 While many sources of information can be used to assess a user’s preferences—including the preferences of related users, as in collaborative filtering [14], or observed user choice behavior [15, 19]—we focus on explicit utility elicitation, in which a user is asked questions about her preferences. [sent-76, score-0.7]
32 There are a variety of query types that can be used to refine one’s knowledge of a user’s utility function (we refer to [13, 3, 5] for further discussion). [sent-77, score-0.329]
33 Comparison queries are especially natural, asking a user if she prefers one option x to another y. [sent-78, score-0.312]
34 specific options to “generalize,” providing constraints on the utility of related options. [sent-82, score-0.287]
35 Any set S can be interpreted as a query: the user states which of the k elements xi ∈ S she prefers. [sent-84, score-0.169]
36 We refer to S interchangeably as a query or a choice set. [sent-85, score-0.163]
37 The user’s response to a choice set tells us something about her preferences; but this depends on the user response model. [sent-86, score-0.302]
38 More generally, a noisy response model allows that a user may select an option that does not maximize her utility. [sent-88, score-0.313]
39 For any choice set S with xi ∈ S, let S ⇝ xi denote the event of the user selecting xi. [sent-89, score-0.245]
40 A response model R dictates, for any choice set S, the probability PR (S ⇝ xi ; w) of any selection given utility function w. [sent-90, score-0.312]
41 When the beliefs about a user’s utility are uncertain, we define PR (S ⇝ xi ; θ) = ∫W PR (S ⇝ xi ; w)P (w; θ)dw. [sent-91, score-0.262]
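Continuing the illustrative particle sketch above, the belief-level selection probability is just the particle-weighted average of the per-w selection probabilities; response_model is a hypothetical callable standing in for PR.

```python
def selection_probs(S, U, p, response_model):
    """P_R(S ~> x_i; theta) for each x_i in S (particle average).
    response_model(U_S) returns an (|S|, n_particles) matrix whose
    column j is the selection distribution P_R(S ~> . ; w_j)."""
    per_particle = response_model(U[S, :])
    return per_particle @ p
```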
42 When treating S as a query set (as opposed to a recommendation set), we are not interested in its expected utility, but rather in its expected value of information (EVOI), or the (expected) degree to which a response will increase the quality of the system’s recommendation. [sent-93, score-0.416]
43 We define: Definition 1 Given belief state θ, the expected posterior utility (EPU) of query set S under R is EPU R (S; θ) = Σx∈S PR (S ⇝ x; θ) EU ∗ (θ | S ⇝ x). (1) EVOI (S; θ) is then EPU (S; θ) − EU ∗ (θ), the expected improvement in decision quality given S. [sent-94, score-0.393]
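Definition 1 translates directly into the same sketch: for each possible response S ⇝ x, Bayesian-update the particle weights, take the posterior-optimal expected utility, and weight by the response probability. This is an illustrative sketch under the same assumed conventions as above.

```python
def epu(S, U, p, response_model):
    """EPU_R(S; theta) = sum_{x in S} P_R(S ~> x; theta) * EU*(theta | S ~> x)."""
    lik = response_model(U[S, :])            # per-particle selection likelihoods
    value = 0.0
    for i in range(len(S)):
        post = lik[i] * p                    # unnormalized Bayesian update of theta
        z = post.sum()                       # = P_R(S ~> x_i; theta)
        if z > 0:
            value += z * float(np.max(U @ (post / z)))   # weight * EU*(posterior)
    return value

def evoi(S, U, p, response_model):
    """EVOI(S; theta) = EPU_R(S; theta) - EU*(theta)."""
    return epu(S, U, p, response_model) - float(np.max(U @ p))
```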
44 In many settings, we may wish to present a set of options to a user with the dual goals of offering a good set of recommendations and eliciting valuable information about user utility. [sent-96, score-0.412]
45 For instance, product navigation interfaces for e-commerce sites often display a set of options from which a user can select, but also give the user a chance to critique the proposed options [24]. [sent-97, score-0.474]
46 This provides one motivation for exploring the connection between optimal recommendation sets and optimal query sets. [sent-98, score-0.378]
47 Moreover, even in settings where queries and recommendation are separated, we will see that query optimization can be made more efficient by exploiting this relationship. [sent-99, score-0.398]
48 3 Optimal Recommendation Sets We consider first the problem of computing optimal recommendation sets given the system’s uncertainty about the user’s true utility function w. [sent-100, score-0.433]
49 Given belief state θ, if a single recommendation is to be made, then we should recommend the option x∗ (θ) that maximizes expected utility EU (x, θ). [sent-101, score-0.503]
50 However, there is often value in suggesting a “shortlist” containing multiple options and allowing the user to select her most preferred option. [sent-102, score-0.285]
51 Intuitively, such a set should offer options that are diverse in the following sense: recommended options should be highly preferred relative to a wide range of “likely” user utility functions (relative to θ) [23, 20, 4]. [sent-103, score-0.572]
52 This stands in contrast to some recommender systems that define diversity relative to product attributes [21], with no direct reference to beliefs about user utility. [sent-104, score-0.221]
53 It is not hard to see that “top k” systems, those that present the k options with highest expected utility, do not generally result in good recommendation sets [20]. [sent-105, score-0.31]
54 In broad terms, we assume that the utility of a recommendation set S is the utility of its most preferred item. [sent-106, score-0.616]
55 In the ideal setting, users would always select the option with highest utility w.r.t. their true utility function w. [sent-109, score-0.296]
56 Noisy response models admit user “mistakes” and the choice of optimal sets should reflect this possibility (just as belief update does; see Defn. 2). [sent-115, score-0.304]
57 We consider three different response models: • In the noiseless response model, RNL , we have PNL (S ⇝ x; w) = I[w ∈ S ⊲ x] (with indicator function I). [sent-118, score-0.199]
58 • The constant noise model RC assumes a multinomial distribution over choices or responses where each option x, apart from the most preferred option x∗ relative to w, is selected with (small) constant probability PC (S ⇝ x; w) = β, with β independent of w. [sent-122, score-0.25]
59 • In the logistic response model RL , for comparison queries (i.e., |S| = 2), the selection probability is the logistic function of the difference in utility between the two options. [sent-129, score-0.219]
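The three response models are easy to express in the particle sketch; β and the temperature γ are illustrative parameter values, not taken from the paper.

```python
def noiseless(U_S):
    """R_NL: per particle, probability 1 on the utility-maximizing item."""
    out = np.zeros_like(U_S)
    out[U_S.argmax(axis=0), np.arange(U_S.shape[1])] = 1.0
    return out

def constant_noise(U_S, beta=0.05):
    """R_C: each non-best item chosen with constant probability beta."""
    out = np.full_like(U_S, beta)
    out[U_S.argmax(axis=0), np.arange(U_S.shape[1])] = 1.0 - beta * (U_S.shape[0] - 1)
    return out

def logistic_pair(U_S, gamma=1.0):
    """R_L for comparison queries (|S| = 2): logistic in the utility gap."""
    p_first = 1.0 / (1.0 + np.exp(-gamma * (U_S[0] - U_S[1])))
    return np.vstack([p_first, 1.0 - p_first])
```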
60 We now consider properties of the expected utility of selection EUS under these various models. [sent-130, score-0.218]
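In the same sketch, EUS under a generic response model is the expectation, over both w and the (possibly noisy) selection, of the utility of the selected item; under RNL it reduces to the expectation of maxx∈S u(x; w).

```python
def eus(S, U, p, response_model):
    """EUS_R(S; theta): expected utility of the item the user selects from S."""
    lik = response_model(U[S, :])                        # (|S|, n_particles)
    return float(((lik * U[S, :]).sum(axis=0) * p).sum())
```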
61 EUS is monotone under the noiseless response model RNL : the addition of options to a recommendation set S cannot decrease its expected utility EUS NL (S; θ). [sent-132, score-0.621]
62 Adding a dominated option (i.e., an x dominated by some element of S) does not change expected utility under RNL : EUS NL (S ∪ {x}; θ) = EUS NL (S; θ). [sent-136, score-0.229]
63 This stands in contrast to noisy response models, where adding dominated options might actually decrease expected utility. [sent-137, score-0.202]
64 Importantly, EUS is submodular for both the noiseless and the constant noise response models RNL and RC : Theorem 1 For R ∈ {RNL , RC }, EUS R is a submodular function of the set S. [sent-138, score-0.212]
65 That is, given recommendation sets S ⊆ Q, option x ∉ Q, S ′ = S ∪ {x}, and Q′ = Q ∪ {x}, we have: EUS R (S ′ ; θ) − EUS R (S; θ) ≥ EUS R (Q′ ; θ) − EUS R (Q; θ) (4) The proof is omitted, but simply shows that EUS has the required property of diminishing returns. [sent-139, score-0.286]
66 Submodularity serves as the basis for a greedy optimization algorithm (see Section 5 and worst-case results on query optimization below). [sent-140, score-0.222]
67 EUS under the commonly used logistic response model RL is not submodular, but can be related to EUS under the noiseless model—as we discuss next—allowing us to exploit submodularity of the noiseless model when optimizing w.r.t. RL . [sent-141, score-0.265]
68 4 The Connection between EUS and EPU We now develop the connection between optimal recommendation sets (using EUS) and optimal choice queries (using EPU/EVOI). [sent-145, score-0.352]
69 This transformation is used in two ways: (i) to prove the optimality (near-optimality in the case of RL ) of EUS-optimal recommendation sets when used as query sets; and (ii) directly as a computationally viable heuristic strategy for generating query sets. [sent-148, score-0.463]
70 We first show that optimal recommendation sets under both RNL and RC are optimal (i.e., EVOI-optimal) when used as query sets. [sent-158, score-0.245]
71 We now state the main theorem (we assume the size k of S is fixed): Theorem 2 Assume response model R ∈ {NL, C} and let S ∗ be an optimal recommendation set. [sent-169, score-0.263]
72 Then S ∗ is an optimal query set: EPU (S ∗ ; θ) ≥ EPU (S; θ), ∀S ∈ Xk. Proof: Suppose S ∗ is not an optimal query set, i.e., that EPU (S ′ ; θ) > EPU (S ∗ ; θ) for some S ′ ∈ Xk. [sent-170, score-0.314]
73 Another consequence of Lemma 1 is that posing a query S involving an infeasible option is pointless: there is always a set with only elements in X with EPU/EVOI at least as great. [sent-177, score-0.234]
74 It is not hard to see that admitting noisy responses under the logistic response model RL can decrease the value of a recommendation set, i.e., EUS L (S; θ) may be less than EUS NL (S; θ). [sent-179, score-0.315]
75 The logistic response model is such that, if the probability of incorrect selection of some option is high, then the utility of that option must be close to that of the best item, so the relative loss in utility is small. [sent-183, score-0.669]
76 Conversely, if the loss associated with some incorrect selection is great, its utility must be significantly less than that of the best option, rendering such an event extremely unlikely. [sent-184, score-0.209]
77 Under RL , our transformation TL does not, in general, improve the value EUS L (S) of a recommendation set S. [sent-186, score-0.176]
78 Second, EPU of the optimal query under the noiseless model is at least as great as that of the optimal query under the logistic model: EPU ∗ NL (θ) ≥ EPU ∗ L (θ). [sent-190, score-0.41]
79 We now derive our main result for logistic responses: the EUS of the optimal recommendation set (and hence its EPU) is at most ∆max less than the EPU of the optimal query set. [sent-191, score-0.38]
80 For a specific value of z ≥ 0, EUS-loss is exactly the utility difference z times the probability of choosing the less preferred option under RL : 1 − L(γz) = L(−γz) where L is the logistic function. [sent-199, score-0.356]
81 The maximal loss ∆max = f (zmax ) for a set of two hypothetical items s1 and s2 is attained by having the same utility difference u(s1 , w) − u(s2 , w) = zmax for any w ∈ W . [sent-202, score-0.263]
82 This bound can be expressed on a scale that is independent of the temperature parameter γ; intuitively, ∆max corresponds to a utility difference so slight that the user identifies the best item only with probability 0.56. [sent-207, score-0.397]
83 In other words, the maximum loss is so small that the user is unable to identify the preferred item 44% of the time when asked to compare the two items in S. [sent-209, score-0.272]
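As a quick numerical check of this bound (our sketch, with γ = 1 and f (z) = z·L(−γz) as above): the maximizing gap is zmax ≈ 1.28, giving ∆max ≈ 0.28, and a gap of ∆max itself is identified correctly only about 57% of the time, consistent with the 0.56/44% figures quoted here.

```python
z = np.linspace(0.0, 10.0, 100001)
f = z / (1.0 + np.exp(z))                # f(z) = z * L(-z): gap times mistake probability
i = int(np.argmax(f))
z_max, d_max = z[i], f[i]                # z_max ~ 1.28, Delta_max ~ 0.28
print(1.0 / (1.0 + np.exp(-d_max)))      # ~ 0.57: a gap of Delta_max is barely detectable
```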
84 Given a set S of k elements, computing EPU (S, θ) requires Bayesian update of θ for each possible response, and expected utility optimization for each such posterior. [sent-216, score-0.23]
85 Theorem 2 allows us to compute optimal query sets using EUS-maximization under RC and RNL , reducing complexity by a factor of n. [sent-221, score-0.178]
86 Greedy Optimization A simple greedy algorithm can be used to construct a recommendation set of size k by iteratively adding the option offering the greatest improvement in value: arg maxx EUS R (S ∪ {x}; θ). [sent-224, score-0.361]
87 Since EUS is submodular (Thm. 1), the greedy algorithm determines a set with EUS that is within η = 1 − ((k−1)/k)^k of the optimal value. (Footnote 1: EPU L (S; θ) is not necessarily less than EPU NL (S; θ): there are sets S for which a noisy response might be “more informative” than a noiseless one.) [sent-226, score-0.261]
88 Theorem 2 again allows us to use greedy maximization of EUS to determine a query set with similar guarantees. [sent-232, score-0.212]
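A minimal sketch of the greedy construction, reusing eus from the sketch above; by Theorem 1 this enjoys the 1 − ((k−1)/k)^k guarantee under RNL and RC, and by Theorem 2 the resulting set can double as a query set.

```python
def greedy_set(U, p, k, response_model):
    """Greedily add arg max_x EUS_R(S + [x]; theta) until |S| = k."""
    S = []
    for _ in range(k):
        best_x, best_val = None, -np.inf
        for x in range(U.shape[0]):
            if x in S:
                continue
            val = eus(S + [x], U, p, response_model)
            if val > best_val:
                best_x, best_val = x, val
        S.append(best_x)
    return S
```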
89 We consider several initialization strategies: random (randomly choose k options); sampling (include x∗ (θ), sample k − 1 points wi from P (w; θ), and for each wi add its optimal item to S, forcing distinctness); and greedy (initialize with the greedy set Sg ). [sent-253, score-0.215]
90 This means that for comparison queries (|S| = 2), QI achieves at least 50% of the optimal recommendation set value. [sent-260, score-0.277]
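Query iteration is only described loosely here, so the following sketch is our reading, not the authors' code: it assumes the local-search step replaces the current set S by T(S) = {x∗(θ | S ⇝ x) : x ∈ S}, the posterior-optimal item for each possible response, stopping when EUS stops improving.

```python
def query_iteration(S0, U, p, response_model, max_iters=100):
    """Local search: S <- T(S) while EUS improves (no formal guarantee)."""
    S = list(S0)
    best = eus(S, U, p, response_model)
    for _ in range(max_iters):
        lik = response_model(U[S, :])
        T = []
        for i in range(len(S)):
            post = lik[i] * p                 # posterior weights given response S ~> x_i
            z = post.sum()
            T.append(int(np.argmax(U @ (post / z))) if z > 0 else S[i])
        T = list(dict.fromkeys(T))            # drop duplicates, keep order
        if len(T) < len(S):
            break                             # set collapsed; keep current S
        val = eus(T, U, p, response_model)
        if val <= best:
            break
        S, best = T, val
    return S
```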
91 Evaluation We compare the strategies above empirically on choice problems with random user utility functions using both noiseless and noisy response models. [sent-271, score-0.552]
92 Figure 1(a) shows the average loss of our strategies in an apartment rental dataset, with 187 outcomes, each characterized by 10 attributes (either numeric or categorical with domain sizes 2–6), when asking pairwise comparison queries with noiseless responses. [sent-277, score-0.237]
93 We compare the greedy and QI strategies (exact methods are impractical on problems of this size) in Figure 1(b); we also consider a hybrid greedy(EUS,NL) strategy that optimizes “assuming” noiseless responses, but is evaluated using the true response model RL . [sent-293, score-0.23]
94 6 Conclusions We have provided a novel analysis of set-based recommendations in Bayesian recommender systems, and have shown how it offers a tractable means of generating myopically optimal or near-optimal choice queries for preference elicitation. [sent-345, score-0.253]
95 We examined several user response models, showing that optimal recommendation sets are EVOI-optimal queries under noiseless and constant noise models; and that they are near-optimal under the logistic/Luce-Shepard model (both theoretically and practically). [sent-346, score-0.58]
96 We stress that our results are general and do not depend on the specific implementation of Bayesian update, nor on the specific form of the utility function. [sent-347, score-0.196]
97 Further theoretical and practical investigation of local search strategies such as query iteration is important. [sent-351, score-0.174]
98 Another direction is the development of strategies for Bayesian recommendation and elicitation in large-scale configuration problems, e. [sent-352, score-0.298]
99 Constraint-based optimization and utility elicitation using the minimax decision criterion. [sent-366, score-0.301]
100 Interdependence and additivity in multivariate, unidimensional expected utility theory. [sent-396, score-0.218]
wordName wordTfidf (topN-words)
[('eus', 0.792), ('epu', 0.347), ('utility', 0.196), ('recommendation', 0.176), ('user', 0.146), ('query', 0.133), ('nl', 0.12), ('rnl', 0.107), ('elicitation', 0.093), ('options', 0.091), ('option', 0.089), ('eu', 0.086), ('queries', 0.077), ('qi', 0.076), ('noiseless', 0.073), ('evoi', 0.065), ('greedy', 0.065), ('response', 0.063), ('rl', 0.061), ('sg', 0.056), ('preferred', 0.048), ('rc', 0.044), ('submodular', 0.038), ('tl', 0.038), ('craig', 0.037), ('item', 0.036), ('recommender', 0.035), ('submodularity', 0.033), ('choice', 0.03), ('strategies', 0.029), ('preference', 0.029), ('myopically', 0.029), ('items', 0.029), ('recommendations', 0.029), ('outcomes', 0.028), ('preferences', 0.025), ('paolo', 0.025), ('rental', 0.025), ('viappiani', 0.025), ('zmax', 0.025), ('responses', 0.024), ('optimal', 0.024), ('xi', 0.023), ('logistic', 0.023), ('expected', 0.022), ('pr', 0.021), ('particles', 0.021), ('sets', 0.021), ('beliefs', 0.02), ('belief', 0.02), ('dom', 0.02), ('attributes', 0.02), ('temperature', 0.019), ('boutilier', 0.019), ('conjoint', 0.018), ('sl', 0.017), ('bayesian', 0.017), ('exacteus', 0.017), ('uncertainty', 0.016), ('maxx', 0.016), ('noisy', 0.015), ('greatest', 0.015), ('outcome', 0.015), ('chajewska', 0.014), ('multiattribute', 0.014), ('admitting', 0.014), ('gai', 0.014), ('slate', 0.014), ('xj', 0.014), ('nk', 0.014), ('particle', 0.014), ('sampling', 0.014), ('maximization', 0.014), ('feasible', 0.014), ('lw', 0.013), ('queue', 0.013), ('lambert', 0.013), ('decisions', 0.013), ('loss', 0.013), ('sd', 0.013), ('myopic', 0.012), ('luce', 0.012), ('recommending', 0.012), ('max', 0.012), ('lemma', 0.012), ('optimization', 0.012), ('infeasible', 0.012), ('intelligence', 0.012), ('iteration', 0.012), ('collaborative', 0.011), ('dw', 0.011), ('rand', 0.011), ('logit', 0.011), ('additive', 0.011), ('dominated', 0.011), ('users', 0.011), ('initialization', 0.011), ('unrealistic', 0.011), ('cars', 0.011), ('arti', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
Author: Paolo Viappiani, Craig Boutilier
Abstract: Bayesian approaches to utility elicitation typically adopt (myopic) expected value of information (EVOI) as a natural criterion for selecting queries. However, EVOI-optimization is usually computationally prohibitive. In this paper, we examine EVOI optimization using choice queries, queries in which a user is asked to select her most preferred product from a set. We show that, under very general assumptions, the optimal choice query w.r.t. EVOI coincides with the optimal recommendation set, that is, a set maximizing the expected utility of the user selection. Since recommendation set optimization is a simpler, submodular problem, this can greatly reduce the complexity of both exact and approximate (greedy) computation of optimal choice queries. We also examine the case where user responses to choice queries are error-prone (using both constant and mixed multinomial logit noise models) and provide worst-case guarantees. Finally, we present a local search technique for query optimization that works extremely well with large outcome spaces.
2 0.22629565 100 nips-2010-Gaussian Process Preference Elicitation
Author: Shengbo Guo, Scott Sanner, Edwin V. Bonilla
Abstract: Bayesian approaches to preference elicitation (PE) are particularly attractive due to their ability to explicitly model uncertainty in users’ latent utility functions. However, previous approaches to Bayesian PE have ignored the important problem of generalizing from previous users to an unseen user in order to reduce the elicitation burden on new users. In this paper, we address this deficiency by introducing a Gaussian Process (GP) prior over users’ latent utility functions on the joint space of user and item features. We learn the hyper-parameters of this GP on a set of preferences of previous users and use it to aid in the elicitation process for a new user. This approach provides a flexible model of a multi-user utility function, facilitates an efficient value of information (VOI) heuristic query selection strategy, and provides a principled way to incorporate the elicitations of multiple users back into the model. We show the effectiveness of our method in comparison to previous work on a real dataset of user preferences over sushi types. 1
3 0.099801779 268 nips-2010-The Neural Costs of Optimal Control
Author: Samuel Gershman, Robert Wilson
Abstract: Optimal control entails combining probabilities and utilities. However, for most practical problems, probability densities can be represented only approximately. Choosing an approximation requires balancing the benefits of an accurate approximation against the costs of computing it. We propose a variational framework for achieving this balance and apply it to the problem of how a neural population code should optimally represent a distribution under resource constraints. The essence of our analysis is the conjecture that population codes are organized to maximize a lower bound on the log expected utility. This theory can account for a plethora of experimental data, including the reward-modulation of sensory receptive fields, GABAergic effects on saccadic movements, and risk aversion in decisions under uncertainty. 1
4 0.091972359 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1
5 0.060839199 39 nips-2010-Bayesian Action-Graph Games
Author: Albert X. Jiang, Kevin Leyton-brown
Abstract: Games of incomplete information, or Bayesian games, are an important gametheoretic model and have many applications in economics. We propose Bayesian action-graph games (BAGGs), a novel graphical representation for Bayesian games. BAGGs can represent arbitrary Bayesian games, and furthermore can compactly express Bayesian games exhibiting commonly encountered types of structure including symmetry, action- and type-specific utility independence, and probabilistic independence of type distributions. We provide an algorithm for computing expected utility in BAGGs, and discuss conditions under which the algorithm runs in polynomial time. Bayes-Nash equilibria of BAGGs can be computed by adapting existing algorithms for complete-information normal form games and leveraging our expected utility algorithm. We show both theoretically and empirically that our approaches improve significantly on the state of the art. 1
6 0.056573767 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
7 0.043804355 25 nips-2010-Active Learning by Querying Informative and Representative Examples
8 0.040646009 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
9 0.040292233 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
10 0.03979826 258 nips-2010-Structured sparsity-inducing norms through submodular functions
11 0.038562842 27 nips-2010-Agnostic Active Learning Without Constraints
12 0.036511321 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
13 0.032830559 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
14 0.030286131 69 nips-2010-Efficient Minimization of Decomposable Submodular Functions
15 0.029126385 166 nips-2010-Minimum Average Cost Clustering
16 0.026368439 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
17 0.025398312 152 nips-2010-Learning from Logged Implicit Exploration Data
18 0.023951774 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
19 0.023812778 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
20 0.023605209 261 nips-2010-Supervised Clustering
topicId topicWeight
[(0, 0.082), (1, -0.004), (2, 0.016), (3, 0.013), (4, -0.034), (5, 0.037), (6, -0.035), (7, 0.004), (8, 0.044), (9, -0.039), (10, 0.042), (11, 0.013), (12, -0.055), (13, -0.093), (14, -0.011), (15, 0.017), (16, -0.088), (17, 0.033), (18, -0.028), (19, 0.107), (20, -0.025), (21, 0.066), (22, -0.004), (23, -0.13), (24, 0.019), (25, -0.087), (26, -0.035), (27, -0.012), (28, -0.065), (29, 0.193), (30, 0.048), (31, -0.025), (32, 0.133), (33, -0.043), (34, 0.151), (35, 0.1), (36, 0.001), (37, 0.013), (38, 0.079), (39, -0.113), (40, -0.135), (41, 0.123), (42, -0.147), (43, -0.075), (44, 0.202), (45, 0.094), (46, -0.019), (47, 0.006), (48, -0.019), (49, 0.178)]
simIndex simValue paperId paperTitle
same-paper 1 0.96188247 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
Author: Paolo Viappiani, Craig Boutilier
Abstract: Bayesian approaches to utility elicitation typically adopt (myopic) expected value of information (EVOI) as a natural criterion for selecting queries. However, EVOI-optimization is usually computationally prohibitive. In this paper, we examine EVOI optimization using choice queries, queries in which a user is asked to select her most preferred product from a set. We show that, under very general assumptions, the optimal choice query w.r.t. EVOI coincides with the optimal recommendation set, that is, a set maximizing the expected utility of the user selection. Since recommendation set optimization is a simpler, submodular problem, this can greatly reduce the complexity of both exact and approximate (greedy) computation of optimal choice queries. We also examine the case where user responses to choice queries are error-prone (using both constant and mixed multinomial logit noise models) and provide worst-case guarantees. Finally, we present a local search technique for query optimization that works extremely well with large outcome spaces.
2 0.8393268 100 nips-2010-Gaussian Process Preference Elicitation
Author: Shengbo Guo, Scott Sanner, Edwin V. Bonilla
Abstract: Bayesian approaches to preference elicitation (PE) are particularly attractive due to their ability to explicitly model uncertainty in users’ latent utility functions. However, previous approaches to Bayesian PE have ignored the important problem of generalizing from previous users to an unseen user in order to reduce the elicitation burden on new users. In this paper, we address this deficiency by introducing a Gaussian Process (GP) prior over users’ latent utility functions on the joint space of user and item features. We learn the hyper-parameters of this GP on a set of preferences of previous users and use it to aid in the elicitation process for a new user. This approach provides a flexible model of a multi-user utility function, facilitates an efficient value of information (VOI) heuristic query selection strategy, and provides a principled way to incorporate the elicitations of multiple users back into the model. We show the effectiveness of our method in comparison to previous work on a real dataset of user preferences over sushi types. 1
3 0.46376315 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
Author: Gilbert Leung, Novi Quadrianto, Kostas Tsioutsiouliklis, Alex J. Smola
Abstract: We present a fast online solver for large scale parametric max-flow problems as they occur in portfolio optimization, inventory management, computer vision, and logistics. Our algorithm solves an integer linear program in an online fashion. It exploits total unimodularity of the constraint matrix and a Lagrangian relaxation to solve the problem as a convex online game. The algorithm generates approximate solutions of max-flow problems by performing stochastic gradient descent on a set of flows. We apply the algorithm to optimize tier arrangement of over 84 million web pages on a layered set of caches to serve an incoming query stream optimally. 1
4 0.41728288 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1
5 0.40054846 268 nips-2010-The Neural Costs of Optimal Control
Author: Samuel Gershman, Robert Wilson
Abstract: Optimal control entails combining probabilities and utilities. However, for most practical problems, probability densities can be represented only approximately. Choosing an approximation requires balancing the benefits of an accurate approximation against the costs of computing it. We propose a variational framework for achieving this balance and apply it to the problem of how a neural population code should optimally represent a distribution under resource constraints. The essence of our analysis is the conjecture that population codes are organized to maximize a lower bound on the log expected utility. This theory can account for a plethora of experimental data, including the reward-modulation of sensory receptive fields, GABAergic effects on saccadic movements, and risk aversion in decisions under uncertainty. 1
6 0.34687036 39 nips-2010-Bayesian Action-Graph Games
7 0.3096351 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
8 0.30357918 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
9 0.26770169 233 nips-2010-Scrambled Objects for Least-Squares Regression
10 0.22744851 226 nips-2010-Repeated Games against Budgeted Adversaries
11 0.21952842 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
12 0.21773823 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
13 0.21258986 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
14 0.20913458 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
15 0.20098281 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
16 0.19001187 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
17 0.18595251 120 nips-2010-Improvements to the Sequence Memoizer
18 0.18434031 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
19 0.18302369 215 nips-2010-Probabilistic Deterministic Infinite Automata
20 0.1812968 155 nips-2010-Learning the context of a category
topicId topicWeight
[(13, 0.034), (24, 0.012), (27, 0.053), (30, 0.038), (33, 0.014), (45, 0.294), (50, 0.045), (52, 0.026), (53, 0.012), (60, 0.028), (69, 0.234), (77, 0.036), (79, 0.016), (90, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.87751162 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
Author: Paolo Viappiani, Craig Boutilier
Abstract: Bayesian approaches to utility elicitation typically adopt (myopic) expected value of information (EVOI) as a natural criterion for selecting queries. However, EVOI-optimization is usually computationally prohibitive. In this paper, we examine EVOI optimization using choice queries, queries in which a user is asked to select her most preferred product from a set. We show that, under very general assumptions, the optimal choice query w.r.t. EVOI coincides with the optimal recommendation set, that is, a set maximizing the expected utility of the user selection. Since recommendation set optimization is a simpler, submodular problem, this can greatly reduce the complexity of both exact and approximate (greedy) computation of optimal choice queries. We also examine the case where user responses to choice queries are error-prone (using both constant and mixed multinomial logit noise models) and provide worst-case guarantees. Finally, we present a local search technique for query optimization that works extremely well with large outcome spaces.
2 0.82665735 155 nips-2010-Learning the context of a category
Author: Dan Navarro
Abstract: This paper outlines a hierarchical Bayesian model for human category learning that learns both the organization of objects into categories, and the context in which this knowledge should be applied. The model is fit to multiple data sets, and provides a parsimonious method for describing how humans learn context specific conceptual representations.
3 0.80138654 114 nips-2010-Humans Learn Using Manifolds, Reluctantly
Author: Tim Rogers, Chuck Kalish, Joseph Harrison, Xiaojin Zhu, Bryan R. Gibson
Abstract: When the distribution of unlabeled data in feature space lies along a manifold, the information it provides may be used by a learner to assist classification in a semi-supervised setting. While manifold learning is well-known in machine learning, the use of manifolds in human learning is largely unstudied. We perform a set of experiments which test a human’s ability to use a manifold in a semisupervised learning task, under varying conditions. We show that humans may be encouraged into using the manifold, overcoming the strong preference for a simple, axis-parallel linear boundary. 1
4 0.790999 212 nips-2010-Predictive State Temporal Difference Learning
Author: Byron Boots, Geoffrey J. Gordon
Abstract: We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem. 1
5 0.79042715 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
Author: David Sontag, Ofer Meshi, Amir Globerson, Tommi S. Jaakkola
Abstract: The problem of learning to predict structured labels is of key importance in many applications. However, for general graph structure both learning and inference are intractable. Here we show that it is possible to circumvent this difficulty when the distribution of training examples is rich enough, via a method similar in spirit to pseudo-likelihood. We show that our new method achieves consistency, and illustrate empirically that it indeed approaches the performance of exact methods when sufficiently large training sets are used. Many prediction problems in machine learning applications are structured prediction tasks. For example, in protein folding we are given a protein sequence and the goal is to predict the protein’s native structure [14]. In parsing for natural language processing, we are given a sentence and the goal is to predict the most likely parse tree [2]. In these and many other applications, we can formalize the structured prediction problem as taking an input x (e.g., primary sequence, sentence) and predicting ˆ y (e.g., structure, parse) according to y = arg maxy∈Y θ · φ(x, y ), where φ(x, y) is a function that ˆ maps any input and a candidate assignment to a feature vector, Y denotes the space of all possible assignments to the vector y, and θ is a weight vector to be learned. This paper addresses the problem of learning structured prediction models from data. In particular, given a set of labeled examples {(xm , y m )}M , our goal is to find a vector θ such that for each m=1 example m, y m = arg maxy∈Y θ · φ(xm , y), i.e. one which separates the training data. For many structured prediction models, maximization over Y is computationally intractable. This makes it difficult to apply previous algorithms for learning structured prediction models, such as structured perceptron [2], stochastic subgradient [10], and cutting-plane algorithms [5], which require making a prediction at every iteration (equivalent to repeatedly solving an integer linear program). Given training data, we can consider the space of parameters Θ that separate the data. This space can be defined by the intersection of a large number of linear inequalities. A recent approach to getting around the hardness of prediction is to use linear programming (LP) relaxations to approximate the maximization over Y [4, 6, 9]. However, separation with respect to a relaxation places stronger constraints on the parameters. The target solution, an integral vertex in the LP, must now distinguish itself also from possible fractional vertexes that arise due to the relaxation. The relaxations can therefore be understood as optimizing over an inner bound of Θ. This set may be empty even if the training data is separable with exact inference [6]. Another obstacle to using LP relaxations for learning is that solving the LPs can be very slow. In this paper we ask whether it is possible to learn while avoiding inference altogether. We propose a new learning algorithm, inspired by pseudo-likelihood [1], that optimizes over an outer bound of Θ. Learning involves optimizing over only a small number of constraints per data point, and thus can be performed quickly, even for complex structured prediction models. We show that, if the data available for learning is “nice”, this algorithm is consistent, i.e. it will find some θ ∈ Θ. This is an example of how having the right data can circumvent the hardness of learning for structured prediction. 1 We also investigate the limitations of the proposed method. 
We show that the problem of even deciding whether a given data set is separable is NP-hard, and thus learning in a strict sense is no easier than prediction. Thus, we should not expect for our algorithm, or any other polynomial time algorithm, to always succeed at learning from an arbitrary finite data set. To our knowledge, this is the first result characterizing the hardness of exact learning for structured prediction. Finally, we show empirically that our algorithm allows us to successfully learn the parameters for both multi-label prediction and protein side-chain placement. The performance of the algorithm is improved as more data becomes available, as our theoretical results anticipate. 1 Pseudo-Max method We consider the general structured prediction problem. The input space is denoted by X and the set of all possible assignments by Y. Each y ∈ Y corresponds to n variables y1 , . . . , yn , each with k possible states. The classifier uses a (given) function φ(x, y) : X , Y → Rd and (learned) weights θ ∈ Rd , and is defined as y(x; θ) = arg maxy∈Y f (ˆ ; x, θ) where f is the discriminant function y ˆ f (y; x, θ) = θ · φ(x, y). Our analysis will focus on functions φ whose scope is limited to small sets of the yi variables, but for now we keep the discussion general. Given a set of labeled examples {(xm , y m )}M , the goal of the typical learning problem is to find m=1 weights θ that correctly classify the training examples. Consider first the separable case. Define the set of separating weight vectors, Θ = θ | ∀m, y ∈ Y, f (y m ; xm , θ) ≥ f (y; xm , θ)+e(y, y m ) . e is a loss function (e.g., zero-one or Hamming) such that e(y m , y m ) = 0 and e(y, y m ) > 0 for y = y m , which serves to rule out the trivial solution θ = 0.1 The space Θ is defined by exponentially many constraints per example, one for each competing assignment. In this work we consider a much simpler set of constraints where, for each example, we only consider the competing assignments obtained by modifying a single label yi , while fixing the other labels to their value at y m . The pseudo-max set, which is an outer bound on Θ, is given by Here ym −i m Θps = θ | ∀m, i, yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) + e(yi , yi ) . −i denotes the label y m (1) without the assignment to yi . When the data is not separable, Θ will be the empty set. Instead, we may choose to minimize the hinge loss, (θ) = m maxy f (y; xm , θ) − f (y m ; xm , θ) + e(y, y m ) , which can be shown to be an upper bound on the training error [13]. When the data is separable, minθ (θ) = 0. Note that regularization may be added to this objective. The corresponding pseudo-max objective replaces the maximization over all of y with maximization over a single variable yi while fixing the other labels to their value at y m :2,3 M ps (θ) n = m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) . −i yi Analogous to before, we have minθ ps (θ) (2) = 0 if and only if θ ∈ Θps . The objective in Eq. 2 is similar in spirit to pseudo-likelihood objectives used for maximum likelihood estimation of parameters of Markov random fields (MRFs) [1]. The pseudo-likelihood estimate is provably consistent when the data generating distribution is a MRF of the same structure as used in the pseudo-likelihood objective. However, our setting is different since we only get to view the maximizing assignment of the MRF rather than samples from it. 
Thus, a particular x will always be paired with the same y rather than samples y drawn from the conditional distribution p(y|x; θ). The pseudo-max constraints in Eq. 1 are also related to cutting plane approaches to inference [4, 5]. In the latter, the learning problem is solved by repeatedly looking for assignments that violate the separability constraint (or its hinge version). Our constraints can be viewed as using a very small 1 An alternative formulation, which we use in the next section, is to break the symmetry by having part of the input not be multiplied by any weight. This will also rule out the trivial solution θ = 0. P 2 It is possible to use maxi instead of i , and some of our consistency results will still hold. 3 The pseudo-max approach is markedly different from a learning method which predicts each label yi independently, since the objective considers all i simultaneously (both at learning and test time). 2 x2 0.2 J ∗ + x1 = 0 y = (0, 1) y = (1, 1) g(J12) x2 = 0 x1 J ∗ + x1 + x2 = 0 y = (0, 0) c1=0 c1=1 c1= 1 0.15 0.1 J + x2 = 0 ∗ 0.05 y = (1, 0) x1 = 0 0 1 0.5 0 J 0.5 1 Figure 1: Illustrations for a model with two variables. Left: Partitioning of X induced by configurations y(x) for some J ∗ > 0. Blue lines carve out the exact regions. Red lines denote the pseudo-max constraints that hold with equality. Pseudo-max does not obtain the diagonal constraint coming from comparing configurations y = (1, 1) and (0, 0), since these differ by more than one coordinate. Right: One strictly-convex component of the ps (J ) function (see Eq. 9). The function is shown for different values of c1 , the mean of the x1 variable. subset of assignments for the set of candidate constraint violators. We also note that when exact maximization over the discriminant function f (y; x, θ) is hard, the standard cutting plane algorithm cannot be employed since it is infeasible to find a violated constraint. For the pseudo-max objective, finding a constraint violation is simple and linear in the number of variables.4 It is easy to see (as will be elaborated on next) that the pseudo-max method does not in general yield a consistent estimate of θ, even in the separable case. However, as we show, consistency can be shown to be achieved under particular assumptions on the data generating distribution p(x). 2 Consistency of the Pseudo-Max method In this section we show that if the feature generating distribution p(x) satisfies particular assumptions, then the pseudo-max approach yields a consistent estimate. In other words, if the training data is of the form {(xm , y(xm ; θ ∗ ))}M for some true parameter vector θ ∗ , then as M → ∞ the m=1 minimum of the pseudo-max objective will converge to θ ∗ (up to equivalence transformations). The section is organized as follows. First, we provide intuition for the consistency results by considering a model with only two variables. Then, in Sec. 2.1, we show that any parameter θ ∗ can be identified to within arbitrary accuracy by choosing a particular training set (i.e., choice of xm ). This in itself proves consistency, as long as there is a non-zero probability of sampling this set. In Sec. 2.2 we give a more direct proof of consistency by using strict convexity arguments. For ease of presentation, we shall work with a simplified instance of the structured learning setting. 
We focus on binary variables, yi ∈ {0, 1}, and consider discriminant functions corresponding to Ising models, a special case of pairwise MRFs (J denotes the vector of “interaction” parameters): f (y; x, J ) = ij∈E Jij yi yj + i yi xi (3) The singleton potential for variable yi is yi xi and is not dependent on the model parameters. We could have instead used Ji yi xi , which would be more standard. However, this would make the parameter vector J invariant to scaling, complicating the identifiability analysis. In the consistency analysis we will assume that the data is generated using a true parameter vector J ∗ . We will show that as the data size goes to infinity, minimization of ps (J ) yields J ∗ . We begin with an illustrative analysis of the pseudo-max constraints for a model with only two variables, i.e. f (y; x, J) = Jy1 y2 + y1 x1 + y2 x2 . The purpose of the analysis is to demonstrate general principles for when pseudo-max constraints may succeed or fail. Assume that training samples are generated via y(x) = argmaxy f (y; x, J ∗ ). We can partition the input space X into four regions, ˆ ˆ {x ∈ X : y(x) = y } for each of the four configurations y , shown in Fig. 1 (left). The blue lines outline the exact decision boundaries of f (y; x, J ∗ ), with the lines being given by the constraints 4 The methods differ substantially in the non-separable setting where we minimize ps (θ), using a slack variable for every node and example, rather than just one slack variable per example as in (θ). 3 in Θ that hold with equality. The red lines denote the pseudo-max constraints in Θps that hold with equality. For x such that y(x) = (1, 0) or (0, 1), the pseudo-max and exact constraints are identical. We can identify J ∗ by obtaining samples x = (x1 , x2 ) that explore both sides of one of the decision boundaries that depends on J ∗ . The pseudo-max constraints will fail to identify J ∗ if the samples do not sufficiently explore the transitions between y = (0, 1) and y = (1, 1) or between y = (1, 0) and y = (1, 1). This can happen, for example, when the input samples are dependent, giving only rise to the configurations y = (0, 0) and y = (1, 1). For points labeled (1, 1) around the decision line J ∗ + x1 + x2 = 0, pseudo-max can only tell that they respect J ∗ + x1 ≥ 0 and J ∗ + x2 ≥ 0 (dashed red lines), or x1 ≤ 0 and x2 ≤ 0 for points labeled (0, 0). Only constraints that depend on the parameter are effective for learning. For pseudo-max to be able to identify J ∗ , the input samples must be continuous, densely populating the two parameter dependent decision lines that pseudo-max can use. The two point sets in the figure illustrate good and bad input distributions for pseudo-max. The diagonal set would work well with the exact constraints but badly with pseudo-max, and the difference can be arbitrarily large. However, the input distribution on the right, populating the J ∗ + x2 = 0 decision line, would permit pseudo-max to identify J ∗ . 2.1 Identifiability of True Parameters In this section, we show that it is possible to approximately identify the true model parameters, up to model equivalence, using the pseudo-max constraints and a carefully chosen linear number of data points. Consider the learning problem for structured prediction defined on a fixed graph G = (V, E) where the parameters to be learned are pairwise potential functions θij (yi , yj ) for ij ∈ E and single node fields θi (yi ) for i ∈ V . 
We consider discriminant functions of the form f (y; x, θ) = ij∈E θij (yi , yj ) + i θi (yi ) + i xi (yi ), (4) where the input space X = R|V |k specifies the single node potentials. Without loss of generality, we remove the additional degrees of freedom in θ by restricting it to be in a canonical form: θ ∈ Θcan if for all edges θij (yi , yj ) = 0 whenever yi = 0 or yj = 0, and if for all nodes, θi (yi ) = 0 when yi = 0. As a result, assuming the training set comes from a model in this class, and the input fields xi (yi ) exercise the discriminant function appropriately, we can hope to identify θ ∗ ∈ Θcan . Indeed, we show that, for some data sets, the pseudo-max constraints are sufficient to identify θ ∗ . Let Θps ({y m , xm }) be the set of parameters that satisfy the pseudo-max classification constraints m Θps ({y m , xm }) = θ | ∀m, i, yi = yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) . −i (5) m e(yi , yi ), For simplicity we omit the margin losses since the input fields xi (yi ) already suffice to rule out the trivial solution θ = 0. Proposition 2.1. For any θ ∗ ∈ Θcan , there is a set of 2|V |(k − 1) + 2|E|(k − 1)2 examples, {xm , y(xm ; θ ∗ )}, such that any pseudo-max consistent θ ∈ Θps ({y m , xm }) ∩ Θcan is arbitrarily close to θ ∗ . The proof is given in the supplementary material. To illustrate the key ideas, we consider the simpler binary discriminant function discussed in Eq. 3. Note that the binary model is already in the canonical form since Jij yi yj = 0 whenever yi = 0 or yj = 0. For any ij ∈ E, we show how to choose two input examples x1 and x2 such that any J consistent with the pseudo-max constraints for these ∗ ∗ two examples will have Jij ∈ [Jij − , Jij + ]. Repeating this for all of the edge parameters then gives the complete set of examples. The input examples we need for this will depend on J ∗ . For the first example, we set the input fields for all neighbors of i (except j) in such a way that ∗ we force the corresponding labels to be zero. More formally, we set x1 < −|N (k)| maxl |Jkl | for k 1 k ∈ N (i)\j, resulting in yk = 0, where y 1 = y(x1 ). In contrast, we set x1 to a large value, e.g. j ∗ 1 ∗ x1 > |N (j)| maxl |Jjl |, so that yj = 1. Finally, for node i, we set x1 = −Jij + so as to obtain a j i 1 slight preference for yi = 1. All other input fields can be set arbitrarily. As a result, the pseudo-max constraints pertaining to node i are f (y 1 ; x1 , J ) ≥ f (y 1 , yi ; x1 , J ) for yi = 0, 1. By taking into −i 1 account the label assignments for yi and its neighbors, and by removing terms that are the same on both sides of the equation, we get Jij + x1 + x1 ≥ Jij yi + yi x1 + x1 , which, for yi = 0, implies i j i j ∗ that Jij + x1 ≥ 0 or Jij − Jij + ≥ 0. The second example x2 differs only in terms of the input i ∗ 2 ∗ field for i. In particular, we set x2 = −Jij − so that yi = 0. This gives Jij ≤ Jij + , as desired. i 4 2.2 Consistency via Strict Convexity In this section we prove the consistency of the pseudo-max approach by showing that it corresponds to minimizing a strictly convex function. Our proof only requires that p(x) be non-zero for all x ∈ Rn (a simple example being a multi-variate Gaussian) and that J ∗ is finite. We use a discriminant function as in Eq. 3. Now, assume the input points xm are distributed according to p(x) and that y m are obtained via y m = arg maxy f (y; xm , J ∗ ). 
2.2 Consistency via Strict Convexity

In this section we prove the consistency of the pseudo-max approach by showing that it corresponds to minimizing a strictly convex function. Our proof only requires that p(x) be non-zero for all x ∈ R^n (a simple example being a multivariate Gaussian) and that J* is finite. We use a discriminant function as in Eq. 3, assume the input points x^m are distributed according to p(x), and that the y^m are obtained via y^m = argmax_y f(y; x^m, J*). We can write the ℓ_ps(J) objective for finite data, and its limit as M → ∞, compactly as:

$$\ell_{ps}(J) = \frac{1}{M}\sum_m \sum_i \max_{y_i}\,(y_i - y_i^m)\Big(x_i^m + \sum_{k \in N(i)} J_{ki}\, y_k^m\Big) \;\longrightarrow\; \sum_i \int p(x)\, \max_{y_i}\,(y_i - y_i(x))\Big(x_i + \sum_{k \in N(i)} J_{ki}\, y_k(x)\Big)\, dx \qquad (6)$$

where y_i(x) is the label of i for input x under the true parameters J*. Starting from the above, consider the terms separately for each i. We partition the integral over x ∈ R^n into exclusive regions according to the predicted labels of the neighbors of i (given x). Define S_ij = {x : y_j(x) = 1 and y_k(x) = 0 for k ∈ N(i)\j}. Eq. 6 can then be written as

$$\ell_{ps}(J) = \sum_i \Big[\hat{g}_i(\{J_{ik}\}_{k \in N(i)}) + \sum_{k \in N(i)} g_{ik}(J_{ik})\Big], \qquad (7)$$

where g_ik(J_ik) = ∫_{x∈S_ik} p(x) max_{y_i} [(y_i − y_i(x))(x_i + J_ik)] dx, and ĝ_i({J_ik}_{k∈N(i)}) contains all the remaining terms, i.e. those where either zero or more than one neighbor is set to one. The function ĝ_i is convex in J since it is a sum of integrals over convex functions. We proceed to show that g_ik(J_ik) is strictly convex for all choices of i and k ∈ N(i); this establishes that ℓ_ps(J) is strictly convex, being a sum of functions strictly convex in each of the variables in J.

For every value x_i ∈ (−∞, ∞) there is some x in S_ij with that i-th coordinate: for any finite x_i and finite J*, the other x_j's can be chosen so as to give the y configuration corresponding to S_ij. Since p(x) has full support, we have P(S_ij) > 0 and p(x) > 0 for any x in S_ij; consequently, the same holds for the marginal p_i(x_i | S_ij) over x_i within S_ij. After some algebra, we obtain:

$$g_{ij}(J_{ij}) = P(S_{ij}) \int_{-\infty}^{\infty} p_i(x_i \mid S_{ij})\, \max\,[0,\, x_i + J_{ij}]\; dx_i \;-\; \int_{x \in S_{ij}} p(x)\, y_i(x)\,(x_i + J_{ij})\, dx$$

The integral over the y_i(x)(x_i + J_ij) expression just adds a linear term to g_ij(J_ij). The relevant remaining term is (for brevity we drop P(S_ij), a strictly positive constant, and the ij index):

$$h(J) = \int_{-\infty}^{\infty} p_i(x_i \mid S_{ij})\, \max\,[0,\, x_i + J]\; dx_i = \int_{-\infty}^{\infty} p_i(x_i \mid S_{ij})\, \hat{h}(x_i, J)\, dx_i \qquad (8)$$

where we define ĥ(x_i, J) = max[0, x_i + J]. Note that h(J) is convex, since ĥ(x_i, J) is convex in J for every x_i; we want to show that it is strictly convex. Consider J < J′ and α ∈ (0, 1), and define the interval I = [−J′, −αJ − (1−α)J′]. For x_i in the interior of I it holds that α ĥ(x_i, J) + (1−α) ĥ(x_i, J′) > ĥ(x_i, αJ + (1−α)J′), since the ĥ(x_i, J′) term on the left is strictly positive while all remaining terms are zero. For all other x_i the inequality holds but is not necessarily strict (ĥ is always convex in J). Integrating over x_i, and using the fact that p(x) has full support on I, we obtain α h(J) + (1−α) h(J′) > h(αJ + (1−α)J′), so h is strictly convex, as required.

The function ℓ_ps(J) is thus a sum of functions strictly convex in each of its variables (namely the g_ik(J_ik)) plus other convex functions of J, hence strictly convex. We can now proceed to show consistency. By strict convexity, the pseudo-max objective is minimized at a unique point. Since ℓ_ps(J*) = 0 and zero is a lower bound on the value of ℓ_ps(J), it follows that J* is the unique minimizer. Thus, as M → ∞, the minimizer of the pseudo-max objective is the true parameter vector, and we have consistency.

As an example, consider the case of two variables y_1, y_2, with x_1 and x_2 distributed according to N(c_1, 1) and N(0, 1), respectively, and assume J*_12 = 0. Then a simple direct calculation yields:

$$g(J_{12}) = (c_1 + J_{12}) \int_{-J_{12}-c_1}^{-c_1} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx \;-\; \frac{1}{\sqrt{2\pi}}\, e^{-c_1^2/2} \;+\; \frac{1}{\sqrt{2\pi}}\, e^{-(J_{12}+c_1)^2/2} \qquad (9)$$

which is indeed a strictly convex function, minimized at J_12 = 0 (see Fig. 1 for an illustration).
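As a numerical sanity check on Eq. 9 (our sketch, not part of the paper; c_1 = 0.8 is an arbitrary assumed value), the following compares the closed form against a Monte Carlo estimate of the corresponding population pseudo-max term and confirms that both are convex in J_12 with minimum zero at J_12 = 0:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
c1 = 0.8                                  # assumed mean of x1, for illustration
x1 = rng.normal(c1, 1.0, 500_000)         # x1 ~ N(c1, 1); true J*_12 = 0

def g_closed(J):
    """Closed form of Eq. 9 (as reconstructed above)."""
    return ((c1 + J) * (norm.cdf(-c1) - norm.cdf(-J - c1))
            - norm.pdf(c1) + norm.pdf(J + c1))

def g_monte_carlo(J):
    """Monte Carlo estimate of the population pseudo-max term for node 1.

    With J* = 0 the true label is y1(x) = 1[x1 > 0], and the term is
    E[ max_{y1} (y1 - y1(x)) * (x1 + J) ]; the conditioning on S_12
    (y2(x) = 1) drops out because x2 is independent of x1."""
    y1 = (x1 > 0).astype(float)
    return np.mean(np.maximum(0.0, x1 + J) - y1 * (x1 + J))

for J in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"J={J:+.1f}: closed={g_closed(J):.4f}  MC={g_monte_carlo(J):.4f}")
# Both columns should agree, be convex in J, and vanish exactly at J = 0.
```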
3 Hardness of Structured Learning

Most structured prediction learning algorithms use some form of inference as a subroutine, yet the corresponding prediction task is generally NP-hard; for example, maximizing the discriminant function defined in Eq. 3 is equivalent to solving Max-Cut, which is known to be NP-hard. This raises the question of whether it is possible to bypass prediction during learning. Although prediction may be intractable for arbitrary MRFs, what does this say about the difficulty of learning with a polynomial number of data points? In this section, we show that the problem of deciding whether there exists a parameter vector that separates the training data is NP-hard. Put in the context of the positive results in this paper, these hardness results show that, although in some cases the pseudo-max constraints yield a consistent estimate, we cannot hope for a certificate of optimality. Put differently, although the pseudo-max constraints in the separable case always give an outer bound on Θ (and may even be a single point), Θ could be the empty set, and we would never know the difference.

Theorem 3.1. Given labeled examples {(x^m, y^m)}_{m=1}^{M} for a fixed but arbitrary graph G, it is NP-hard to decide whether there exist parameters θ such that ∀m, y^m = argmax_y f(y; x^m, θ).

Proof. Any parameters θ have an equivalent parameterization in canonical form (see Sec. 2.1 and the supplementary material), so the examples are separable if and only if they are separable by some θ ∈ Θ_can. We reduce from unweighted Max-Cut: given an undirected graph G, decide whether there exists a cut of at least K edges. We use the same graph G, with k = 3 states per variable, and construct a small set of examples such that a separating parameter vector exists if and only if there is no cut of K or more edges in G.

Let θ̄ be parameters in canonical form equivalent to θ̄_ij(y_i, y_j) = 1 if (y_i, y_j) ∈ {(1,2), (2,1)}, 0 if y_i = y_j, and −n² if (y_i, y_j) ∈ {(1,3), (2,3), (3,1), (3,2)}. We first construct 4n + 8|E| examples, using the technique described in Sec. 2.1 (see also the supplementary material), which, when restricted to the space Θ_can, constrain the parameters to equal θ̄. We then use one more example (x^m, y^m) where y^m = (3, ..., 3) (every node is in state 3) and, for all i, x^m_i(3) = (K − 1)/n and x^m_i(1) = x^m_i(2) = 0. The first two states encode the original Max-Cut instance, while the third state is used only to construct a labeling y^m whose value equals K − 1.

Let K* be the value of the maximum cut in G. If, in any assignment to the last example, one variable takes state 3 while another takes state 1 or 2, then the assignment's value is at most K* − n², which is less than zero. By construction, the all-3 assignment has value K − 1. Thus, the optimal assignment is either the all-3 assignment, with value K − 1, or an assignment using only states 1 and 2, with value at most K*. If K* > K − 1, the all-3 labeling is not optimal and the examples are not separable; if K* ≤ K − 1, they are separable.

This result illustrates the potential difficulty of learning in worst-case graphs. Nonetheless, many problems have a more restricted dependence on the input; in computer vision, for example, edge potentials may depend only on the difference in color between two adjacent pixels. Our results do not preclude positive learnability results in such restricted settings.
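To illustrate the reduction, here is a small brute-force sketch (ours; the 5-edge graph is an arbitrary assumed instance) that checks, for each K, whether the all-3 labeling is optimal for the final example, i.e. whether the constructed data is separable exactly when no cut of K or more edges exists:

```python
import itertools

n, edges = 4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # assumed instance

def cut_value(side):
    return sum(side[u] != side[v] for u, v in edges)

kstar = max(cut_value(s) for s in itertools.product([0, 1], repeat=n))

def value(y, K):
    """f(y; x, theta-bar) on the reduction's final example (states 1, 2, 3)."""
    val = sum((K - 1) / n for yi in y if yi == 3)       # x_i(3) = (K-1)/n
    for u, v in edges:
        a, b = y[u], y[v]
        if {a, b} == {1, 2}:
            val += 1                                    # edge across the "cut"
        elif a != b and 3 in (a, b):
            val -= n * n                                # -n^2 penalty
    return val

for K in range(kstar - 1, kstar + 3):
    all3 = value((3,) * n, K)                           # equals K - 1
    best_other = max(value(y, K)
                     for y in itertools.product([1, 2, 3], repeat=n)
                     if y != (3,) * n)                  # equals K* here
    print(f"K={K}: all-3 value={all3:.2f}, best other={best_other:.2f}, "
          f"separable={all3 >= best_other}")
# separable should flip to True exactly when K - 1 >= K* (no cut of >= K edges).
```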
By establishing hardness of learning, we also close the open problem of relating hardness of inference and learning in structured prediction: if inference problems can be solved in polynomial time, then so can learning (using, e.g., the structured perceptron), so when learning is hard, inference must be hard as well.

4 Experiments

To evaluate our learning algorithm, we test its performance on both synthetic and real-world datasets. We show that, as the number of training samples grows, the accuracy of the pseudo-max method improves and its speed-up over competing algorithms increases. Our learning algorithm corresponds to solving the following problem, where we add L2 regularization and use a scaled 0-1 loss, e(y_i^m, y_i) = 1{y_i^m ≠ y_i}/n_m (n_m is the number of labels in example m):

$$\min_{\theta}\;\; C\sum_{m=1}^{M} \frac{1}{n_m} \sum_{i=1}^{n_m} \max_{y_i}\Big[f(y^m_{-i}, y_i; x^m, \theta) - f(y^m; x^m, \theta) + e(y_i^m, y_i)\Big] \;+\; \|\theta\|^2. \qquad (10)$$

We compare the pseudo-max method with learning using structural SVMs, both with exact inference and with LP relaxations [see, e.g., 4]. We use exact inference for prediction at test time.

[Figure 2: Test error as a function of train size for the exact, LP-relaxation, and pseudo-max algorithms. Subfigure (a) shows results for a synthetic setting; (b) shows performance on the Reuters data.]

In the synthetic setting we use the discriminant function f(y; x, θ) = Σ_{ij∈E} θ_ij(y_i, y_j) + Σ_i x_i θ_i(y_i), which is similar to Eq. 4. We take a fully connected graph over n = 10 binary labels. For a weight vector θ* (sampled once, uniformly in the range [−1, 1], and used for all train/test sets), we generate train and test instances by sampling x^m uniformly in the range [−5, 5] and then computing the optimal labels y^m = argmax_{y∈Y} f(y; x^m, θ*). We generate train sets of increasing size (M = {10, 50, 100, 500, 1000, 5000}), run the learning algorithms, and measure the test error of the learned weights on 1000 test samples. For each train size we average the test error over 10 repeats of sampling and training. Fig. 2(a) compares the test error of the three learning algorithms. For small numbers of training examples, the test error of pseudo-max is larger than that of the other algorithms; however, as the train size grows, its error converges to that of exact learning, as our consistency results predict.

We also test the performance of our algorithm on a multi-label document classification task from the Reuters dataset [7]. The data consists of M = 23149 training samples, and we use a reduction of the dataset to the 5 most frequent labels. The 5 label variables form a fully connected pairwise graph structure (see [4] for a similar setting). We use random subsamples of increasing size from the train set to learn the parameters, and then measure the test error using 20000 additional samples. For each sample size and learning algorithm, we optimize the trade-off parameter C using 30% of the training data as a hold-out set. Fig. 2(b) shows that in the large-data regime the performance of pseudo-max learning approaches that of the other methods. However, unlike in the synthetic setting, a small gap remains even after seeing the entire train set.
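For concreteness, here is a minimal sketch of pseudo-max training in the synthetic setting (ours, not the paper's implementation: a smaller n = 6 graph, plain subgradient descent in place of a QP solver, and untuned constants C and lr, all illustrative assumptions). The point is only the structure of the method: training touches single-flip constraints exclusively, so no argmax is ever computed during learning; exact inference is used only to generate data and to predict at test time.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n = 6                                                # labels (paper uses n = 10)
edges = list(itertools.combinations(range(n), 2))    # fully connected graph

def f(y, x, we, wn):
    """Simplified binary variant of the synthetic discriminant."""
    return (sum(we[k] * y[a] * y[b] for k, (a, b) in enumerate(edges))
            + sum(x[v] * wn[v] * y[v] for v in range(n)))

def exact_argmax(x, we, wn):
    return max(itertools.product([0, 1], repeat=n), key=lambda y: f(y, x, we, wn))

# Data from an assumed ground-truth model, as in the paper's synthetic setup.
we_star, wn_star = rng.uniform(-1, 1, len(edges)), rng.uniform(-1, 1, n)
X = rng.uniform(-5, 5, (300, n))
Y = [np.array(exact_argmax(x, we_star, wn_star)) for x in X]

# Pseudo-max training: hinge constraints on single-variable flips only.
we, wn = np.zeros(len(edges)), np.zeros(n)
C, lr = 10.0, 0.05
for epoch in range(30):
    for x, y in zip(X, Y):
        for i in range(n):
            yf = y.copy(); yf[i] = 1 - y[i]
            if f(y, x, we, wn) - f(yf, x, we, wn) < 1.0 / n:  # scaled 0-1 margin
                for k, (a, b) in enumerate(edges):            # subgradient step
                    we[k] += lr * C * (y[a] * y[b] - yf[a] * yf[b])
                wn[i] += lr * C * x[i] * (y[i] - yf[i])
    we *= (1 - lr); wn *= (1 - lr)   # crude shrinkage standing in for the L2 term

Xtest = rng.uniform(-5, 5, (200, n))
err = np.mean([exact_argmax(x, we, wn) != exact_argmax(x, we_star, wn_star)
               for x in Xtest])
print(f"test labeling error with exact inference at test time: {err:.3f}")
```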
This could be because the full dataset is not yet large enough to reach the consistent regime (note that the error of exact learning has also not yet flattened), or because the consistency conditions are not fully satisfied: the data might be non-separable, or the support of the input distribution p(x) may be partial.

We next apply our method to the problem of learning the energy function for protein side-chain placement, mirroring the learning setup of [14], where the authors train a conditional random field (CRF) using tree-reweighted belief propagation to maximize a lower bound on the likelihood.⁵ The prediction problem for side-chain placement corresponds to finding the most likely assignment in a pairwise MRF and fits naturally into our learning framework. There are only 8 parameters to be learned, corresponding to a reweighting of known energy terms. The dataset consists of 275 proteins, where each MRF has several hundred variables (one per residue of the protein) and each variable has on average 20 states. For prediction we use CPLEX's ILP solver.

Fig. 3 compares the pseudo-max method with a cutting-plane algorithm that uses an LP relaxation, solved with CPLEX, to find violated constraints.⁶ We generate training sets of increasing size (M = {10, 50, 100, 274}) and measure the test error of the learned weights on the remaining examples.⁷ For M = 10, 50, 100 we average the test error over 3 random train/test splits, whereas for M = 274 we perform leave-one-out cross-validation. We use C = 1 for both algorithms.

⁵ The authors' data and results are available from http://cyanover.fhcrc.org/recomb-2007/
⁶ We significantly optimized the cutting-plane algorithm, e.g. by including a large number of initial cutting-planes and restricting the weight vector to be positive (which we know to hold at optimality).
⁷ Specifically, for each protein we compute the fraction of correctly predicted χ1 and χ2 angles over all residues (excluding trivial cases, e.g. those with just 1 state), and then take the median of this value across all proteins.

[Figure 3: Training time (for one train/test split) and test error as a function of train size for the pseudo-max method and a cutting-plane algorithm using an LP relaxation for inference, applied to learning the energy function for protein side-chain placement. The pseudo-max method obtains better accuracy than both the LP relaxation and the HCRF (given roughly five times more data) for a fraction of the training time.]

The original weights ("Soft rep" [3]) used for this energy function give 26.7% error across all 275 proteins. The best previously reported parameters, learned in [14] using a hidden CRF, obtain 25.6% error (their training set included 55 of these 275 proteins, so this is an optimistic estimate). To get a sense of the difficulty of this learning task, we also tried a random positive weight vector, uniformly sampled from the range [0, 1], obtaining an error of 34.9% (results would be much worse if we allowed the weights to be negative). Training with pseudo-max on 50 examples, we learn parameters in under a minute that give better accuracy than the HCRF. The speed-up of training with pseudo-max (using CPLEX's QP solver) over cutting-plane is striking: for M = 10, pseudo-max takes only 3 seconds, a 1000-fold speedup.
Unfortunately, the cutting-plane algorithm took a prohibitive amount of time to run on the larger training sets. Since the data used in learning for protein side-chain placement is both highly non-separable and relatively scarce, these positive results illustrate the potentially widespread applicability of the pseudo-max method.

5 Discussion

The key idea of our method is to find parameters that prefer the true assignment y^m over assignments that differ from it in only one variable, rather than over all other assignments. Perhaps surprisingly, this weak requirement is sufficient to achieve consistency given a rich enough input distribution. One extension of our approach is to add constraints for assignments that differ from y^m in more than one variable. This would tighten the outer bound on Θ and possibly result in improved performance, but would also increase computational complexity. We could also add such competing assignments via a cutting-plane scheme, so that optimization is performed only over a subset of these constraints.

Our work raises a number of important open problems. It would be interesting to derive generalization bounds to understand the convergence rate of our method, as well as to understand the effect of the distribution p(x) on these rates. The distribution p(x) needs to have two key properties. On the one hand, it needs to explore the space Y, in the sense that a sufficiently rich set of labelings must be obtained as the correct label under the true parameters (this is indeed used in our consistency proofs). On the other hand, p(x) needs to be sufficiently sensitive close to the decision boundaries, so that the true parameters can be inferred. We expect that generalization analysis will depend on both of these properties of p(x). Note that [11] studied active learning schemes for structured data and may be relevant in the current context.

How should one apply this learning algorithm to non-separable data sets? We suggested one approach, based on using a hinge loss for each of the pseudo-max constraints. A question in this context is how resilient this learning algorithm is to label noise. Recent work has analyzed the sensitivity of pseudo-likelihood methods to model mis-specification [8], and it would be interesting to perform a similar analysis here. Also, is it possible to give any guarantees for the empirical and expected risks (with respect to exact inference) obtained by outer-bound learning versus exact learning? Finally, our algorithm demonstrates a phenomenon where more data can make computation easier. Such a scenario was recently analyzed in the context of supervised learning [12], and it would be interesting to combine the approaches.

Acknowledgments: We thank Chen Yanover for his assistance with the protein data. This work was supported by BSF grant 2008303 and a Google Research Grant. D.S. was supported by a Google PhD Fellowship.

References
[1] J. Besag. The analysis of non-lattice data. The Statistician, 24:179–195, 1975.
[2] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
[3] G. Dantas, C. Corrent, S. L. Reichow, J. J. Havranek, Z. M. Eletr, N. G. Isern, B. Kuhlman, G. Varani, E. A. Merritt, and D. Baker. High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. Journal of Molecular Biology, 366(4):1209–1221, 2007.
[4] T. Finley and T. Joachims.
Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, pages 304–311. ACM, 2008.
[5] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
[6] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Advances in Neural Information Processing Systems 20, pages 785–792, 2008.
[7] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: a new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.
[8] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th International Conference on Machine Learning, pages 584–591. ACM, 2008.
[9] A. F. T. Martins, N. A. Smith, and E. P. Xing. Polyhedral outer approximations with application to natural language parsing. In ICML, pages 713–720, 2009.
[10] N. Ratliff, J. A. D. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007.
[11] D. Roth and K. Small. Margin-based active learning for structured output spaces. In Proceedings of the European Conference on Machine Learning (ECML). Springer, 2006.
[12] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928–935. ACM, 2008.
[13] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems 16, pages 25–32, 2004.
[14] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008.
6 0.78924781 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
7 0.78880697 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
8 0.78799421 103 nips-2010-Generating more realistic images using gated MRF's
9 0.78779483 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
10 0.78721243 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
11 0.78716445 162 nips-2010-Link Discovery using Graph Feature Tracking
12 0.78707689 94 nips-2010-Feature Set Embedding for Incomplete Data
13 0.78701711 287 nips-2010-Worst-Case Linear Discriminant Analysis
14 0.78688431 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect
15 0.78671014 177 nips-2010-Multitask Learning without Label Correspondences
16 0.78665781 208 nips-2010-Policy gradients in linearly-solvable MDPs
17 0.7865355 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
18 0.78623819 144 nips-2010-Learning Efficient Markov Networks
19 0.78616166 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
20 0.7860356 224 nips-2010-Regularized estimation of image statistics by Score Matching