nips nips2011 nips2011-3 knowledge-graph by maker-knowledge-mining

3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems


Source: pdf

Author: Jacob D. Abernethy, Rafael M. Frongillo

Abstract: Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of “crowdsourcing” prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively “learn” a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively “learn” a hypothesis for a given prediction task. [sent-8, score-0.419]

2 The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. [sent-9, score-0.195]

3 In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. [sent-10, score-0.562]

4 The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set. [sent-11, score-0.167]

5 This type of prediction competition provides a nice tool for companies and institutions that need help with a given prediction task yet can not afford to hire an expert. [sent-19, score-0.22]

6 This is in stark contrast to many other projects that rely on crowdsourcing – Wikipedia being a prime example, where participants must build off the work of others. [sent-24, score-0.308]

7 Indeed, in the case of the Netflix prize, not only do leading participants lack incentives to share, but the work of non-winning competitors is effectively wasted. [sent-25, score-0.382]

8 In this paper we describe a new and very general mechanism to crowdsource prediction/learning problems. [sent-36, score-0.227]

9 Our mechanism requires participants to place bets, yet the space they are betting over is the set of hypotheses for the learning task at hand. [sent-37, score-0.465]

10 At any given time the mechanism publishes the current hypothesis w and participants can wager on a modification of w to w , upon which the modified w is posted. [sent-38, score-0.621]

11 Eventually the wagering period finishes, a set of test data is revealed, and each participant receives a payout according to their bets. [sent-39, score-0.533]

12 The framework we propose has many qualities similar to that of an information or prediction market, and many of the ideas derive from recent research on the design of automated market makers [7, 8, 3, 4, 1]. [sent-41, score-0.351]

13 Many information markets already exist; at sites like Intrade. [sent-42, score-0.188]

14 There has been a burst of interest in such markets in recent years, not least of which is due to their potential for combining large amounts of information from a range of sources. [sent-45, score-0.188]

15 In the words of Hanson et al [9]: “Rational expectations theory predicts that, in equilibrium, asset prices will reflect all of the information held by market participants. [sent-46, score-0.304]

16 This theorized information aggregation property of prices has lead economists to become increasingly interested in using securities markets to predict future events. [sent-47, score-0.333]

17 ” In practice, prediction markets have proven impressively accurate as a forecasting tool [11, 2, 12]. [sent-48, score-0.263]

18 The central contribution of the present paper is to take the framework of a prediction market as a tool for information aggregation and to apply this tool for the purpose of “aggregating” a hypothesis (classifier, predictor, etc. [sent-49, score-0.441]

19 In contrast to the Netflix prize, which pitted teams of participants against each other, the mechanism we propose allows for everyone to contribute whatever knowledge they may have available towards the final solution. [sent-52, score-0.481]

20 Whereas a market price can be thought of as representing a consensus estimate of the value of an asset, our goal is to construct a consensus hypothesis reflecting all the knowledge and capabilities about a particular learning problem1 . [sent-54, score-0.325]

21 1 by introducing the simple notion of a generalized scoring rule L(·, ·) representing the “loss function” of the learning task at hand. [sent-56, score-0.187]

22 2 we describe our proposed Crowdsourced Learning Mechanism (CLM) in detail, and discuss how to structure a CLM for a particular scoring function L, in order that the traders are given incentives to minimize L. [sent-58, score-0.327]

23 In Section 4 we discuss previous work on the design of prediction markets using an automated prediction market maker (APMM). [sent-60, score-0.644]

24 We shall use ∆(S) to refer to the set of integrable probability distributions over the set 1 It is worth noting that Barbu and Lay utilized concepts from prediction markets to design algorithms for classifier aggregation [10], although their approach was unrelated to crowdsourcing. [sent-67, score-0.449]

25 A scoring rule is a function S : P × O → R where, for all P ∈ P, P ∈ argmaxQ∈P EX∼P S(Q, X). [sent-79, score-0.187]

26 Given a convex hypothesis space H ⊂ Rn and an outcome space O, let L : H × O → R be a continuous function. [sent-82, score-0.195]

27 The generalized scoring rule shall represent the “loss function” for the learning problem at hand, and in Section 2. [sent-85, score-0.288]

28 The hypothesis w shall represent the advice we receive from the crowd, X shall represent the test data to be revealed at the close of the mechanism, and L(w; X) shall represent the loss of the advised w on the data X. [sent-87, score-0.484]

29 Our scoring rule differs from traditional scoring rules in an important way. [sent-90, score-0.369]

30 Instead of starting with the desire know about the true value of X, and then designing a scoring rule which incentivizes participants to elicit their belief P ∈ P, our objective is precisely to minimize our scoring rule. [sent-91, score-0.655]

31 In other words, traditional scoring rules were a means to an end (eliciting P ) but our generalized scoring rule is the end itself. [sent-92, score-0.369]

32 One can recover the traditional scoring rule definition by setting H = P and imposing the constraint that P ∈ WL (P ). [sent-93, score-0.187]

33 As we will see in section 4, this also leads to efficient prediction markets and crowdsourcing mechanisms. [sent-103, score-0.321]

34 2 The Crowdsourced Learning Mechanism We will now define our actual mechanism rigorously. [sent-105, score-0.189]

35 The function Cost : H × H → R sets the cost charged to a participant that makes a modification to the posted hypothesis. [sent-108, score-0.227]

36 The function Payout : H × H × O → R determines the amount paid to each participant when the outcome is revealed to be X. [sent-109, score-0.249]

37 Of course, this profit will precisely determine the incentives of our mechanism, and hence a key question is: how can we design Cost and Payout so that participants are incentivized to provide good hypotheses? [sent-114, score-0.427]

38 The answer is that we shall structure the incentives around a GSR L(w; X) chosen by the mechanism designer. [sent-115, score-0.392]

39 When a CLM implements a given L, the incentives are structured in order that the participants will work to minimize L(w; X). [sent-120, score-0.403]

40 Of course, the input X is unknown to the participants, yet we can assume that the mechanism has provided a public “training set” to use in a learning algorithm. [sent-121, score-0.189]

41 The participants are thus asked not only to propose a “good” hypothesis wt but to wager on whether the update wt−1 → wt improves generalization error. [sent-122, score-0.644]

42 Given a GSR L : H × O → R and an L-CLM (Cost, Payout), any participant who knows the true distribution P ∈ P over X will maximize expected profit by modifying the hypothesis to any w ∈ WL (P ). [sent-125, score-0.204]

43 It is clear that the agent operating the mechanism must pay the participants at the close of the competition, and is thus at risk of losing money (in fact, it is possible he may gain). [sent-127, score-0.536]

44 How much money is lost depends on the bets (wt → wt+1 ) made by the participants, and of course the final outcome X. [sent-128, score-0.219]

45 The agent has a clear interest in knowing precisely the potential cost – fortunately this cost is easy to compute. [sent-129, score-0.303]

46 This is a simple yet appealing property of the CLM: the agent pays only as much in reward to the participants as it benefits from the improvement of wT over the initial w0 . [sent-131, score-0.363]

47 A more typical scenario, of course, is where the participants provide an improved hypothesis, in which case the CLM will run at a cost. [sent-133, score-0.25]

48 Given a budget 4 of size $B, the mechanism can always rescale L in order that WorstCaseLoss(L-CLM) = B. [sent-135, score-0.189]

49 We shall say a CLM has the tractable trading (TT) property if, given a current hypothesis w, a belief P ∈ ∆(O) and a budget B, one can efficiently compute an element of the set argmin EX∼P Profit(w, w , X) : Cost(w, w ) ≤ B . [sent-139, score-0.261]

50 w ∈H The EC property ensures that the mechanism operator can run the CLM efficiently. [sent-140, score-0.22]

51 The TT property says that participants can compute the optimal hypothesis to bet on given a belief on the outcome and a budget. [sent-141, score-0.527]

52 This is absolutely essential for the CLM to successfully aggregate the knowledge and expertise of the crowd – without it, despite their motivation to lower L(; ), the participants would not be able to compute the optimal bet. [sent-142, score-0.32]

53 We say that a CLM has the escrow (ES) property if the Cost and Payout functions are structured in order that, given any wager (w → w ), we have that Payout(w, w ; X) ≥ 0 for all X ∈ O. [sent-144, score-0.163]

54 The problem with this approach is that potentially Payout(w, w ; X) < 0 which implies that the participant who wagered on (w → w ) can be indebted to the mechanism and could default on this obligation. [sent-147, score-0.299]

55 Thus the Cost function should be set in order to require every participant to deposit at least enough collateral in escrow to cover any possible losses. [sent-148, score-0.183]

56 One practical weakness of a wagering-based mechanism is that individuals may be hesitant to participate when it requires depositing actual money into the system. [sent-150, score-0.277]

57 This can be allayed to a reasonable degree by including a voucher pool where each of the first m participants may receive a voucher in the amount of $C. [sent-151, score-0.368]

58 Imagine a firm is looking to do compression on an unfamiliar channel, and from this channel the firm will receive a stream of m characters from an n-sized alphabet which we shall index by [n]. [sent-155, score-0.233]

59 It is easy to see that if the characters are sampled from some “true” distribution p, then the expected cost L(q; p) := Ei∼p [L(q; i)] = KL(p; q) + H(p), which is minimized at q = p. [sent-160, score-0.167]

60 We set H = ∆n and O = [n], where a hypothesis q represents the proposed distribution over the n characters, and X is some character sampled uniformly from the stream after it has been observed. [sent-165, score-0.196]

61 We have devised this payout scheme according to the selection of a single character i, and it is worth noting that because this character is sampled uniformly at random from the stream (with private randomness), the participants cannot know which character will be released. [sent-169, score-0.89]

62 This forces the ˆ participants to wager on the empirical distribution p of the characters from the stream. [sent-170, score-0.388]

63 A reasonable ˆ alternative, and one which lowers the payment variance, is to payout according to the L(q; p), which is also equal to the average of L(q; i) when i is chosen uniformly from the stream. [sent-171, score-0.394]

64 More precisely, if the firm uses the final estimate qT from the mechanism, instead of the initial guess q0 , what is the trade-off between the money paid to participants and the money gained by using the crowdsourced hypothesis? [sent-173, score-0.484]

65 At first glance, it appears that this trade-off can be arbitrarily bad: the worst case cost of encoding the stream using the final estimate qT is supi,qT − log(qT (i)) = ∞. [sent-174, score-0.205]

66 Amazingly, however, by virtue of the aligned incentives, the firm has a very strong control of its total cost (the CLM cost plus the encoding cost). [sent-175, score-0.27]

67 4 Prediction Markets as a Special Case Let us briefly review the literature for the type of prediction markets relevant to the present work. [sent-184, score-0.263]

68 Hanson [7, 8] proposed a framework in which traders make “reports” to the market about their internal belief in the form of a distribution p ∈ ∆n . [sent-186, score-0.339]

69 Each trader would receive a reward (or loss) based on a function of their proposed belief and the belief of the previous trader, and the function suggested by Hanson was the Logarithmic Market Scoring Rule (LMSR). [sent-187, score-0.188]

70 It was shown later that the LMSR-based market is equivalent to what is known as a cost function based automated market makers, proposed by Chen and Pennock [3]. [sent-188, score-0.624]

71 More recently a much broader equivalence was established by Chen and Wortman Vaughan [4] between markets based on cost functions and those based on scoring rules. [sent-189, score-0.457]

72 The market framework proposed by Chen and Pennock allows traders to buy and sell Arrow-Debreu securities (equivalently: shares, contracts), where an Arrow-Debreu security corresponding to outcome i pays out $1 if and only if i is realized. [sent-190, score-0.482]

73 All shares are bought and sold through an automated market maker, which is the entity managing the market and setting prices. [sent-191, score-0.55]

74 At any time period, traders can purchase bundles of contracts r ∈ Rn , where r(i) represents the number of shares purchased on 6 outcome i. [sent-192, score-0.23]

75 The price of a bundle r is set as C(s + r) − C(s), where C is some differentiable convex cost function and s ∈ Rn is the “quantity vector” representing the total number of outstanding n 1 shares. [sent-193, score-0.194]

76 When the set of potential outcomes O is of exponential size or even infinite, the market designer can offer a restricted number of contracts, say n ( |O|), rather than offer an Arrow-Debreu contract for each member of O. [sent-197, score-0.321]

77 To determine the payout structure, the market designer chooses a function ρ : O → Rn , where contract i returns a payout of ρi (X) and, thus, a contract bundle r pays ρ(X) · r. [sent-198, score-1.245]

78 As with the framework of Chen and Pennock, the contract prices are set according to a cost function C, so that a bundle r has a price of C(s + r) − C(s). [sent-199, score-0.253]

79 For the remainder of this section we shall discuss the prediction market template of Abernethy et al. [sent-202, score-0.407]

80 as it provides the most general model; we shall refer to such a market as an Automated Prediction Market Maker. [sent-203, score-0.332]

81 (3) Hence we can think of APMM prediction markets in terms of our learning mechanism. [sent-209, score-0.293]

82 We now ask, what is the learning problem that the participants of an APMM are trying to solve? [sent-213, score-0.25]

83 There is another more subtle benefit to APMMs – and, in fact, to most prediction market mechanisms in practice – which is that participants make bets via purchasing of shares or share bundles. [sent-217, score-0.65]

84 When a trader makes a bet, she purchases a contract bundle r, is charged C(s + r) − C(s) (when the current quantity vector is s), and shall receive payout ρ(X) · r if and when X is realized. [sent-218, score-0.705]

85 In this sense bets made in an APMM are stateless, whereas for an arbitrary CLM this may not be the case: the wager defined by (wt → wt+1 ) can not necessarily be sold back to the mechanism, as the posted hypothesis may no longer remain at wt+1 . [sent-220, score-0.233]

86 Theorem 1 establishes a strong connection between prediction markets and a natural class of GSRs. [sent-232, score-0.263]

87 One interpretation of this result is that any GSR based on a Bregman divergence has a “dual” characterization as a share-based market, where participants buy and sell shares rather than directly altering the share prices (the hypothesis). [sent-233, score-0.372]

88 This has many advantages for prediction markets, not least of which is that shares are often easier to think about than the underlying hypothesis space. [sent-234, score-0.242]

89 We let H be the d 2 -norm ball of radius 1 in R , and we shall let an outcome be a batch of a data, that is X := {(x1 , y1 ), . [sent-240, score-0.171]

90 A budget-constrained profit-maximizing participant must simply solve a convex optimization problem, and hence the TT property holds. [sent-253, score-0.172]

91 The host then requests participants to provide predictions on a specified set of instances on which it has correct labels. [sent-257, score-0.25]

92 Of course, this approach is quite different from the Netflix Prize model, in two key respects: (a) the participants have to wager on their predictions and (b) by participating in the mechanism they are required to reveal their modification to all of the other players. [sent-264, score-0.527]

93 Hence while we have structured a competitive process the participants are de facto forced to collaborate on the solution. [sent-265, score-0.25]

94 A reasonable critique of this collaborative mechanism approach to a Netflix-style competition is that it does not provide the instant feedback of the “leaderboard” where individuals observe performance improvements in real time. [sent-266, score-0.288]

95 However, we can adjust our mechanism to be online with a very simple modification of the CLM protocol, which we sketch here. [sent-267, score-0.189]

96 At each interval, the designer could select a (potentially random) subset S of user/movie pairs in the remaining test set, freeze updates on the predictions w(k) for all k ∈ S, and perform payouts to the participants on only these labels. [sent-269, score-0.338]

97 What makes this possible, of course, is that the generalized scoring rule we chose decomposes as a sum over the individual labels. [sent-270, score-0.187]

98 Results from a dozen years of election futures markets research. [sent-287, score-0.188]

99 A new understanding of prediction markets via no-regret learning. [sent-302, score-0.263]

100 Logarithmic market scoring rules for modular combinatorial information aggregation. [sent-324, score-0.413]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clm', 0.453), ('payout', 0.394), ('participants', 0.25), ('apmm', 0.234), ('market', 0.231), ('mechanism', 0.189), ('markets', 0.188), ('gsr', 0.175), ('scoring', 0.152), ('cost', 0.117), ('participant', 0.11), ('wt', 0.106), ('qt', 0.102), ('incentives', 0.102), ('shall', 0.101), ('ix', 0.097), ('prize', 0.094), ('hypothesis', 0.094), ('profit', 0.088), ('trader', 0.088), ('wager', 0.088), ('rm', 0.084), ('net', 0.08), ('crowdsourced', 0.077), ('prediction', 0.075), ('traders', 0.073), ('outcome', 0.07), ('competition', 0.07), ('hanson', 0.064), ('kl', 0.061), ('money', 0.059), ('competitions', 0.058), ('worstcaseloss', 0.058), ('abernethy', 0.058), ('crowdsourcing', 0.058), ('stream', 0.052), ('bets', 0.051), ('implements', 0.051), ('characters', 0.05), ('dr', 0.05), ('wl', 0.05), ('character', 0.05), ('pro', 0.05), ('bet', 0.047), ('bundle', 0.046), ('contract', 0.046), ('automated', 0.045), ('designer', 0.044), ('pays', 0.044), ('prices', 0.044), ('worth', 0.044), ('contracts', 0.044), ('escrow', 0.044), ('incentivized', 0.044), ('lmsr', 0.044), ('payouts', 0.044), ('pennock', 0.044), ('relint', 0.044), ('voucher', 0.044), ('shares', 0.043), ('crowd', 0.042), ('teams', 0.042), ('aggregation', 0.041), ('ex', 0.041), ('course', 0.039), ('paid', 0.039), ('crowdsource', 0.038), ('agent', 0.038), ('encoding', 0.036), ('sell', 0.035), ('wortman', 0.035), ('rule', 0.035), ('ec', 0.035), ('belief', 0.035), ('economic', 0.032), ('precisely', 0.031), ('convex', 0.031), ('property', 0.031), ('competitors', 0.03), ('maker', 0.03), ('rules', 0.03), ('rn', 0.03), ('receive', 0.03), ('revealed', 0.03), ('think', 0.03), ('individuals', 0.029), ('asset', 0.029), ('bid', 0.029), ('clms', 0.029), ('collateral', 0.029), ('huffman', 0.029), ('proprietary', 0.029), ('securities', 0.029), ('wagering', 0.029), ('chen', 0.029), ('expertise', 0.028), ('bregman', 0.027), ('advice', 0.027), ('released', 0.026), ('betting', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

Author: Jacob D. Abernethy, Rafael M. Frongillo

Abstract: Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of “crowdsourcing” prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively “learn” a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set. 1

2 0.090886809 90 nips-2011-Evaluating the inverse decision-making approach to preference learning

Author: Alan Jern, Christopher G. Lucas, Charles Kemp

Abstract: Psychologists have recently begun to develop computational accounts of how people infer others’ preferences from their behavior. The inverse decision-making approach proposes that people infer preferences by inverting a generative model of decision-making. Existing data sets, however, do not provide sufficient resolution to thoroughly evaluate this approach. We introduce a new preference learning task that provides a benchmark for evaluating computational accounts and use it to compare the inverse decision-making approach to a feature-based approach, which relies on a discriminative combination of decision features. Our data support the inverse decision-making approach to preference learning. A basic principle of decision-making is that knowing people’s preferences allows us to predict how they will behave: if you know your friend likes comedies and hates horror films, you can probably guess which of these options she will choose when she goes to the theater. Often, however, we do not know what other people like and we can only infer their preferences from their behavior. If you know that a different friend saw a comedy today, does that mean that he likes comedies in general? The conclusion you draw will likely depend on what else was playing and what movie choices he has made in the past. A goal for social cognition research is to develop a computational account of people’s ability to infer others’ preferences. One computational approach is based on inverse decision-making. This approach begins with a model of how someone’s preferences lead to a decision. Then, this model is inverted to determine the most likely preferences that motivated an observed decision. An alternative approach might simply learn a functional mapping between features of an observed decision and the preferences that motivated it. For instance, in your friend’s decision to see a comedy, perhaps the more movie options he turned down, the more likely it is that he has a true preference for comedies. The difference between the inverse decision-making approach and the feature-based approach maps onto the standard dichotomy between generative and discriminative models. Economists have developed an instance of the inverse decision-making approach known as the multinomial logit model [1] that has been widely used to infer consumer’s preferences from their choices. This model has recently been explored as a psychological model [2, 3, 4], but there are few behavioral data sets for evaluating it as a model of how people learn others’ preferences. Additionally, the data sets that do exist tend to be drawn from the developmental literature, which focuses on simple tasks that collect only one or two judgments from children [5, 6, 7]. The limitations of these data sets make it difficult to evaluate the multinomial logit model with respect to alternative accounts of preference learning like the feature-based approach. In this paper, we use data from a new experimental task that elicits a detailed set of preference judgments from a single participant in order to evaluate the predictions of several preference learning models from both the inverse decision-making and feature-based classes. Our task requires each participant to sort a large number of observed decisions on the basis of how strongly they indicate 1 (a) (b) (c) d c c (d) b b a a x d x d c b a x 1. Number of chosen effects (−/+) 2. Number of forgone effects (+/+) 3. Number of forgone options (+/+) 4. Number of forgone options containing x (−/−) 5. Max/min number of effects in a forgone option (+/−) 6. Is x in every option? (−/−) 7. Chose only option with x? (+/+) 8. Is x the only difference between options? (+/+) 9. Do all options have same number of effects? (+/+) 10. Chose option with max/min number of effects? (−/−) Figure 1: (a)–(c) Examples of the decisions used in the experiments. Each column represents one option and the boxes represent different effects. The chosen option is indicated by the black rectangle. (d) Features used by the weighted feature and ranked feature models. Features 5 and 10 involved maxima in Experiment 1, which focused on all positive effects, and minima in Experiment 2, which focused on all negative effects. The signs in parentheses indicate the direction of the feature that suggests a stronger preference in Experiment 1 / Experiment 2. a preference for a chosen item. Because the number of decisions is large and these decisions vary on multiple dimensions, predicting how people will order them offers a challenging benchmark on which to compare computational models of preference learning. Data sets from these sorts of detailed tasks have proved fruitful in other domains. For example, data reported by Shepard, Hovland, and Jenkins [8]; Osherson, Smith, Wilkie, L´ pez, and Shafir [9]; and Wasserman, Elek, Chatlosh, o and Baker [10] have motivated much subsequent research on category learning, inductive reasoning, and causal reasoning, respectively. We first describe our preference learning task in detail. We then present several inverse decisionmaking and feature-based models of preference learning and compare these models’ predictions to people’s judgments in two experiments. The data are well predicted by models that follow the inverse decision-making approach, suggesting that this computational approach may help explain how people learn others’ preferences. 1 Multi-attribute decisions and revealed preferences We designed a task that can be used to elicit a large number of preference judgments from a single participant. The task involves a set of observed multi-attribute decisions, some examples of which are represented visually in Figure 1. Each decision is among a set of options and each option produces a set of effects. Figure 1 shows several decisions involving a total of five effects distributed among up to five options. The differently colored boxes represent different effects and the chosen option is marked by a black rectangle. For example, 1a shows a choice between an option with four effects and an option with a single effect; here, the decision maker chose the second option. In our task, people are asked to rank a large number of these decisions by how strongly they suggest that the decision maker had a preference for a particular effect (e.g., effect x in Figure 1). By imposing some minimal constraints, the space of unique multi-attribute decisions is finite and we can obtain rankings for every decision in the space. For example, Figure 2c shows a complete list of 47 unique decisions involving up to five effects, subject to several constraints described later. Three of these decisions are shown in Figure 1. If all the effects are positive—pieces of candy, for example—the first decision (1a) suggests a strong preference for candy x, because the decision maker turned down four pieces in favor of one. The second decision (1b), however, offers much weaker evidence because nearly everyone would choose four pieces of candy over one, even without a specific preference for x. The third decision (1c) provides evidence that is strong but perhaps not quite as strong as the first decision. When all effects are negative—like electric shocks at different body locations—decision makers may still find some effects more tolerable than others, but different inferences are sometimes supported. For example, for negative effects, 1a provides weak evidence that x is relatively tolerable because nearly everyone would choose one shock over four. 2 A computational account of preference learning We now describe a simple computational model for learning a person’s preferences after observing that person make a decision like the ones in Figure 1. We assume that there are n available options 2 {o1 , . . . , on }, each of which produces one or more effects from the set {f1 , f2 , ..., fm }. For simplicity, we assume that effects are binary. Let ui denote the utility the decision maker assigns to effect fi . We begin by specifying a model of decision-making that makes the standard assumptions that decision makers tend to choose things with greater utility and that utilities are additive. That is, if fj is a binary vector indicating the effects produced by option oj and u is a vector of utilities assigned to each of the m effects, then the total utility associated with option oj can be expressed as Uj = fj T u. We complete the specification of the model by applying the Luce choice rule [11], a common psychological model of choice behavior, as the function that chooses among the options: p(c = oj |u, f ) = exp(Uj ) = exp(Uk ) n k=1 exp(fj T u) n T k=1 exp(fk u) (1) where c denotes the choice made. This model can predict the choice someone will make among a specified set of options, given the utilities that person assigns to the effects in each option. To obtain estimates of someone’s utilities, we invert this model by applying Bayes’ rule: p(u|c, F) = p(c|u, F)p(u) p(c|F) (2) where F = {f1 , . . . , fn } specifies the available options and their corresponding effects. This is the multinomial logit model [1], a standard econometric model. In order to apply Equation 2 we must specify a prior p(u) on the utilities. We adopt a standard approach that places independent Gaussian priors on the utilities: ui ∼ N (µ, σ 2 ). For decisions where effects are positive—like candies—we set µ = 2σ, which corresponds to a prior distribution that places approximately 2% of the probability mass below zero. Similarly, for negative effects—like electric shocks—we set µ = −2σ. 2.1 Ordering a set of observed decisions Equation 2 specifies a posterior probability distribution over utilities for a single observed decision but does not provide a way to compare the inferences drawn from multiple decisions for the purposes of ordering them. Suppose we are interested in a decision maker’s preference for effect x and we wish to order a set of decisions by how strongly they support this preference. Two criteria for ordering the decisions are as follows: Absolute utility Relative utility p(c|ux , F)p(ux ) p(c|F) p(c|∀j ux ≥ uj , F)p(∀j ux ≥ uj ) p(∀j ux ≥ uj |c, F) = p(c|F) E(ux |c, F) = Eux The absolute utility model orders decisions by the mean posterior utility for effect x. This criterion is perhaps the most natural way to assess how much a decision indicates a preference for x, but it requires an inference about the utility of x in isolation, and research suggests that people often think about the utility of an effect only in relation to other salient possibilities [12]. The relative utility model applies this idea to preference learning by ordering decisions based on how strongly they suggest that x has a greater utility than all other effects. The decisions in Figures 1b and 1c are cases where the two models lead to different predictions. If the effects are all negative (e.g., electric shocks), the absolute utility model predicts that 1b provides stronger evidence for a tolerance for x because the decision maker chose to receive four shocks instead of just one. The relative utility model predicts that 1c provides stronger evidence because 1b offers no way to determine the relative tolerance of the four chosen effects with respect to one another. Like all generative models, the absolute and relative models incorporate three qualitatively different components: the likelihood term p(c|u, F), the prior p(u), and the reciprocal of the marginal likelihood 1/p(c|F). We assume that the total number of effects is fixed in advance and, as a result, the prior term will be the same for all decisions that we consider. The two other components, however, will vary across decisions. The inverse decision-making approach predicts that both components should influence preference judgments, and we will test this prediction by comparing our 3 two inverse decision-making models to two alternatives that rely only one of these components as an ordering criterion: p(c|∀j ux ≥ uj , F) 1/p(c|F) Representativeness Surprise The representativeness model captures how likely the observed decision would be if the utility for x were high, and previous research has shown that people sometimes rely on a representativeness computation of this kind [13]. The surprise model captures how unexpected the observed decision is overall; surprising decisions may be best explained in terms of a strong preference for x, but unsurprising decisions provide little information about x in particular. 2.2 Feature-based models We also consider a class of feature-based models that use surface features to order decisions. The ten features that we consider are shown in Figure 1d, where x is the effect of interest. As an example, the first feature specifies the number of effects chosen; because x is always among the chosen effects, decisions where few or no other effects belong to the chosen option suggest the strongest preference for x (when all effects are positive). This and the second feature were previously identified by Newtson [14]; we included the eight additional features shown in Figure 1d in an attempt to include all possible features that seemed both simple and relevant. We consider two methods for combining this set of features to order a set of decisions by how strongly they suggest a preference for x. The first model is a standard linear regression model, which we refer to as the weighted feature model. The model learns a weight for each feature, and the rank of a given decision is determined by a weighted sum of its features. The second model is a ranked feature model that sorts the observed decisions with respect to a strict ranking of the features. The top-ranked feature corresponds to the primary sort key, the second-ranked feature to the secondary sort key, and so on. For example, suppose that the top-ranked feature is the number of chosen effects and the second-ranked feature is the number of forgone options. Sorting the three decisions in Figure 1 according to this criterion produces the following ordering: 1a,1c,1b. This notion of sorting items on the basis of ranked features has been applied before to decision-making [15, 16] and other domains of psychology [17], but we are not aware of any previous applications to preference learning. Although our inverse decision-making and feature-based models represent two very different approaches, both may turn out to be valuable. An inverse decision-making approach may be the appropriate account of preference learning at Marr’s [18] computational level, and a feature-based approach may capture the psychological processes by which the computational-level account is implemented. Our goal, therefore, is not necessarily to accept one of these approaches and dismiss the other. Instead, we entertain three distinct possibilities. First, both approaches may account well for the data, which would support the idea that they are valid accounts operating at different levels of analysis. Second, the inverse decision-making approach may offer a better account, suggesting that process-level accounts other than the feature-based approach should be explored. Finally, the feature-based approach may offer a better account, suggesting that inverse decision-making does not constitute an appropriate computational-level account of preference learning. 3 Experiment 1: Positive effects Our first experiment focuses on decisions involving only positive effects. The full set of 47 decisions we used is shown in Figure 2c. This set includes every possible unique decision with up to five different effects, subject to the following constraints: (1) one of the effects (effect x) must always appear in the chosen option, (2) there are no repeated options, (3) each effect may appear in an option at most once, (4) only effects in the chosen option may be repeated in other options, and (5) when effects appear in multiple options, the number of effects is held constant across options. The first constraint is necessary for the sorting task, the second two constraints create a finite space of decisions, and the final two constraints limit attention to what we deemed the most interesting cases. Method 43 Carnegie Mellon undergraduates participated for course credit. Each participant was given a set of cards, with one decision printed on each card. The decisions were represented visually 4 (a) (c) Decisions 42 40 45 Mean human rankings 38 30 23 20 22 17 13 12 11 10 9 8 7 6 19 18 31 34 28 21 26 36 35 33 37 27 29 32 25 24 16 15 14 5 4 3 2 1 Absolute utility model rankings (b) Mean human rankings (Experiment 1) 47 43 44 46 45 38 37 36 34 35 30 32 33 31 29 28 24 26 27 25 21 19 22 20 18 16 17 12 13 7 6 11 5 9 4 10 8 1 2 3 42 40 41 39 47 46 44 41 43 39 23 15 14 Mean human rankings (Experiment 2) 1. dcbax 2. cbax 3. bax 4. ax 5. x 6. dcax | bcax 7. dx | cx | bx | ax 8. cax | bax 9. bdx | bcx | bax 10. dcx | bax 11. bx | ax 12. bdx | cax | bax 13. cx | bx | ax 14. d | cbax 15. c | bax 16. b | ax 17. d | c | bax 18. dc | bax 19. c | b | ax 20. dc | bx | ax 21. bdc | bax 22. ad | cx | bx | ax 23. d | c | b | ax 24. bad | bcx | bax 25. ac | bx | ax 26. cb | ax 27. cbad | cbax 28. dc | b | ax 29. ad | ac | bx | ax 30. ab | ax 31. bad | bax 32. dc | ab | ax 33. dcb | ax 34. a | x 35. bad | bac | bax 36. ac | ab | ax 37. ad | ac | ab | ax 38. b | a | x 39. ba | x 40. c | b | a | x 41. cb | a | x 42. d | c | b | a | x 43. cba | x 44. dc | ba | x 45. dc | b | a | x 46. dcb | a | x 47. dcba | x Figure 2: (a) Comparison between the absolute utility model rankings and the mean human rankings for Experiment 1. Each point represents one decision, numbered with respect to the list in panel c. (b) Comparison between the mean human rankings in Experiments 1 and 2. In both scatter plots, the solid diagonal lines indicate a perfect correspondence between the two sets of rankings. (c) The complete set of decisions, ordered by the mean human rankings from Experiment 1. Options are separated by vertical bars and the chosen option is always at the far right. Participants were always asked about a preference for effect x. as in Figure 1 but without the letter labels. Participants were told that the effects were different types of candy and each option was a bag containing one or more pieces of candy. They were asked to sort the cards by how strongly each decision suggested that the decision maker liked a particular target candy, labeled x in Figure 2c. They sorted the cards freely on a table but reported their final rankings by writing them on a sheet of paper, from weakest to strongest evidence. They were instructed to order the cards as completely as possible, but were told that they could assign the same ranking to a set of cards if they believed those cards provided equal evidence. 3.1 Results Two participants were excluded as outliers based on the criterion that their rankings for at least five decisions were at least three standard deviations from the mean rankings. We performed a hierarchical clustering analysis of the remaining 41 participants’ rankings using rank correlation as a similarity metric. Participants’ rankings were highly correlated: cutting the resulting dendrogram at 0.2 resulted in one cluster that included 33 participants and the second largest cluster included 5 Surprise MAE = 17.8 MAE = 7.0 MAE = 4.3 MAE = 17.3 MAE = 9.5 Human rankings Experiment 2 Negative effects Representativeness MAE = 2.3 MAE = 6.7 Experiment 1 Positive effects Relative utility MAE = 2.3 Human rankings Absolute utility Model rankings Model rankings Model rankings Model rankings Figure 3: Comparison between human rankings in both experiments and predicted rankings from four models. The solid diagonal lines indicate a perfect correspondence between human and model rankings. only 3 participants. Thus, we grouped all participants together and analyzed their mean rankings. The 0.2 threshold was chosen because it produced the most informative clustering in Experiment 2. Inverse decision-making models We implemented the inverse decision-making models using importance sampling with 5 million samples drawn from the prior distribution p(u). Because all the effects were positive, we used a prior on utilities that placed nearly all probability mass above zero (µ = 4, σ = 2). The mean human rankings are compared with the absolute utility model rankings in Figure 2a, and the mean human rankings are listed in order in 2c. Fractional rankings were used for both the human data and the model predictions. The human rankings in the figure are the means of participants’ fractional rankings. The first row of Figure 3 contains similar plots that allow comparison of the four models we considered. In these plots, the solid diagonal lines indicate a perfect correspondence between model and human rankings. Thus, the largest deviations from this line represent the largest deviations in the data from the model’s predictions. Figure 3 shows that the absolute and relative utility models make virtually identical predictions and both models provide a strong account of the human rankings as measured by mean absolute error (MAE = 2.3 in both cases). Moreover, both models correctly predict the highest ranked decision and the set of lowest ranked decisions. The only clear discrepancy between the model predictions and the data is the cluster of points at the lower left, labeled as Decisions 6–13 in Figure 2a. These are all cases in which effect x appears in all options and therefore these decisions provide no information about a decision maker’s preference for x. Consequently, the models assign the same ranking to this group as to the group of decisions in which there is only a single option (Decisions 1–5). Although people appeared to treat these groups somewhat differently, the models still correctly predict that the entire group of decisions 1–13 is ranked lower than all other decisions. The surprise and representativeness models do not perform nearly as well (MAE = 7.0 and 17.8, respectively). Although the surprise model captures some of the general trends in the human rankings, it makes several major errors. For example, consider Decision 7: dx|cx|bx|ax. This decision provides no information about a preference for x because it appears in every option. The decision is surprising, however, because a decision maker choosing at random from these options would make the observed choice only 1/4 of the time. The representativeness model performs even worse, primarily because it does not take into account alternative explanations for why an option was chosen, such as the fact that no other options were available (e.g., Decision 1 in Figure 2c). The failure of these models to adequately account for the data suggests that both the likelihood p(c|u, F) and marginal likelihood p(c|F) are important components of the absolute and relative utility models. Feature-based models We compared the performance of the absolute and relative utility models to our two feature-based models: the weighted feature and ranked feature models. For each participant, 6 (b) Ranked feature 10 10 5 Figure 4: Results of the feature-based model analysis from Experiment 1 for (a) the weighted feature models and (b) the ranked feature models. The histograms show the minimum number of features needed to match the accuracy (measured by MAE) of the absolute utility model for each participant. 15 5 1 2 3 4 5 6 >6 15 1 2 3 4 5 6 7 8 9 10 >10 Number of participants (a) Weighted feature Number of features needed we considered every subset of features1 in Figure 1d in order to determine the minimum number of features needed by the two models to achieve the same level of accuracy as the absolute utility model, as measured by mean absolute error. The results of these analyses are shown in Figure 4. For the majority of participants, at least four features were needed by both models to match the accuracy of the absolute utility model. For the weighted feature model, 14 participants could not be fit as well as the absolute utility model even when all ten features were considered. These results indicate that a feature-based account of people’s inferences in our task must be supplied with a relatively large number of features. By contrast, the inverse decision-making approach provides a relatively parsimonious account of the data. 4 Experiment 2: Negative effects Experiment 2 focused on a setting in which all effects are negative, motivated by the fact that the inverse decision-making models predict several major differences in orderings when effects are negative rather than positive. For instance, the absolute utility model’s relative rankings of the decisions in Figures 1a and 1b are reversed when all effects are negative rather than positive. Method 42 Carnegie Mellon undergraduates participated for course credit. The experimental design was identical to Experiment 1 except that participants were told that the effects were electric shocks at different body locations. They were asked to sort the cards on the basis of how strongly each decision suggested that the decision maker finds shocks at the target location relatively tolerable. The model predictions were derived in the same way as for Experiment 1, but with a prior distribution on utilities that placed nearly all probability mass below zero (µ = −4, σ = 2) to reflect the fact that effects were all negative. 4.1 Results Three participants were excluded as outliers by the same criterion applied in Experiment 1. The resulting mean rankings are compared with the corresponding rankings from Experiment 1 in Figure 2b. The figure shows that responses based on positive and negative effects were substantially different in a number of cases. Figure 3 shows how the mean rankings compare to the predictions of the four models we considered. Although the relative utility model is fairly accurate, no model achieves the same level of accuracy as the absolute and relative utility models in Experiment 1. In addition, the relative utility model provides a poor account of the responses of many individual participants. To better understand responses at the individual level, we repeated the hierarchical clustering analysis described in Experiment 1, which revealed that 29 participants could be grouped into one of four clusters, with the remaining participants each in their own clusters. We analyzed these four clusters independently, excluding the 10 participants that could not be naturally grouped. We compared the mean rankings of each cluster to the absolute and relative utility models, as well as all one- and two-feature weighted feature and ranked feature models. Figure 5 shows that the mean rankings of participants in Cluster 1 (N = 8) were best fit by the absolute utility model, the mean rankings of participants in Cluster 2 (N = 12) were best fit by the relative utility model, and the mean rankings of participants in Clusters 3 (N = 3) and 4 (N = 6) were better fit by feature-based models than by either the absolute or relative utility models. 1 A maximum of six features was considered for the ranked feature model because considering more features was computationally intractable. 7 Cluster 4 N =6 MAE = 4.9 MAE = 14.0 MAE = 7.9 MAE = 5.3 MAE = 2.6 MAE = 13.0 MAE = 6.2 Human rankings Relative utility Cluster 3 N =3 MAE = 2.6 Absolute utility Cluster 2 N = 12 Human rankings Cluster 1 N =8 Factors: 1,3 Factors: 1,8 MAE = 2.3 MAE = 5.2 Model rankings Best−fitting weighted feature Factors: 6,7 MAE = 4.0 Model rankings Model rankings Model rankings Human rankings Factors: 3,8 MAE = 4.8 Figure 5: Comparison between human rankings for four clusters of participants identified in Experiment 2 and predicted rankings from three models. Each point in the plots corresponds to one decision and the solid diagonal lines indicate a perfect correspondence between human and model rankings. The third row shows the predictions of the best-fitting two-factor weighted feature model for each cluster. The two factors listed refer to Figure 1d. To examine how well the models accounted for individuals’ rankings within each cluster, we compared the predictions of the inverse decision-making models to the best-fitting two-factor featurebased model for each participant. In Cluster 1, 7 out of 8 participants were best fit by the absolute utility model; in Cluster 2, 8 out of 12 participants were best fit by the relative utility model; in Clusters 3 and 4, all participants were better fit by feature-based models. No single feature-based model provided the best fit for more than two participants, suggesting that participants not fit well by the inverse decision-making models were not using a single alternative strategy. Applying the feature-based model analysis from Experiment 1 to the current results revealed that the weighted feature model required an average of 6.0 features to match the performance of the absolute utility model for participants in Cluster 1, and an average of 3.9 features to match the performance of the relative utility model for participants in Cluster 2. Thus, although a single model did not fit all participants well in the current experiment, many participants were fit well by one of the two inverse decision-making models, suggesting that this general approach is useful for explaining how people reason about negative effects as well as positive effects. 5 Conclusion In two experiments, we found that an inverse decision-making approach offered a good computational account of how people make judgments about others’ preferences. Although this approach is conceptually simple, our analyses indicated that it captures the influence of a fairly large number of relevant decision features. Indeed, the feature-based models that we considered as potential process models of preference learning could only match the performance of the inverse decision-making approach when supplied with a relatively large number of features. We feel that this result rules out the feature-based approach as psychologically implausible, meaning that alternative process-level accounts will need to be explored. One possibility is sampling, which has been proposed as a psychological mechanism for approximating probabilistic inferences [19, 20]. However, even if process models that use large numbers of features are considered plausible, the inverse decision-making approach provides a valuable computational-level account that helps to explain which decision features are informative. Acknowledgments This work was supported in part by the Pittsburgh Life Sciences Greenhouse Opportunity Fund and by NSF grant CDI-0835797. 8 References [1] D. McFadden. Conditional logit analysis of qualitative choice behavior. In P. Zarembka, editor, Frontiers in Econometrics. Amademic Press, New York, 1973. [2] C. G. Lucas, T. L. Griffiths, F. Xu, and C. Fawcett. A rational model of preference learning and choice prediction by children. In Proceedings of Neural Information Processing Systems 21, 2009. [3] L. Bergen, O. R. Evans, and J. B. Tenenbaum. Learning structured preferences. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society, 2010. [4] A. Jern and C. Kemp. Decision factors that support preference learning. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011. [5] T. Kushnir, F. Xu, and H. M. Wellman. Young children use statistical sampling to infer the preferences of other people. Psychological Science, 21(8):1134–1140, 2010. [6] L. Ma and F. Xu. Young children’s use of statistical sampling evidence to infer the subjectivity of preferences. Cognition, in press. [7] M. J. Doherty. Theory of Mind: How Children Understand Others’ Thoughts and Feelings. Psychology Press, New York, 2009. [8] R. N. Shepard, C. I. Hovland, and H. M. Jenkins. Learning and memorization of classifications. Psychological Monographs, 75, Whole No. 517, 1961. [9] D. N. Osherson, E. E. Smith, O. Wilkie, A. L´ pez, and E. Shafir. Category-based induction. Psychological o Review, 97(2):185–200, 1990. [10] E. A. Wasserman, S. M. Elek, D. L. Chatlosh, and A. G. Baker. Rating causal relations: Role of probability in judgments of response-outcome contingency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(1):174–188, 1993. [11] R. D. Luce. Individual choice behavior. John Wiley, 1959. [12] D. Ariely, G. Loewenstein, and D. Prelec. Tom Sawyer and the construction of value. Journal of Economic Behavior & Organization, 60:1–10, 2006. [13] D. Kahneman and A. Tversky. Subjective probability: A judgment of representativeness. Cognitive Psychology, 3(3):430–454, 1972. [14] D. Newtson. Dispositional inference from effects of actions: Effects chosen and effects forgone. Journal of Experimental Social Psychology, 10:489–496, 1974. [15] P. C. Fishburn. Lexicographic orders, utilities and decision rules: A survey. Management Science, 20(11):1442–1471, 1974. [16] G. Gigerenzer and P. M. Todd. Fast and frugal heuristics: The adaptive toolbox. Oxford University Press, New York, 1999. [17] A. Prince and P. Smolensky. Optimality Theory: Constraint Interaction in Generative Grammar. WileyBlackwell, 2004. [18] D. Marr. Vision. W. H. Freeman, San Francisco, 1982. [19] A. N. Sanborn, T. L. Griffiths, and D. J. Navarro. Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117:1144–1167, 2010. [20] L. Shi and T. L. Griffiths. Neural implementation of Bayesian inference by importance sampling. In Proceedings of Neural Information Processing Systems 22, 2009. 9

3 0.078212909 34 nips-2011-An Unsupervised Decontamination Procedure For Improving The Reliability Of Human Judgments

Author: Michael C. Mozer, Benjamin Link, Harold Pashler

Abstract: Psychologists have long been struck by individuals’ limitations in expressing their internal sensations, impressions, and evaluations via rating scales. Instead of using an absolute scale, individuals rely on reference points from recent experience. This relativity of judgment limits the informativeness of responses on surveys, questionnaires, and evaluation forms. Fortunately, the cognitive processes that map stimuli to responses are not simply noisy, but rather are influenced by recent experience in a lawful manner. We explore techniques to remove sequential dependencies, and thereby decontaminate a series of ratings to obtain more meaningful human judgments. In our formulation, the problem is to infer latent (subjective) impressions from a sequence of stimulus labels (e.g., movie names) and responses. We describe an unsupervised approach that simultaneously recovers the impressions and parameters of a contamination model that predicts how recent judgments affect the current response. We test our iterated impression inference, or I3 , algorithm in three domains: rating the gap between dots, the desirability of a movie based on an advertisement, and the morality of an action. We demonstrate significant objective improvements in the quality of the recovered impressions. 1

4 0.055047102 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension

Author: Faisal Khan, Bilge Mutlu, Xiaojin Zhu

Abstract: We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot.1 Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner’s expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning. 1

5 0.05474861 145 nips-2011-Learning Eigenvectors for Free

Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth

Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret. 1

6 0.053982291 21 nips-2011-Active Learning with a Drifting Distribution

7 0.052652761 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

8 0.050850786 15 nips-2011-A rational model of causal inference with continuous causes

9 0.050781298 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

10 0.050354607 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

11 0.050317597 35 nips-2011-An ideal observer model for identifying the reference frame of objects

12 0.048187815 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation

13 0.046100046 282 nips-2011-The Fast Convergence of Boosting

14 0.044093564 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

15 0.044044312 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

16 0.041719221 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

17 0.041337717 95 nips-2011-Fast and Accurate k-means For Large Datasets

18 0.041337065 202 nips-2011-On the Universality of Online Mirror Descent

19 0.040422849 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

20 0.038831122 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.118), (1, -0.041), (2, -0.008), (3, -0.008), (4, 0.001), (5, 0.012), (6, -0.003), (7, -0.044), (8, -0.037), (9, 0.012), (10, 0.019), (11, -0.008), (12, 0.031), (13, -0.012), (14, 0.123), (15, -0.023), (16, 0.032), (17, 0.005), (18, 0.094), (19, -0.049), (20, 0.033), (21, -0.022), (22, -0.002), (23, 0.003), (24, 0.064), (25, 0.089), (26, -0.039), (27, -0.075), (28, -0.02), (29, 0.061), (30, -0.046), (31, -0.025), (32, -0.065), (33, 0.046), (34, 0.024), (35, 0.046), (36, -0.045), (37, -0.052), (38, 0.03), (39, 0.009), (40, -0.092), (41, 0.063), (42, 0.077), (43, 0.018), (44, 0.017), (45, -0.056), (46, 0.055), (47, 0.017), (48, -0.016), (49, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90208972 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

Author: Jacob D. Abernethy, Rafael M. Frongillo

Abstract: Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of “crowdsourcing” prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively “learn” a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set. 1

2 0.71585411 90 nips-2011-Evaluating the inverse decision-making approach to preference learning

Author: Alan Jern, Christopher G. Lucas, Charles Kemp

Abstract: Psychologists have recently begun to develop computational accounts of how people infer others’ preferences from their behavior. The inverse decision-making approach proposes that people infer preferences by inverting a generative model of decision-making. Existing data sets, however, do not provide sufficient resolution to thoroughly evaluate this approach. We introduce a new preference learning task that provides a benchmark for evaluating computational accounts and use it to compare the inverse decision-making approach to a feature-based approach, which relies on a discriminative combination of decision features. Our data support the inverse decision-making approach to preference learning. A basic principle of decision-making is that knowing people’s preferences allows us to predict how they will behave: if you know your friend likes comedies and hates horror films, you can probably guess which of these options she will choose when she goes to the theater. Often, however, we do not know what other people like and we can only infer their preferences from their behavior. If you know that a different friend saw a comedy today, does that mean that he likes comedies in general? The conclusion you draw will likely depend on what else was playing and what movie choices he has made in the past. A goal for social cognition research is to develop a computational account of people’s ability to infer others’ preferences. One computational approach is based on inverse decision-making. This approach begins with a model of how someone’s preferences lead to a decision. Then, this model is inverted to determine the most likely preferences that motivated an observed decision. An alternative approach might simply learn a functional mapping between features of an observed decision and the preferences that motivated it. For instance, in your friend’s decision to see a comedy, perhaps the more movie options he turned down, the more likely it is that he has a true preference for comedies. The difference between the inverse decision-making approach and the feature-based approach maps onto the standard dichotomy between generative and discriminative models. Economists have developed an instance of the inverse decision-making approach known as the multinomial logit model [1] that has been widely used to infer consumer’s preferences from their choices. This model has recently been explored as a psychological model [2, 3, 4], but there are few behavioral data sets for evaluating it as a model of how people learn others’ preferences. Additionally, the data sets that do exist tend to be drawn from the developmental literature, which focuses on simple tasks that collect only one or two judgments from children [5, 6, 7]. The limitations of these data sets make it difficult to evaluate the multinomial logit model with respect to alternative accounts of preference learning like the feature-based approach. In this paper, we use data from a new experimental task that elicits a detailed set of preference judgments from a single participant in order to evaluate the predictions of several preference learning models from both the inverse decision-making and feature-based classes. Our task requires each participant to sort a large number of observed decisions on the basis of how strongly they indicate 1 (a) (b) (c) d c c (d) b b a a x d x d c b a x 1. Number of chosen effects (−/+) 2. Number of forgone effects (+/+) 3. Number of forgone options (+/+) 4. Number of forgone options containing x (−/−) 5. Max/min number of effects in a forgone option (+/−) 6. Is x in every option? (−/−) 7. Chose only option with x? (+/+) 8. Is x the only difference between options? (+/+) 9. Do all options have same number of effects? (+/+) 10. Chose option with max/min number of effects? (−/−) Figure 1: (a)–(c) Examples of the decisions used in the experiments. Each column represents one option and the boxes represent different effects. The chosen option is indicated by the black rectangle. (d) Features used by the weighted feature and ranked feature models. Features 5 and 10 involved maxima in Experiment 1, which focused on all positive effects, and minima in Experiment 2, which focused on all negative effects. The signs in parentheses indicate the direction of the feature that suggests a stronger preference in Experiment 1 / Experiment 2. a preference for a chosen item. Because the number of decisions is large and these decisions vary on multiple dimensions, predicting how people will order them offers a challenging benchmark on which to compare computational models of preference learning. Data sets from these sorts of detailed tasks have proved fruitful in other domains. For example, data reported by Shepard, Hovland, and Jenkins [8]; Osherson, Smith, Wilkie, L´ pez, and Shafir [9]; and Wasserman, Elek, Chatlosh, o and Baker [10] have motivated much subsequent research on category learning, inductive reasoning, and causal reasoning, respectively. We first describe our preference learning task in detail. We then present several inverse decisionmaking and feature-based models of preference learning and compare these models’ predictions to people’s judgments in two experiments. The data are well predicted by models that follow the inverse decision-making approach, suggesting that this computational approach may help explain how people learn others’ preferences. 1 Multi-attribute decisions and revealed preferences We designed a task that can be used to elicit a large number of preference judgments from a single participant. The task involves a set of observed multi-attribute decisions, some examples of which are represented visually in Figure 1. Each decision is among a set of options and each option produces a set of effects. Figure 1 shows several decisions involving a total of five effects distributed among up to five options. The differently colored boxes represent different effects and the chosen option is marked by a black rectangle. For example, 1a shows a choice between an option with four effects and an option with a single effect; here, the decision maker chose the second option. In our task, people are asked to rank a large number of these decisions by how strongly they suggest that the decision maker had a preference for a particular effect (e.g., effect x in Figure 1). By imposing some minimal constraints, the space of unique multi-attribute decisions is finite and we can obtain rankings for every decision in the space. For example, Figure 2c shows a complete list of 47 unique decisions involving up to five effects, subject to several constraints described later. Three of these decisions are shown in Figure 1. If all the effects are positive—pieces of candy, for example—the first decision (1a) suggests a strong preference for candy x, because the decision maker turned down four pieces in favor of one. The second decision (1b), however, offers much weaker evidence because nearly everyone would choose four pieces of candy over one, even without a specific preference for x. The third decision (1c) provides evidence that is strong but perhaps not quite as strong as the first decision. When all effects are negative—like electric shocks at different body locations—decision makers may still find some effects more tolerable than others, but different inferences are sometimes supported. For example, for negative effects, 1a provides weak evidence that x is relatively tolerable because nearly everyone would choose one shock over four. 2 A computational account of preference learning We now describe a simple computational model for learning a person’s preferences after observing that person make a decision like the ones in Figure 1. We assume that there are n available options 2 {o1 , . . . , on }, each of which produces one or more effects from the set {f1 , f2 , ..., fm }. For simplicity, we assume that effects are binary. Let ui denote the utility the decision maker assigns to effect fi . We begin by specifying a model of decision-making that makes the standard assumptions that decision makers tend to choose things with greater utility and that utilities are additive. That is, if fj is a binary vector indicating the effects produced by option oj and u is a vector of utilities assigned to each of the m effects, then the total utility associated with option oj can be expressed as Uj = fj T u. We complete the specification of the model by applying the Luce choice rule [11], a common psychological model of choice behavior, as the function that chooses among the options: p(c = oj |u, f ) = exp(Uj ) = exp(Uk ) n k=1 exp(fj T u) n T k=1 exp(fk u) (1) where c denotes the choice made. This model can predict the choice someone will make among a specified set of options, given the utilities that person assigns to the effects in each option. To obtain estimates of someone’s utilities, we invert this model by applying Bayes’ rule: p(u|c, F) = p(c|u, F)p(u) p(c|F) (2) where F = {f1 , . . . , fn } specifies the available options and their corresponding effects. This is the multinomial logit model [1], a standard econometric model. In order to apply Equation 2 we must specify a prior p(u) on the utilities. We adopt a standard approach that places independent Gaussian priors on the utilities: ui ∼ N (µ, σ 2 ). For decisions where effects are positive—like candies—we set µ = 2σ, which corresponds to a prior distribution that places approximately 2% of the probability mass below zero. Similarly, for negative effects—like electric shocks—we set µ = −2σ. 2.1 Ordering a set of observed decisions Equation 2 specifies a posterior probability distribution over utilities for a single observed decision but does not provide a way to compare the inferences drawn from multiple decisions for the purposes of ordering them. Suppose we are interested in a decision maker’s preference for effect x and we wish to order a set of decisions by how strongly they support this preference. Two criteria for ordering the decisions are as follows: Absolute utility Relative utility p(c|ux , F)p(ux ) p(c|F) p(c|∀j ux ≥ uj , F)p(∀j ux ≥ uj ) p(∀j ux ≥ uj |c, F) = p(c|F) E(ux |c, F) = Eux The absolute utility model orders decisions by the mean posterior utility for effect x. This criterion is perhaps the most natural way to assess how much a decision indicates a preference for x, but it requires an inference about the utility of x in isolation, and research suggests that people often think about the utility of an effect only in relation to other salient possibilities [12]. The relative utility model applies this idea to preference learning by ordering decisions based on how strongly they suggest that x has a greater utility than all other effects. The decisions in Figures 1b and 1c are cases where the two models lead to different predictions. If the effects are all negative (e.g., electric shocks), the absolute utility model predicts that 1b provides stronger evidence for a tolerance for x because the decision maker chose to receive four shocks instead of just one. The relative utility model predicts that 1c provides stronger evidence because 1b offers no way to determine the relative tolerance of the four chosen effects with respect to one another. Like all generative models, the absolute and relative models incorporate three qualitatively different components: the likelihood term p(c|u, F), the prior p(u), and the reciprocal of the marginal likelihood 1/p(c|F). We assume that the total number of effects is fixed in advance and, as a result, the prior term will be the same for all decisions that we consider. The two other components, however, will vary across decisions. The inverse decision-making approach predicts that both components should influence preference judgments, and we will test this prediction by comparing our 3 two inverse decision-making models to two alternatives that rely only one of these components as an ordering criterion: p(c|∀j ux ≥ uj , F) 1/p(c|F) Representativeness Surprise The representativeness model captures how likely the observed decision would be if the utility for x were high, and previous research has shown that people sometimes rely on a representativeness computation of this kind [13]. The surprise model captures how unexpected the observed decision is overall; surprising decisions may be best explained in terms of a strong preference for x, but unsurprising decisions provide little information about x in particular. 2.2 Feature-based models We also consider a class of feature-based models that use surface features to order decisions. The ten features that we consider are shown in Figure 1d, where x is the effect of interest. As an example, the first feature specifies the number of effects chosen; because x is always among the chosen effects, decisions where few or no other effects belong to the chosen option suggest the strongest preference for x (when all effects are positive). This and the second feature were previously identified by Newtson [14]; we included the eight additional features shown in Figure 1d in an attempt to include all possible features that seemed both simple and relevant. We consider two methods for combining this set of features to order a set of decisions by how strongly they suggest a preference for x. The first model is a standard linear regression model, which we refer to as the weighted feature model. The model learns a weight for each feature, and the rank of a given decision is determined by a weighted sum of its features. The second model is a ranked feature model that sorts the observed decisions with respect to a strict ranking of the features. The top-ranked feature corresponds to the primary sort key, the second-ranked feature to the secondary sort key, and so on. For example, suppose that the top-ranked feature is the number of chosen effects and the second-ranked feature is the number of forgone options. Sorting the three decisions in Figure 1 according to this criterion produces the following ordering: 1a,1c,1b. This notion of sorting items on the basis of ranked features has been applied before to decision-making [15, 16] and other domains of psychology [17], but we are not aware of any previous applications to preference learning. Although our inverse decision-making and feature-based models represent two very different approaches, both may turn out to be valuable. An inverse decision-making approach may be the appropriate account of preference learning at Marr’s [18] computational level, and a feature-based approach may capture the psychological processes by which the computational-level account is implemented. Our goal, therefore, is not necessarily to accept one of these approaches and dismiss the other. Instead, we entertain three distinct possibilities. First, both approaches may account well for the data, which would support the idea that they are valid accounts operating at different levels of analysis. Second, the inverse decision-making approach may offer a better account, suggesting that process-level accounts other than the feature-based approach should be explored. Finally, the feature-based approach may offer a better account, suggesting that inverse decision-making does not constitute an appropriate computational-level account of preference learning. 3 Experiment 1: Positive effects Our first experiment focuses on decisions involving only positive effects. The full set of 47 decisions we used is shown in Figure 2c. This set includes every possible unique decision with up to five different effects, subject to the following constraints: (1) one of the effects (effect x) must always appear in the chosen option, (2) there are no repeated options, (3) each effect may appear in an option at most once, (4) only effects in the chosen option may be repeated in other options, and (5) when effects appear in multiple options, the number of effects is held constant across options. The first constraint is necessary for the sorting task, the second two constraints create a finite space of decisions, and the final two constraints limit attention to what we deemed the most interesting cases. Method 43 Carnegie Mellon undergraduates participated for course credit. Each participant was given a set of cards, with one decision printed on each card. The decisions were represented visually 4 (a) (c) Decisions 42 40 45 Mean human rankings 38 30 23 20 22 17 13 12 11 10 9 8 7 6 19 18 31 34 28 21 26 36 35 33 37 27 29 32 25 24 16 15 14 5 4 3 2 1 Absolute utility model rankings (b) Mean human rankings (Experiment 1) 47 43 44 46 45 38 37 36 34 35 30 32 33 31 29 28 24 26 27 25 21 19 22 20 18 16 17 12 13 7 6 11 5 9 4 10 8 1 2 3 42 40 41 39 47 46 44 41 43 39 23 15 14 Mean human rankings (Experiment 2) 1. dcbax 2. cbax 3. bax 4. ax 5. x 6. dcax | bcax 7. dx | cx | bx | ax 8. cax | bax 9. bdx | bcx | bax 10. dcx | bax 11. bx | ax 12. bdx | cax | bax 13. cx | bx | ax 14. d | cbax 15. c | bax 16. b | ax 17. d | c | bax 18. dc | bax 19. c | b | ax 20. dc | bx | ax 21. bdc | bax 22. ad | cx | bx | ax 23. d | c | b | ax 24. bad | bcx | bax 25. ac | bx | ax 26. cb | ax 27. cbad | cbax 28. dc | b | ax 29. ad | ac | bx | ax 30. ab | ax 31. bad | bax 32. dc | ab | ax 33. dcb | ax 34. a | x 35. bad | bac | bax 36. ac | ab | ax 37. ad | ac | ab | ax 38. b | a | x 39. ba | x 40. c | b | a | x 41. cb | a | x 42. d | c | b | a | x 43. cba | x 44. dc | ba | x 45. dc | b | a | x 46. dcb | a | x 47. dcba | x Figure 2: (a) Comparison between the absolute utility model rankings and the mean human rankings for Experiment 1. Each point represents one decision, numbered with respect to the list in panel c. (b) Comparison between the mean human rankings in Experiments 1 and 2. In both scatter plots, the solid diagonal lines indicate a perfect correspondence between the two sets of rankings. (c) The complete set of decisions, ordered by the mean human rankings from Experiment 1. Options are separated by vertical bars and the chosen option is always at the far right. Participants were always asked about a preference for effect x. as in Figure 1 but without the letter labels. Participants were told that the effects were different types of candy and each option was a bag containing one or more pieces of candy. They were asked to sort the cards by how strongly each decision suggested that the decision maker liked a particular target candy, labeled x in Figure 2c. They sorted the cards freely on a table but reported their final rankings by writing them on a sheet of paper, from weakest to strongest evidence. They were instructed to order the cards as completely as possible, but were told that they could assign the same ranking to a set of cards if they believed those cards provided equal evidence. 3.1 Results Two participants were excluded as outliers based on the criterion that their rankings for at least five decisions were at least three standard deviations from the mean rankings. We performed a hierarchical clustering analysis of the remaining 41 participants’ rankings using rank correlation as a similarity metric. Participants’ rankings were highly correlated: cutting the resulting dendrogram at 0.2 resulted in one cluster that included 33 participants and the second largest cluster included 5 Surprise MAE = 17.8 MAE = 7.0 MAE = 4.3 MAE = 17.3 MAE = 9.5 Human rankings Experiment 2 Negative effects Representativeness MAE = 2.3 MAE = 6.7 Experiment 1 Positive effects Relative utility MAE = 2.3 Human rankings Absolute utility Model rankings Model rankings Model rankings Model rankings Figure 3: Comparison between human rankings in both experiments and predicted rankings from four models. The solid diagonal lines indicate a perfect correspondence between human and model rankings. only 3 participants. Thus, we grouped all participants together and analyzed their mean rankings. The 0.2 threshold was chosen because it produced the most informative clustering in Experiment 2. Inverse decision-making models We implemented the inverse decision-making models using importance sampling with 5 million samples drawn from the prior distribution p(u). Because all the effects were positive, we used a prior on utilities that placed nearly all probability mass above zero (µ = 4, σ = 2). The mean human rankings are compared with the absolute utility model rankings in Figure 2a, and the mean human rankings are listed in order in 2c. Fractional rankings were used for both the human data and the model predictions. The human rankings in the figure are the means of participants’ fractional rankings. The first row of Figure 3 contains similar plots that allow comparison of the four models we considered. In these plots, the solid diagonal lines indicate a perfect correspondence between model and human rankings. Thus, the largest deviations from this line represent the largest deviations in the data from the model’s predictions. Figure 3 shows that the absolute and relative utility models make virtually identical predictions and both models provide a strong account of the human rankings as measured by mean absolute error (MAE = 2.3 in both cases). Moreover, both models correctly predict the highest ranked decision and the set of lowest ranked decisions. The only clear discrepancy between the model predictions and the data is the cluster of points at the lower left, labeled as Decisions 6–13 in Figure 2a. These are all cases in which effect x appears in all options and therefore these decisions provide no information about a decision maker’s preference for x. Consequently, the models assign the same ranking to this group as to the group of decisions in which there is only a single option (Decisions 1–5). Although people appeared to treat these groups somewhat differently, the models still correctly predict that the entire group of decisions 1–13 is ranked lower than all other decisions. The surprise and representativeness models do not perform nearly as well (MAE = 7.0 and 17.8, respectively). Although the surprise model captures some of the general trends in the human rankings, it makes several major errors. For example, consider Decision 7: dx|cx|bx|ax. This decision provides no information about a preference for x because it appears in every option. The decision is surprising, however, because a decision maker choosing at random from these options would make the observed choice only 1/4 of the time. The representativeness model performs even worse, primarily because it does not take into account alternative explanations for why an option was chosen, such as the fact that no other options were available (e.g., Decision 1 in Figure 2c). The failure of these models to adequately account for the data suggests that both the likelihood p(c|u, F) and marginal likelihood p(c|F) are important components of the absolute and relative utility models. Feature-based models We compared the performance of the absolute and relative utility models to our two feature-based models: the weighted feature and ranked feature models. For each participant, 6 (b) Ranked feature 10 10 5 Figure 4: Results of the feature-based model analysis from Experiment 1 for (a) the weighted feature models and (b) the ranked feature models. The histograms show the minimum number of features needed to match the accuracy (measured by MAE) of the absolute utility model for each participant. 15 5 1 2 3 4 5 6 >6 15 1 2 3 4 5 6 7 8 9 10 >10 Number of participants (a) Weighted feature Number of features needed we considered every subset of features1 in Figure 1d in order to determine the minimum number of features needed by the two models to achieve the same level of accuracy as the absolute utility model, as measured by mean absolute error. The results of these analyses are shown in Figure 4. For the majority of participants, at least four features were needed by both models to match the accuracy of the absolute utility model. For the weighted feature model, 14 participants could not be fit as well as the absolute utility model even when all ten features were considered. These results indicate that a feature-based account of people’s inferences in our task must be supplied with a relatively large number of features. By contrast, the inverse decision-making approach provides a relatively parsimonious account of the data. 4 Experiment 2: Negative effects Experiment 2 focused on a setting in which all effects are negative, motivated by the fact that the inverse decision-making models predict several major differences in orderings when effects are negative rather than positive. For instance, the absolute utility model’s relative rankings of the decisions in Figures 1a and 1b are reversed when all effects are negative rather than positive. Method 42 Carnegie Mellon undergraduates participated for course credit. The experimental design was identical to Experiment 1 except that participants were told that the effects were electric shocks at different body locations. They were asked to sort the cards on the basis of how strongly each decision suggested that the decision maker finds shocks at the target location relatively tolerable. The model predictions were derived in the same way as for Experiment 1, but with a prior distribution on utilities that placed nearly all probability mass below zero (µ = −4, σ = 2) to reflect the fact that effects were all negative. 4.1 Results Three participants were excluded as outliers by the same criterion applied in Experiment 1. The resulting mean rankings are compared with the corresponding rankings from Experiment 1 in Figure 2b. The figure shows that responses based on positive and negative effects were substantially different in a number of cases. Figure 3 shows how the mean rankings compare to the predictions of the four models we considered. Although the relative utility model is fairly accurate, no model achieves the same level of accuracy as the absolute and relative utility models in Experiment 1. In addition, the relative utility model provides a poor account of the responses of many individual participants. To better understand responses at the individual level, we repeated the hierarchical clustering analysis described in Experiment 1, which revealed that 29 participants could be grouped into one of four clusters, with the remaining participants each in their own clusters. We analyzed these four clusters independently, excluding the 10 participants that could not be naturally grouped. We compared the mean rankings of each cluster to the absolute and relative utility models, as well as all one- and two-feature weighted feature and ranked feature models. Figure 5 shows that the mean rankings of participants in Cluster 1 (N = 8) were best fit by the absolute utility model, the mean rankings of participants in Cluster 2 (N = 12) were best fit by the relative utility model, and the mean rankings of participants in Clusters 3 (N = 3) and 4 (N = 6) were better fit by feature-based models than by either the absolute or relative utility models. 1 A maximum of six features was considered for the ranked feature model because considering more features was computationally intractable. 7 Cluster 4 N =6 MAE = 4.9 MAE = 14.0 MAE = 7.9 MAE = 5.3 MAE = 2.6 MAE = 13.0 MAE = 6.2 Human rankings Relative utility Cluster 3 N =3 MAE = 2.6 Absolute utility Cluster 2 N = 12 Human rankings Cluster 1 N =8 Factors: 1,3 Factors: 1,8 MAE = 2.3 MAE = 5.2 Model rankings Best−fitting weighted feature Factors: 6,7 MAE = 4.0 Model rankings Model rankings Model rankings Human rankings Factors: 3,8 MAE = 4.8 Figure 5: Comparison between human rankings for four clusters of participants identified in Experiment 2 and predicted rankings from three models. Each point in the plots corresponds to one decision and the solid diagonal lines indicate a perfect correspondence between human and model rankings. The third row shows the predictions of the best-fitting two-factor weighted feature model for each cluster. The two factors listed refer to Figure 1d. To examine how well the models accounted for individuals’ rankings within each cluster, we compared the predictions of the inverse decision-making models to the best-fitting two-factor featurebased model for each participant. In Cluster 1, 7 out of 8 participants were best fit by the absolute utility model; in Cluster 2, 8 out of 12 participants were best fit by the relative utility model; in Clusters 3 and 4, all participants were better fit by feature-based models. No single feature-based model provided the best fit for more than two participants, suggesting that participants not fit well by the inverse decision-making models were not using a single alternative strategy. Applying the feature-based model analysis from Experiment 1 to the current results revealed that the weighted feature model required an average of 6.0 features to match the performance of the absolute utility model for participants in Cluster 1, and an average of 3.9 features to match the performance of the relative utility model for participants in Cluster 2. Thus, although a single model did not fit all participants well in the current experiment, many participants were fit well by one of the two inverse decision-making models, suggesting that this general approach is useful for explaining how people reason about negative effects as well as positive effects. 5 Conclusion In two experiments, we found that an inverse decision-making approach offered a good computational account of how people make judgments about others’ preferences. Although this approach is conceptually simple, our analyses indicated that it captures the influence of a fairly large number of relevant decision features. Indeed, the feature-based models that we considered as potential process models of preference learning could only match the performance of the inverse decision-making approach when supplied with a relatively large number of features. We feel that this result rules out the feature-based approach as psychologically implausible, meaning that alternative process-level accounts will need to be explored. One possibility is sampling, which has been proposed as a psychological mechanism for approximating probabilistic inferences [19, 20]. However, even if process models that use large numbers of features are considered plausible, the inverse decision-making approach provides a valuable computational-level account that helps to explain which decision features are informative. Acknowledgments This work was supported in part by the Pittsburgh Life Sciences Greenhouse Opportunity Fund and by NSF grant CDI-0835797. 8 References [1] D. McFadden. Conditional logit analysis of qualitative choice behavior. In P. Zarembka, editor, Frontiers in Econometrics. Amademic Press, New York, 1973. [2] C. G. Lucas, T. L. Griffiths, F. Xu, and C. Fawcett. A rational model of preference learning and choice prediction by children. In Proceedings of Neural Information Processing Systems 21, 2009. [3] L. Bergen, O. R. Evans, and J. B. Tenenbaum. Learning structured preferences. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society, 2010. [4] A. Jern and C. Kemp. Decision factors that support preference learning. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011. [5] T. Kushnir, F. Xu, and H. M. Wellman. Young children use statistical sampling to infer the preferences of other people. Psychological Science, 21(8):1134–1140, 2010. [6] L. Ma and F. Xu. Young children’s use of statistical sampling evidence to infer the subjectivity of preferences. Cognition, in press. [7] M. J. Doherty. Theory of Mind: How Children Understand Others’ Thoughts and Feelings. Psychology Press, New York, 2009. [8] R. N. Shepard, C. I. Hovland, and H. M. Jenkins. Learning and memorization of classifications. Psychological Monographs, 75, Whole No. 517, 1961. [9] D. N. Osherson, E. E. Smith, O. Wilkie, A. L´ pez, and E. Shafir. Category-based induction. Psychological o Review, 97(2):185–200, 1990. [10] E. A. Wasserman, S. M. Elek, D. L. Chatlosh, and A. G. Baker. Rating causal relations: Role of probability in judgments of response-outcome contingency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(1):174–188, 1993. [11] R. D. Luce. Individual choice behavior. John Wiley, 1959. [12] D. Ariely, G. Loewenstein, and D. Prelec. Tom Sawyer and the construction of value. Journal of Economic Behavior & Organization, 60:1–10, 2006. [13] D. Kahneman and A. Tversky. Subjective probability: A judgment of representativeness. Cognitive Psychology, 3(3):430–454, 1972. [14] D. Newtson. Dispositional inference from effects of actions: Effects chosen and effects forgone. Journal of Experimental Social Psychology, 10:489–496, 1974. [15] P. C. Fishburn. Lexicographic orders, utilities and decision rules: A survey. Management Science, 20(11):1442–1471, 1974. [16] G. Gigerenzer and P. M. Todd. Fast and frugal heuristics: The adaptive toolbox. Oxford University Press, New York, 1999. [17] A. Prince and P. Smolensky. Optimality Theory: Constraint Interaction in Generative Grammar. WileyBlackwell, 2004. [18] D. Marr. Vision. W. H. Freeman, San Francisco, 1982. [19] A. N. Sanborn, T. L. Griffiths, and D. J. Navarro. Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117:1144–1167, 2010. [20] L. Shi and T. L. Griffiths. Neural implementation of Bayesian inference by importance sampling. In Proceedings of Neural Information Processing Systems 22, 2009. 9

3 0.70400935 34 nips-2011-An Unsupervised Decontamination Procedure For Improving The Reliability Of Human Judgments

Author: Michael C. Mozer, Benjamin Link, Harold Pashler

Abstract: Psychologists have long been struck by individuals’ limitations in expressing their internal sensations, impressions, and evaluations via rating scales. Instead of using an absolute scale, individuals rely on reference points from recent experience. This relativity of judgment limits the informativeness of responses on surveys, questionnaires, and evaluation forms. Fortunately, the cognitive processes that map stimuli to responses are not simply noisy, but rather are influenced by recent experience in a lawful manner. We explore techniques to remove sequential dependencies, and thereby decontaminate a series of ratings to obtain more meaningful human judgments. In our formulation, the problem is to infer latent (subjective) impressions from a sequence of stimulus labels (e.g., movie names) and responses. We describe an unsupervised approach that simultaneously recovers the impressions and parameters of a contamination model that predicts how recent judgments affect the current response. We test our iterated impression inference, or I3 , algorithm in three domains: rating the gap between dots, the desirability of a movie based on an advertisement, and the morality of an action. We demonstrate significant objective improvements in the quality of the recovered impressions. 1

4 0.68278319 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension

Author: Faisal Khan, Bilge Mutlu, Xiaojin Zhu

Abstract: We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot.1 Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner’s expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning. 1

5 0.52584386 280 nips-2011-Testing a Bayesian Measure of Representativeness Using a Large Image Database

Author: Joshua T. Abbott, Katherine A. Heller, Zoubin Ghahramani, Thomas L. Griffiths

Abstract: How do people determine which elements of a set are most representative of that set? We extend an existing Bayesian measure of representativeness, which indicates the representativeness of a sample from a distribution, to define a measure of the representativeness of an item to a set. We show that this measure is formally related to a machine learning method known as Bayesian Sets. Building on this connection, we derive an analytic expression for the representativeness of objects described by a sparse vector of binary features. We then apply this measure to a large database of images, using it to determine which images are the most representative members of different sets. Comparing the resulting predictions to human judgments of representativeness provides a test of this measure with naturalistic stimuli, and illustrates how databases that are more commonly used in computer vision and machine learning can be used to evaluate psychological theories. 1

6 0.45914876 15 nips-2011-A rational model of causal inference with continuous causes

7 0.45519683 27 nips-2011-Advice Refinement in Knowledge-Based SVMs

8 0.43998045 218 nips-2011-Predicting Dynamic Difficulty

9 0.42047 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

10 0.40575731 41 nips-2011-Autonomous Learning of Action Models for Planning

11 0.39118344 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games

12 0.3909815 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning

13 0.37958869 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

14 0.37573192 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs

15 0.36691079 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

16 0.35617989 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery

17 0.35502419 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds

18 0.34577978 256 nips-2011-Solving Decision Problems with Limited Information

19 0.34455726 130 nips-2011-Inductive reasoning about chimeric creatures

20 0.34450489 181 nips-2011-Multiple Instance Learning on Structured Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (4, 0.036), (20, 0.025), (26, 0.037), (31, 0.09), (33, 0.021), (43, 0.052), (45, 0.077), (48, 0.012), (51, 0.356), (57, 0.031), (74, 0.061), (83, 0.061), (99, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75324911 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

Author: Jacob D. Abernethy, Rafael M. Frongillo

Abstract: Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of “crowdsourcing” prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively “learn” a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set. 1

2 0.75011235 89 nips-2011-Estimating time-varying input signals and ion channel states from a single voltage trace of a neuron

Author: Ryota Kobayashi, Yasuhiro Tsubo, Petr Lansky, Shigeru Shinomoto

Abstract: State-of-the-art statistical methods in neuroscience have enabled us to fit mathematical models to experimental data and subsequently to infer the dynamics of hidden parameters underlying the observable phenomena. Here, we develop a Bayesian method for inferring the time-varying mean and variance of the synaptic input, along with the dynamics of each ion channel from a single voltage trace of a neuron. An estimation problem may be formulated on the basis of the state-space model with prior distributions that penalize large fluctuations in these parameters. After optimizing the hyperparameters by maximizing the marginal likelihood, the state-space model provides the time-varying parameters of the input signals and the ion channel states. The proposed method is tested not only on the simulated data from the Hodgkin−Huxley type models but also on experimental data obtained from a cortical slice in vitro. 1

3 0.63486791 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

Author: Sebastian A. Kurtek, Anuj Srivastava, Wei Wu

Abstract: While signal estimation under random amplitudes, phase shifts, and additive noise is studied frequently, the problem of estimating a deterministic signal under random time-warpings has been relatively unexplored. We present a novel framework for estimating the unknown signal that utilizes the action of the warping group to form an equivalence relation between signals. First, we derive an estimator for the equivalence class of the unknown signal using the notion of Karcher mean on the quotient space of equivalence classes. This step requires the use of Fisher-Rao Riemannian metric and a square-root representation of signals to enable computations of distances and means under this metric. Then, we define a notion of the center of a class and show that the center of the estimated class is a consistent estimator of the underlying unknown signal. This estimation algorithm has many applications: (1) registration/alignment of functional data, (2) separation of phase/amplitude components of functional data, (3) joint demodulation and carrier estimation, and (4) sparse modeling of functional data. Here we demonstrate only (1) and (2): Given signals are temporally aligned using nonlinear warpings and, thus, separated into their phase and amplitude components. The proposed method for signal alignment is shown to have state of the art performance using Berkeley growth, handwritten signatures, and neuroscience spike train data. 1

4 0.61505353 287 nips-2011-The Manifold Tangent Classifier

Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller

Abstract: We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local directions changes along the manifold. Record-breaking classification results are obtained. 1

5 0.51816845 87 nips-2011-Energetically Optimal Action Potentials

Author: Martin B. Stemmler, Biswa Sengupta, Simon Laughlin, Jeremy Niven

Abstract: Most action potentials in the nervous system take on the form of strong, rapid, and brief voltage deflections known as spikes, in stark contrast to other action potentials, such as in the heart, that are characterized by broad voltage plateaus. We derive the shape of the neuronal action potential from first principles, by postulating that action potential generation is strongly constrained by the brain’s need to minimize energy expenditure. For a given height of an action potential, the least energy is consumed when the underlying currents obey the bang-bang principle: the currents giving rise to the spike should be intense, yet short-lived, yielding spikes with sharp onsets and offsets. Energy optimality predicts features in the biophysics that are not per se required for producing the characteristic neuronal action potential: sodium currents should be extraordinarily powerful and inactivate with voltage; both potassium and sodium currents should have kinetics that have a bell-shaped voltage-dependence; and the cooperative action of multiple ‘gates’ should start the flow of current. 1 The paradox Nerve cells communicate with each other over long distances using spike-like action potentials, which are brief electrical events traveling rapidly down axons and dendrites. Each action potential is caused by an accelerating influx of sodium or calcium ions, depolarizing the cell membrane by forty millivolts or more, followed by repolarization of the cell membrane caused by an efflux of potassium ions. As different species of ions are swapped across the membrane during the action potential, ion pumps shuttle the excess ions back and restore the ionic concentration gradients. If we label each ionic species by α, the work ∆E done to restore the ionic concentration gradients is [α] ∆E = RT V ∆[α]in ln out , (1) [α]in α where R is the gas constant, T is the temperature in Kelvin, V is the cell volume, [α]in|out is the concentration of ion α inside or outside the cell, and ∆[α]in is the concentration change inside the cell, which is assumed to be small relative to the total concentration. The sum α zα ∆[α] = 0, where zα is the charge on ion α, as no net charge accumulates during the action potential and no net work is done by or on the electric field. Often, sodium (Na+ ) and potassium (K+ ) play the dominant role in generating action potentials, in which case ∆E = ∆[Na]in F V(ENa − EK ), where F is Faraday’s constant, ENa = RT /F ln [Na]out /[Na]in is the reversal potential for Na+ , at which no net sodium current flows, and EK = RT /F ln [K]out /[K]in . This estimate of the work done does not include heat (due to loss through the membrane resistance) or the work done by the ion channel proteins in changing their conformational state during the action potential. Hence, the action potential’s energetic cost to the cell is directly proportional to ∆[Na]in ; taking into account that each Na+ ion carries one elementary charge, the cost is also proportional to the 1 charge QNa that accumulates inside the cell. A maximally efficient cell reduces the charge per spike to a minimum. If a cell fires action potentials at an average rate f , the cell’s Na/K pumps must move Na+ and K+ ions in opposite directions, against their respective concentration gradients, to counteract an average inward Na+ current of f QNa . Exhaustive measurements on myocytes in the heart, which expend tremendous amounts of energy to keep the heart beating, indicate that Na/K pumps expel ∼ 0.5 µA/cm2 of Na+ current at membrane potentials close to rest [1]. Most excitable cells, even when spiking, spend most of their time close to resting potential, and yet standard models for action potentials can easily lead to accumulating an ionic charge of up to 5 µC/cm2 [2]; most of this accumulation occurs during a very brief time interval. If one were to take an isopotential nerve cell with the same density of ion pumps as in the heart, then such a cell would not be able to produce more than an action potential once every ten seconds on average. The brain should be effectively silent. Clearly, this conflicts with what is known about the average firing rates of neurons in the brainstem or even the neocortex, which can sustain spiking up to at least 7 Hz [3]. Part of the discrepancy can be resolved by noting that nerve cells are not isopotential and that action potential generation occurs within a highly restricted area of the membrane. Even so, standard models of action potential generation waste extraordinary amounts of energy; recent evidence [4] points out that many mammalian cortical neurons are much more efficient. As nature places a premium on energy consumption, we will argue that one can predict both the shape of the action potential and the underlying biophysics of the nonlinear, voltage-dependent ionic conductances from the principle of minimal energy consumption. After reviewing the ionic basis of action potentials, we first sketch how to compute the minimal energy cost for an arbitrary spike shape, and then solve for the optimal action potential shape with a given height. Finally, we show how minimal energy consumption explains all the dynamical features in the standard HodgkinHuxley (HH) model for neuronal dynamics that distinguish the brain’s action potentials from other highly nonlinear oscillations in physics and chemistry. 2 Ionic basis of the action potential In an excitable cell, synaptic drive forces the membrane permeability to different ions to change rapidly in time, producing the dynamics of the action potential. The current density Iα carried by an ion species α is given by the Goldman-Hodgkin-Katz (GHK) current equation[5, 6, 2], which assumes that ions are driven independently across the membrane under the influence of a constant electric field. Iα depends upon the ions membrane permeability, Pα , its concentrations on either side of the membrane [α]out and [α]in and the voltage across the membrane, V , according to: Iα = Pα 2 zα V F 2 [α]out − [α]in exp (zα V F/RT ) , RT 1 − exp(zα V F/RT ) (2) To produce the fast currents that generate APs, a subset of the membranes ionic permeabilities Pα are gated by voltage. Changes in the permeability Pα are not instantaneous; the voltage-gated permeability is scaled mathematically by gating variables m(t) and h(t) with their own time dependence. After separating constant from time-dependent components in the permeability, the voltage-gated permeability obeys ¯ Pα (t) = m(t)r h(t)s such that 0 ≤ Pα (t) ≤ Pα , ¯ where r and s are positive, and Pα is the peak permeability to ion α when all channels for ion α are open. Gating is also referred to as activation, and the associated nonlinear permeabilities are called active. There are also passive, voltage-insensitive permeabilities that maintain the resting potential and depolarise the membrane to trigger action potentials. The simplest possible kinetics for the gating variables are first order, involving only a single derivative in time. The steady state of each gating variable at a given voltage is determined by a Boltzmann function, to which the gating variables evolve: dm r ¯ τm = Pα m∞ (V ) − m(t) dt dh and τh =h∞ (V ) − h(t), dt 2 −1 with m∞ (V ) = {1 + exp ((V − Vm )/sm )} the Boltzmann function described by the slope sm > −1 0 and the midpoint Vm ; similarly, h∞ (V ) = {1 + exp ((V − Vh )/sh )} , but with sh < 0. Scaling ¯ m∞ (V ) by the rth root of the peak permeability Pα is a matter of mathematical convenience. We will consider both voltage-independent and voltage-dependent time constants, either setting τj = τj,0 to be constant, where j ∈ {m(t), h(t)}, or imposing a bell-shaped voltage dependence τj (V ) = τj,0 sech [sj (V − Vj )] The synaptic, leak, and voltage-dependent currents drive the rate of change in the voltage across the membrane dV C = Isyn + Ileak + Iα , dt α where the synaptic permeability and leak permeability are held constant. 3 Resistive and capacitive components of the energy cost By treating the action potential as the charging and discharging of the cell membrane capacitance, the action potentials measured at the mossy fibre synapse in rats [4] or in mouse thalamocortical neurons [7] were found to be highly energy-efficient: the nonlinear, active conductances inject only slightly more current than is needed to charge a capacitor to the peak voltage of the action potential. The implicit assumption made here is that one can neglect the passive loss of current through the membrane resistance, known as the leak. Any passive loss must be compensated by additional charge, making this loss the primary target of the selection pressure that has shaped the dynamics of action potentials. On the other hand, the membrane capacitance at the site of AP initiation is generally modelled and experimentally confirmed [8] as being fairly constant around 1 µF/cm2 ; in contrast, the propagation, but not generation, of AP’s can be assisted by a reduction in the capacitance achieved by the myelin sheath that wraps some axons. As myelin would block the flow of ions, we posit that the specific capacitance cannot yield to selection pressure to minimise the work W = QNa (ENa − EK ) needed for AP generation. To address how the shape and dynamics of action potentials might have evolved to consume less energy, we first fix the action potential’s shape and solve for the minimum charge QNa ab initio, without treating the cell membrane as a pure capacitor. Regardless of the action potential’s particular time-course V (t), voltage-dependent ionic conductances must transfer Na+ and K+ charge to elicit an action potential. Figure 1 shows a generic action potential and the associated ionic currents, comparing the latter to the minimal currents required. The passive equivalent circuit for the neuron consists of a resistor in parallel with a capacitor, driven by a synaptic current. To charge the membrane to the peak voltage, a neuron in a high-conductance state [9, 10] may well lose more charge through the resistor than is stored on the capacitor. For neurons in a low-conductance state and for rapid voltage deflections from the resting potential, membrane capacitance will be the primary determinant of the charge. 4 The norm of spikes How close can voltage-gated channels with realistic properties come to the minimal currents? What time-course for the action potential leads to the smallest minimal currents? To answer these questions, we must solve a constrained optimization problem on the solutions to the nonlinear differential equations for the neuronal dynamics. To separate action potentials from mere small-amplitude oscillations in the voltage, we need to introduce a metric. Smaller action potentials consume less energy, provided the underlying currents are optimal, yet signalling between neurons depends on the action potential’s voltage deflection reaching a minimum amplitude. Given the importance of the action potential’s amplitude, we define an Lp norm on the voltage wave-form V (t) to emphasize the maximal voltage deflection: 1 p T V (t) − V p V (t) − V = 0 3 p dt , Generic Action Potential -10 + a V [mV] -20 -30 gsyn -40 -50 -60 0 2 4 6 8 t [ms] 10 12 14 16 gNa Active and Minimal Currents 100 gK + gleak C + + 80 2 current [µA/cm ] 60 b Active IK Minimum IK 40 20 0 -20 For a fixed action potential waveform V (t): Active INa Minimum INa -40 -60 Minimum INa (t) = −LV (t)θ(LV (t)) Minimum IK (t) = −LV (t)θ(−LV (t)) -80 -100 0 2 4 6 8 10 t [ms] 12 14 ˙ with LV (t) ≡ C V (t) + Ileak [V (t)] + Isyn [V (t)]. 16 c Qresistive/Qcapacitive Resistive vs. Capacitive Minimum Charge 1 0.5 0 0.2 0.4 0.6 0.8 1.0 1.2 leak conductance [mS/cm2] 1.4 Figure 1: To generate an action potential with an arbitrary time-course V (t), the nonlinear, timedependent permeabilities must deliver more charge than just to load the membrane capacitance— resistive losses must be compensated. (a) The action potential’s time-course in a generic HH model for a neuron, represented by the circuit diagram on the right. The peak of the action potential is ∼ 50 mV above the average potential. (b) The inward Na+ current, shown in green going in the negative direction, rapidly depolarizes the potential V (t) and yields the upstroke of the action potential. Concurrently, the K+ current activates, displayed as a positive deflection, and leads to the downstroke in the potential V (t). Inward and outward currents overlap significantly in time. The dotted lines within the region bounded by the solid lines represent the minimal Na+ current and the minimal K+ current needed to produce the V (t) spike waveform in (a). By the law of current conservation, the sum of capacitive, resistive, and synaptic currents, denoted by ˙ LV (t) ≡ C V (t) + Ileak [V (t)] + Isyn [V (t)], must be balanced by the active currents. If the cell’s passive properties, namely its capacitance and (leak) resistance, and the synaptic conductance are constant, we can deduce the minimal active currents needed to generate a specified V (t). The minimal currents, by definition, do not overlap in time. Taking into account passive current flow, restoring the concentration gradients after the action potential requires 29 nJ/cm2 . By contrast, if the active currents were optimal, the cost would be 8.9 nJ/cm2 . (c) To depolarize from the minimum to the maximum of the AP, the synaptic voltage-gated currents must deliver a charge Qcapacitive to charge the membrane capacitance and a charge Qresistive to compensate for the loss of current through leak channels. For a large leak conductance in the cell membrane, Qresistive can be larger than Qcapacitive . 4 where V is the average voltage. In the limit as p → ∞, the norm simply becomes the difference between the action potential’s peak voltage and the mean voltage, whereas a finite p ensures that the norm is differentiable. In parameter space, we will focus our attention to the manifold of action potentials with constant Lp norm with 2 p < ∞, which entails that the optimal action potential will have a finite, though possibly narrow width. To be close to the supremum norm, yet still have a norm that is well-behaved under differentiation, we decided to use p = 16. 5 Poincar´ -Lindstedt perturbation of periodic dynamical orbits e Standard (secular) perturbation theory diverges for periodic orbits, so we apply the PoincarLindstedt technique of expanding both in the period and the dynamics of the asymptotic orbit and then derive a set of adjoint sensitivity equations for the differential-algebraic system. Solving once for the adjoint functions, we can easily compute the parameter gradient of any functional on the orbit, even for thousands of parameters. ˙ We start with a set of ordinary differential equations x = F(x; p) for the neuron’s dynamics, an asymptotically periodic orbit xγ (t) that describes the action potential, and a functional G(x; p) on the orbit, representing the energy consumption, for instance. The functional can be written as an integral ω(p)−1 G(xγ ; p) = g(xγ (t); p) dt, 0 over some source term g(xγ (t); p). Assume that locally perturbing a parameter p ∈ p induces a smooth change in the stable limit cycle, preserving its existence. Generally, a perturbation changes not only the limit cycle’s path in state space, but also the average speed with which this orbit is traversed; as a consequence, the value of the functional depends on this change in speed, to lowest order. For simplicity, consider a single, scalar parameter p. G(xγ ; p) is the solution to ω(p)∂τ [G(xγ ; p)] = g(xγ ; p), where we have normalised time via τ = ω(p)t. Denoting partial derivatives by subscripts, we expand p → p + to get the O 1 equation dτ [Gp (xγ ; p)] + ωp g(xγ ; p) = gx (xγ ; p)xp + gp (xγ ; p) in a procedure known as the Poincar´ -Lindstedt method. Hence, e dG = dp ω −1 (gp + gx xp − ωp g) dt, 0 where, once again by the Poincar´ -Lindstedt method, xp is the solution to e ˙ xp =Fx (xγ )xp + Fp (xγ ) − ωp F (xγ ) . Following the approach described by Cao, Li, Petzold, and Serban (2003), introduce a Lagrange vector AG (x) and consider the augmented objective function ω −1 I(xγ ; p) = G(xγ ; p) − ˙ AG (xγ ). (F(xγ ) − xγ ) dt, 0 γ ˙ which is identical to G(x ; p) as F(x) − x = 0. Then dI(xγ ; p) = dp ω −1 ω −1 ˙ AG . (Fp + Fx xp − ωp F − xp ) dt. (gp + gx xp − ωp g) dt − 0 0 ˙ Integrating the AG (x).xp term by parts and using periodicity, we get dI(xγ ; p) = dp ω −1 ω −1 ˙ −gx + AG + AG .F xp dt. G gp − ωp g − A . (Fp − ωp F) dt − 0 0 5 Parameter ¯ peak permeability PNa ¯ peak permeability PK midpoint voltage Vm ∨ Vh slope sm ∨ (−sh ) time constant τm,0 ∨ τh,0 gating exponent r ∨ s minimum 0.24 fm/s 6.6 fm/s - 72 mV 3.33 mV 5 µs 0.2 maximum 0.15 µm/s 11 µm/s 70 mV 200 mV 200 ms 5.0 Table 1: Parameter limits. We can let the second term vanish by making the vector AG (x) obey ˙ AG (x) = −FT (x; p) AG (x) + gx (x; p). x Label the homogeneous solution (obtained by setting gx (xγ ; p) = 0) as Z(x). It is known that ω −1 the term ωp is given by ωp = ω 0 Z(x).Fp (x) dt, provided Z(x) is normalised to satisfy Z(x).F(x) = 1. We can add any multiple of the homogeneous solution Z(x) to the inhomogeneous solution, so we can always make ω −1 AG (x).F(x) dt = G 0 by taking ω −1 G G AG (x).F(x) dt − ωG . A (x) → A (x) − Z(x) (3) 0 This condition will make AG (x) unique. Finally, with eq. (3) we get dI(xγ ; p) dG(xγ ; p) = = dp dp ω −1 gp − AG . Fp dt. 0 The first term in the integral gives rise to the partial derivative ∂G(xγ ; p)/ ∂p. In many cases, this term is either zero, can be made zero, or at least made independent of the dynamical variables. The parameters for the neuron models are listed in Table 1 together with their minimum and maximum allowed values. For each parameter in the neuron model, an auxiliary parameter on the entire real line is introduced, and a mapping from the real line onto the finite range set by the biophysical limits is defined. Gradient descent on this auxiliary parameter space is performed by orthogonalizing the gradient dQα /dp to the gradient dL/dp of the norm. To correct for drift off the constraint manifold of constant norm, illustrated in Fig. 3, steps of gradient ascent or descent on the Lp norm are performed while keeping Qα constant. The step size during gradient descent is adjusted to assure that ∆Qα < 0 and that a periodic solution xγ exists after adapting the parameters. The energy landscape is locally convex (Fig. 3). 6 Predicting the Hodgkin-Huxley model We start with a single-compartment Goldman-Hodgkin-Katz model neuron containing voltage-gated Na+ and leak conductances (Figure 1). A tonic synaptic input to the model evokes repetitive firing of action potentials. We seek those parameters that minimize the ionic load for an action potential of constant norm—in other words, spikes whose height relative to the average voltage is fairly constant, subject to a trade-off with the spike width. The ionic load is directly proportional to the work W performed by the ion flux. All parameters governing the ion channels’ voltage dependence and kinetics, including their time constants, mid-points, slopes, and peak values, are subject to change. The simplest model capable of generating an action potential must have two dynamical variables and two time scales: one for the upstroke and another for the downstroke. If both Na+ and K+ currents 6 Transient Na Current Model Optimal Action Potential Falling Phase Currents 40 20 a τ [ms] 5 1 2 Q = 239 nC/cm PNa = m(t)h(t) PK = n(t) 0 -60 0 -20 τh τn 60 current [μA/cm2] V [mV] V [mV] -40 IK[V] Excess INa[V] Peak Resurgence 300 200 100 -60 -4 -2 0 0 4 40 20 τ [ms] 5 1 Q = 169 nC/cm2 PNa = m(t)h(t) PK = n(t) τi = τi(V) 0 -60 0 -20 τh τn 60 current [μA/cm2] 60 V [mV] -40 -60 -4 -2 0 2 t [ms] 0.5 0.75 IK[V] Excess INa[V] Peak Resurgence 200 100 0 4 0.25 40 5 1 PNa = m(t)h(t) s PK = n(t) τi = τi(V) 20 0 delay τ [ms] Q = 156 nC/cm2 current [μA/cm2] 60 -60 0 -20 τh τn 60 V [mV] -40 -60 t [ms] 0.5 t [ms] 0.75 Cooperative Gating Model Optimal Action Potential Falling Phase Currents V [mV] c 0.25 Voltage-dependent (In)activation Model Falling Phase Currents Optimal Action Potential V [mV] b 2 t [ms] -2 -1 0 t [ms] 1 750 500 250 0 0 2 IK[V] Excess INa[V] Peak Resurgence 0.2 t [ms] 0.4 Figure 2: Optimal spike shapes and currents for neuron models with different biophysical features. During optimization, the spikes were constrained to have constant norm V (t) − V 16 = 92 mV, which controls the height of the spike. Insets in the left column display the voltage-dependence of the optimized time constants for sodium inactivation and potassium activation; sodium activation is modeled as occurring instantaneously. (a) Model with voltage-dependent inactivation of Na+ ; time constants for the first order permeability kinetics are voltage-independent (inset). Inactivation turns off the Na+ current on the downstroke, but not completely: as the K+ current activates to repolarize the membrane, the inward Na+ current reactivates and counteracts the K+ current; the peak of the resurgent Na+ current is marked by a triangle. (b) Model with voltage-dependent time constants for the first order kinetics of activation and inactivation. The voltage dependence minimizes the resurgence of the Na+ current. (c) Power-law gating model with an inwardly rectifying potassium current replacing the leak current. The power law dependence introduces an effective delay in the onset of the K+ current, which further minimizes the overlap of Na+ and K+ currents in time. 7 Energy per Spike Surface of Constant Norm Spikes ya 10 16 V [mV] K 18 10 12 14 14 16 T b V [mV] 0 t [ms] 2 10 18 16 12 s [mV] K V [mV] K T a V [mV] 12 nJ/cm2 ≥ 16.5 16.3 16.3 yc 16 sK [mV] 100 0 -2 16.4 T c 100 0 -2 V [mV] 14 14 VE [nJ/cm2] yc ya τK [ms] yb 12 16.5 yb 20 0 t [ms] 2 100 0 -2 0 t [ms] 2 Figure 3: The energy required for an action potential three parameters governing potassium activation: the midpoint voltage VK , the slope sK , and the (maximum) time constant τK . The energy is the minimum work required to restore the ionic concentration gradients, as given by Eq. (1). Note that the energy within the constrained manifold of constant norm spikes is locally convex. are persistent, current flows in opposite directions at the same time, so that, even at the optimum, the ionic load is 1200 nC/cm2 . On the other hand, no voltage-gated K+ channels are even required for a spike, as long as Na+ channels activate on a fast time scale and inactivate on a slower time scale and the leak is powerful enough to repolarize the neuron. Even so, the load is still 520 nC/cm2 . While spikes require dynamics on two time scales, suppressing the overlap between inward and outward currents calls for a third time scale. The resulting dynamics are higher-dimensional and reduce the load to to 239 nC/cm2 . Making the activation and inactivation time constants voltage-dependent permits ion channels to latch to an open or closed state during the rising and falling phase of the spike, reducing the ionic load to 189 nC/cm2 (Fig. 2) . The minimal Na+ and K+ currents are separated in time, yet dynamics that are linear in the activation variables cannot enforce a true delay between the offset of the Na+ current and the onset of the K+ current. If current flow depends on multiple gates that need to be activated simultaneously, optimization can use the nonlinearity of multiplication to introduce a delay in the rise of the K+ current that abolishes the overlap, and the ionic load drops to 156 nC/cm2 . Any number of kinetic schemes for the nonlinear permeabilities Pα can give rise to the same spike waveform V (t), including the simplest two-dimensional one. Yet only the full Hodgkin-Huxley (HH) model, with its voltage-dependent kinetics that prevent the premature resurgence of inward current and cooperative gating that delays the onset of the outward current, minimizes the energetic cost. More complex models, in which voltage-dependent ion channels make transitions between multiple closed, inactivated, and open states, instantiate the energy-conserving features of the HH system at the molecular level. Furthermore, features that are eliminated during optimization, such as a voltage-dependent inactivation of the outward potassium current, are also not part of the delayed rectifier potassium current in the Hodgkin-Huxley framework. 8 References [1] Paul De Weer, David C. Gadsby, and R. F. Rakowski. Voltage dependence of the na-k pump. Ann. Rev. Physiol., 50:225–241, 1988. [2] B. Frankenhaeuser and A. F. Huxley. The action potential in the myelinated nerve fibre of xenopus laevis as computed on the basis of voltage clamp data. J. Physiol., 171:302–315, 1964. [3] Samuel S.-H. Wang, Jennifer R. Shultz, Mark J. Burish, Kimberly H. Harrison, Patrick R. Hof, Lex C. Towns, Matthew W. Wagers, and Krysta D. Wyatt. Functional trade-offs in white matter axonal scaling. J. Neurosci., 28(15):4047–4056, 2008. [4] Henrik Alle, Arnd Roth, and J¨ rg R. P. Geiger. Energy-efficient action potentials in hippocamo pal mossy fibers. Science, 325(5946):1405–1408, 2009. [5] D. E. Goldman. Potential, impedance and rectification in membranes. J. Gen. Physiol., 27:37– 60, 1943. [6] A. L. Hodgkin and B. Katz. The effect of sodium ions on the electrical activity of the giant axon of the squid. J. Physiol., 108:37–77, 1949. [7] Brett C. Carter and Bruce P. Bean. Sodium entry during action potentials of mammalian neurons: Incomplete inactivation and reduced metabolic efficiency in fast-spiking neurons. Neuron, 64(6):898–909, 2009. [8] Luc J. Gentet, Greg J. Stuart, and John D. Clements. Direct measurement of specific membrane capacitance in neurons. Biophys. J., 79:314–320, 2000. [9] Alain Destexhe, Michael Rudolph, and Denis Par´ . The high-conductance state of neocortical e neurons in vivo. Nature Neurosci. Rev., 4:739–751, 2003. [10] Bilal Haider and David A. McCormick. Rapid neocortical dynamics: Cellular and network mechanisms. Neuron, 62:171–189, 2009. 9

6 0.41993114 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

7 0.41549203 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

8 0.41213 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

9 0.41130763 258 nips-2011-Sparse Bayesian Multi-Task Learning

10 0.41100278 158 nips-2011-Learning unbelievable probabilities

11 0.41064584 229 nips-2011-Query-Aware MCMC

12 0.40988109 249 nips-2011-Sequence learning with hidden units in spiking neural networks

13 0.40970829 75 nips-2011-Dynamical segmentation of single trials from population neural data

14 0.40812477 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

15 0.40784219 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

16 0.40749598 68 nips-2011-Demixed Principal Component Analysis

17 0.40734607 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

18 0.40649268 219 nips-2011-Predicting response time and error rates in visual search

19 0.40632772 276 nips-2011-Structured sparse coding via lateral inhibition

20 0.40599114 180 nips-2011-Multiple Instance Filtering