nips nips2008 nips2008-224 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jim C. Huang, Brendan J. Frey
Abstract: Ranking is at the heart of many information retrieval applications. Unlike standard regression or classification in which we predict outputs independently, in ranking we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Unlike standard regression or classification in which we predict outputs independently, in ranking we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. [sent-8, score-0.597]
2 In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. [sent-9, score-0.634]
3 We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. [sent-11, score-0.146]
4 We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank. [sent-12, score-0.293]
5 1 Introduction Ranking is the central problem for many information retrieval applications such as web search, collaborative filtering and document retrieval [8]. [sent-13, score-0.125]
6 In these problems, we are given a set of objects to be ranked and a series of observations where each observation consists of some subset of the objects, a feature vector and some ordering of the objects with highly ranked objects corresponding to a higher relevance or degree of importance. [sent-14, score-0.635]
7 The goal is to then learn a model which allows us to assign a score to new test objects: this often takes the form of a ranking function [2, 4] which assigns a higher score to objects with higher rankings. [sent-15, score-0.345]
8 This requires measures of loss which are multivariate and structured. [sent-17, score-0.106]
9 However, such ranking measures are typically difficult to optimize directly [3], making the problem of learning difficult. [sent-18, score-0.255]
10 A previous approach has been to treat the problem as one of structured prediction [7], where the aim is to directly optimize ranking measures. [sent-19, score-0.362]
11 Another approach has been to approximate these ranking measures with smooth differentiable loss functionals by formulating probabilistic models on pairwise preferences between objects (RankNet; [2]), or on ordered lists of objects (ListNet and ListMLE; [4, 13]). [sent-20, score-0.895]
12 In practice, these methods either require approximating a learning problem with an intractable number of constraints, or they require observations containing complete orderings over the objects to be ranked or one must make independence assumptions on pairwise preferences. [sent-21, score-0.458]
13 We will show that 1) a probability over orderings is equivalent to a probability over pairwise inequalities between objects to be ranked and 2) this amounts to specifying a joint cumulative distribution function (CDF) over pairwise object preferences. [sent-23, score-0.648]
14 We will present a framework for ranking using the recently-developed probabilistic graphical modelling framework of CDNs which compactly represents this joint CDF as a product of local functions [5]. [sent-24, score-0.504]
15 While the problem of inference in CDNs was addressed in [5], here we address the problem of learning in CDNs in the context of ranking learning where we estimate model parameters under a structured loss functional that accounts for dependencies between pairwise object preferences. [sent-25, score-0.668]
16 We will then test the proposed framework on the OHSUMED dataset [8], a benchmark dataset used in information retrieval research. [sent-26, score-0.169]
17 Finally we will show that the frameworks proposed by [2, 4, 13] can be viewed as particular types of CDNs so that novel classes of flexible structured loss functionals for ranking learning can be specified under our framework. [sent-27, score-0.517]
18 2 Cumulative distribution networks The CDN [5] is an undirected graphical model in which the joint CDF F (z) over a set of random variables is represented as a product over functions defined over subsets of these variables. [sent-28, score-0.126]
19 An example of a CDN is shown in Figure 1(a), along with an example bivariate density which can be obtained by differentiating a product of 2 Gaussian CDF functions (Figure 1(b)). [sent-30, score-0.11]
20 Thus, in order for the CDN to represent a valid CDF, it is sufficient that each of the local functions φc satisfy all of the properties of a multivariate CDF. [sent-32, score-0.083]
21 In a CDN, disjoint sets of variables A, B are marginally independent if they share no functions in common, and disjoint sets of variables A, B are conditionally independent given variable set C if no path linking any variable in A to any variable in B passes through C. [sent-34, score-0.134]
22 In addition, marginalization of variables in a CDN can be done in constant-time via a trivial maximization of the joint CDF with respect to the variables being marginalized. [sent-35, score-0.107]
23 The CDN framework provides us with a means to compactly represent multivariate joint CDFs over many variables: in the next section we will formulate a loss functional for learning to rank which takes on such a form. [sent-38, score-0.377]
24 3 Structured loss functionals for ranking learning We now proceed to formulate the problem of learning to rank in a structured setting. [sent-40, score-0.579]
25 Suppose we wish to rank N nodes in the set V = {V1 , · · · , VN } and we are given a set of observations D1 , · · · , DT . [sent-41, score-0.222]
26 Each observation Dt consists of an ordering over the nodes in a subset Vt ⊆ V, where each node is provided with a corresponding feature vector x ∈ RL which may be specific to the given observation. [sent-42, score-0.401]
27 The orderings could be provided in the form of ordinal node labels1 , or in the form of pairwise node preferences. [sent-43, score-0.499]
28 The orderings can be represented as a directed graph over the nodes in which a directed edge e = (Vi → Vj ) is drawn between 2 nodes Vi , Vj iff Vi is preferred to node Vj , which we denote as Vi Vj . [sent-44, score-0.564]
29 In general, we assume that for any given observation, we observe a partial ordering over nodes, with complete orderings being a special case. [sent-45, score-0.244]
30 We denote the above graph consisting of edges e = (Vi → Vj ) ∈ Et and the node set Vt as the order graph Gt = (Vt , Et ) for observation Dt so that Dt = {Gt , {xt }Vn ∈Vt }. [sent-46, score-0.334]
31 A toy example of an observation n over 4 nodes is shown in Figure 2(a). [sent-47, score-0.166]
32 Note that under this framework, the absence of an edge between two nodes Vi , Vj in the order graph indicates we cannot assert any preference between the two nodes for the given observation. [sent-48, score-0.49]
33 (a) (b) Figure 2: a) An example of an order graph over 4 nodes V1 , V2 , V3 , V4 corresponding to the objects to be ranked. [sent-49, score-0.3]
34 The graph represents the set of preference relationships V1 V2 , V 1 V3 , V 1 V4 , V 2 V4 , V 3 V4 ; b) Learning the ranking function from training data. [sent-50, score-0.472]
35 The training data consists of a set of order graphs over subsets of the objects to be ranked. [sent-51, score-0.154]
36 For order graph, the ranking function ρ maps each node to the real line . [sent-52, score-0.369]
37 The goal is to learn ρ such that we minimize our probability of misranking on test observations. [sent-53, score-0.06]
38 We now define ρ : V → R as a ranking function which assigns scores to nodes via their feature vectors so that for node Vi , Si = ρ(Vi ) + πi (2) where Si is a scalar and πi is a random variable specific to node Vi . [sent-54, score-0.587]
39 We wish to learn such a function given multiple observations D1 , · · · , DT so that we minimize the probability of misranking on test observations (Figure 2(b)). [sent-55, score-0.114]
40 The above model allows us to account for the fact that the amount of uncertainty about a node’s rank may depend on unobserved features for that node (e. [sent-56, score-0.18]
41 Under this model, the preference relation Vi Vj is completely equivalent to ρ(Vi ) + πi ≥ ρ(Vj ) + πj ⇔ πij = πj − πi ≤ ρ(Vi ) − ρ(Vj ). [sent-59, score-0.136]
42 (3) where we have defined πij as a preference variable between nodes Vi , Vj . [sent-60, score-0.244]
43 For each edge e = (Vi → Vj ) ∈ Et in the order graph, we can define r(ρ; e, Dt ) ≡ ρ(Vi ) − ρ(Vj ) and collect these into the vector r(ρ; Gt ) ∈ R|Et | . [sent-61, score-0.057]
44 Having defined the preferences, we must select an appropriate loss measure. [sent-63, score-0.066]
45 A sensible metric here [13] is the joint 1 It is crucial to note that node labels may in general not be directly comparable with one another from one observation to the next (e. [sent-64, score-0.264]
46 : documents with the same rating might not truly have the same degree of relevance for different queries), or the scale of the labels may be arbitrary. [sent-66, score-0.1]
47 probability of observing the order graph Gt = (Vt , Et ) corresponding to the partial ordering of nodes in Vt . [sent-67, score-0.337]
48 From Equation (3), this will take the form of a probability measure over events of the type πe ≤ r(ρ; e, Dt ) so that we obtain P r{Et |Vt , ρ} = P r [πe ≤ r(ρ; e, Dt )] = Fπ r(ρ; Gt ) , (4) e∈Et where Fπ is the joint CDF over the preference variables πe . [sent-68, score-0.217]
49 Given an observation Dt , the goal is to learn the ranking function ρ by maximizing Equation (4). [sent-69, score-0.313]
50 Note that under this framework, the set of edges Et corresponding to the set of pairwise preferences are treated as random variables which may have a high degree of dependence between one another, so that Fπ r(ρ; Gt ) is a joint CDF over multiple pairwise preferences. [sent-70, score-0.499]
51 The problem of learning the ranking function then consists of scoring multiple nodes simultaneously whilst accounting for dependencies between node scores. [sent-71, score-0.547]
52 (6) In general, the above structured loss functional may be difficult to specify, as it takes on the form of a joint CDF over many random variables with a high degree of inter-dependency which may require a large number of parameters to specify. [sent-74, score-0.298]
53 We can, however, compactly represent this using the CDN framework, as we will now show. [sent-75, score-0.054]
54 1 Tranforming order graphs into CDNs Figure 3: Transforming the order graph Gt into a CDN. [sent-77, score-0.123]
55 For each edge e = (Vi → Vj ) in the order graph (left), a preference variable πij is created. [sent-78, score-0.274]
56 All such random variables are then connected to one another in a CDN (right), allowing for complex dependencies between preferences. [sent-79, score-0.102]
57 The representation of the structured loss functional in Equation (5) as a CDN consists of transforming the order graph Gt for a each observation into a set of variable nodes in a CDN. [sent-80, score-0.527]
58 More precisely, for each edge e = (Vi → Vj ) in the order graph, the preference variable πij is created. [sent-81, score-0.193]
59 All such variables are then connected to one another in a CDN (Figure 3), where the pattern of connectivity used will determine the set of dependencies between these preferences πij as given by the marginal and conditional independence properties of CDNs [5]. [sent-82, score-0.317]
60 Thus for any given CDN topology, each preference node πe is a member of some neighborhood of preference nodes πe so that neighboring preferences nodes are marginally dependent of one another. [sent-83, score-0.81]
61 One possible concern here is that we may require a fully connected CDN topology over all possible pairwise preferences between all nodes in order to capture all of these dependencies, leading to a model which is cumbersome to learn. [sent-84, score-0.48]
62 Furthermore, we do not have to store a large CDN in memory during training, as we only need to store a single CDN over a relatively small number of preference variables for the current observation. [sent-86, score-0.202]
63 We can thus perform ranking learning in an online fashion by constructing a single CDN for each observation Dt and optimizing the loss − log Fπ r(ρ; Gt ) defined by that CDN for the given observation. [sent-87, score-0.379]
64 4 StructRank: a probabilistic model for structured ranking learning with node labels Suppose now that each node in the training set is provided with an ordinal node label y along with a feature vector x. [sent-88, score-0.769]
65 For any given order graph over some subset of the nodes, the node labels y allow us to establish edges in the order graph, so that an edge Vi → Vj exists between two nodes Vi , Vj iff yi > yj . [sent-89, score-0.42]
66 Consider now an edge e = (Vi → Vj ) in the order graph and define re ≡ 1 L re (a; Dt ) = ρ(xt ; a) − ρ(xt ; a). [sent-91, score-0.182]
67 For the given CDN and ranking functions, the learning problem for the current observation Dt then becomes log 1 + exp − w1 re (a; Dt ) + exp − w2 re (a; Dt ) inf θ t s. [sent-94, score-0.403]
68 The gradient ∇a L(θ; Dt ) and the derivatives with respect to the CDN function weights w1 , w2 for a given observation Dt are provided in the Supplementary Information. [sent-98, score-0.083]
69 5 Results To compare the performance of our proposed framework to other methods, we will use the following three metrics commonly in use in information retrieval research: Precision, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) [6]. [sent-99, score-0.079]
70 The NDCG accounts for the fact that less relevant documents are less likely to be examine by a user by putting more weight on highly relevant documents than marginally relevant ones. [sent-100, score-0.131]
71 We downloaded the OHSUMED dataset provided as part of the LETOR 2. [sent-101, score-0.051]
72 The dataset consists of a set of 106 query-document pairs, with a feature vector and relevance judgment (a) (b) (c) Figure 4: a) Average NDCG as a function of truncation level n for the OHSUMED dataset. [sent-103, score-0.1]
73 NDCG values are averaged over 5 cross-validation splits; b) Mean average precision (MAP) as a function of truncation level n; c) Mean average precision value for several methods. [sent-104, score-0.085]
74 There are a total of 16,140 query-document pairs with relevance judgments provided by humans on three ordinal levels: definitely relevant, partially relevant or not relevant. [sent-106, score-0.112]
75 For any given query, we used the ordinal labels y for each document in the query in order to establish preferences between documents for that query. [sent-107, score-0.39]
76 Each node in the order graph is provided with 25 query-specific features including term frequency, document length, BM25 and LMIR features as well as combinations thereof [1, 11, 14]. [sent-108, score-0.249]
77 In accordance with the nomenclature above, we use the terms query and observation interchangeably. [sent-109, score-0.081]
78 The OHSUMED dataset is provided in the form of 5 training/validation/test splits of sizes 63/21/22 observations each. [sent-110, score-0.078]
79 To ensure that features are comparable across all observations, we normalized each feature vector within each observation as described in [8]. [sent-111, score-0.058]
80 We tested a fully connected CDN which models full interdependence between preferences, and a completely disconnected CDN which models preferences independently of one another. [sent-118, score-0.247]
81 With the exception of ListNet and ListMLE, none of the above methods explicitly model dependencies between pairwise preferences. [sent-122, score-0.162]
82 As can be seen, accounting for dependencies between pairwise preferences provides a significant gain in performance compared to modellling preferences as being independent. [sent-123, score-0.568]
83 6 Discussion We have proposed here a novel framework for ranking learning using structured loss functionals. [sent-126, score-0.459]
84 We have shown that the problem of learning to rank can be reduced to maximizing a joint CDF over multiple pairwise preferences. [sent-127, score-0.255]
85 We have shown how to compactly represent this using the CDN framework and have applied it to the OHSUMED benchmark dataset. [sent-128, score-0.123]
86 We have demonstrated that representing the dependencies between pairwise preferences leads to improved performance over modelling preferences as being independent of one another. [sent-129, score-0.577]
87 1 Relation to RankNet and ListNet/ListMLE The probability models for ranking proposed by [2, 4, 13] can all be expressed as special cases of models defined by different CDNs. [sent-131, score-0.255]
88 In the case of RankNet [2], the corresponding probability over a given pairwise preference Vi Vj is modelled by a logistic function of ρ(xi ) − ρ(xj ) and the model was optimized using cross-entropy loss. [sent-132, score-0.249]
89 The joint probability of preferences can thus be represented as a completely disconnected CDN with logistic functions in which all pairwise object preferences are treated as being independent. [sent-133, score-0.636]
90 As noted by the authors of [13], the above model is also an example of the Plackett-Luce class of probability models over object scores [9]. [sent-135, score-0.072]
91 In addition, the ListNet/ListMLE frameworks both require a complete ordering over objects by definition: under the CDN framework, we can model partial orderings, with complete orderings as a special case. [sent-136, score-0.379]
92 Our proposed framework unifies the above i=1 views of ranking as different instantiations of a joint CDF over pairwise preferences and hence as particular types of CDNs. [sent-138, score-0.646]
93 This allows us to consider flexible joint CDFs defined over different subsets of object preferences and over different families of CDN functions so as to capture various data specific properties. [sent-139, score-0.326]
94 In [13], it was shown that the loglikelihood corresponding to the probability of an ordering is a good surrogate to the 0-1 loss between the predicted ordering and the true ordering, as the former is differentiable and penalizes mis-orderings in a sensible way. [sent-142, score-0.258]
95 One could investigate connections between the structured loss functionals proposed in this paper and other ranking measures such as NDCG. [sent-143, score-0.492]
96 Another possible direction is to generalize StructRank to products over Gaussian multivariate CDFs or other classes of functions which satisfy the requirements of CDN functions , as in this paper we have elected to use a product of bivariate sigmoids φ(re , re ) to represent our loss functional. [sent-144, score-0.232]
97 In addition, we have only investigated representing the loss functional using a single CDN function: this could easily be generalized to K functions. [sent-146, score-0.11]
98 Learning to rank: from pairwise approach to listwise approach. [sent-182, score-0.145]
99 LETOR: Benchmark dataset for research on learning to rank for information retrieval. [sent-206, score-0.113]
100 Listwise approach to learning to rank - theory and algorithm. [sent-239, score-0.087]
wordName wordTfidf (topN-words)
[('cdn', 0.644), ('ranking', 0.255), ('cdns', 0.199), ('preferences', 0.192), ('cdf', 0.192), ('vj', 0.175), ('dt', 0.166), ('gt', 0.164), ('vi', 0.158), ('preference', 0.136), ('listmle', 0.119), ('orderings', 0.119), ('pairwise', 0.113), ('nodes', 0.108), ('structured', 0.107), ('ranknet', 0.1), ('node', 0.093), ('objects', 0.09), ('ohsumed', 0.089), ('listnet', 0.087), ('rank', 0.087), ('graph', 0.081), ('ordering', 0.078), ('vt', 0.069), ('ranked', 0.066), ('loss', 0.066), ('functionals', 0.064), ('letor', 0.064), ('ndcg', 0.064), ('misranking', 0.06), ('bivariate', 0.06), ('cdfs', 0.06), ('observation', 0.058), ('cumulative', 0.058), ('ordinal', 0.056), ('joint', 0.055), ('compactly', 0.054), ('zc', 0.054), ('dependencies', 0.049), ('retrieval', 0.048), ('documents', 0.047), ('functional', 0.044), ('vn', 0.043), ('jim', 0.04), ('structrank', 0.04), ('multivariate', 0.04), ('scores', 0.038), ('benchmark', 0.038), ('marginally', 0.037), ('toronto', 0.036), ('edge', 0.036), ('sensible', 0.036), ('ij', 0.035), ('object', 0.034), ('listwise', 0.032), ('burges', 0.032), ('framework', 0.031), ('precision', 0.031), ('modelling', 0.031), ('relevance', 0.031), ('qin', 0.03), ('document', 0.029), ('monotonically', 0.028), ('disconnected', 0.028), ('frey', 0.028), ('differentiating', 0.028), ('liu', 0.027), ('observations', 0.027), ('connected', 0.027), ('outputs', 0.027), ('partial', 0.027), ('variables', 0.026), ('dataset', 0.026), ('provided', 0.025), ('frameworks', 0.025), ('probabilistic', 0.025), ('independence', 0.023), ('query', 0.023), ('subsets', 0.023), ('exp', 0.023), ('proceedings', 0.023), ('passes', 0.023), ('truncation', 0.023), ('et', 0.022), ('re', 0.022), ('accounting', 0.022), ('transforming', 0.022), ('functions', 0.022), ('labels', 0.022), ('observing', 0.022), ('icml', 0.021), ('order', 0.021), ('consists', 0.02), ('sigir', 0.02), ('complete', 0.02), ('store', 0.02), ('exible', 0.02), ('subset', 0.019), ('topology', 0.019), ('iff', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 224 nips-2008-Structured ranking learning using cumulative distribution networks
Author: Jim C. Huang, Brendan J. Frey
Abstract: Ranking is at the heart of many information retrieval applications. Unlike standard regression or classification in which we predict outputs independently, in ranking we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank. 1
2 0.26559818 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
Author: Tao Qin, Tie-yan Liu, Xu-dong Zhang, De-sheng Wang, Hang Li
Abstract: This paper studies global ranking problem by learning to rank methods. Conventional learning to rank methods are usually designed for ‘local ranking’, in the sense that the ranking model is defined on a single object, for example, a document in information retrieval. For many applications, this is a very loose approximation. Relations always exist between objects and it is better to define the ranking model as a function on all the objects to be ranked (i.e., the relations are also included). This paper refers to the problem as global ranking and proposes employing a Continuous Conditional Random Fields (CRF) for conducting the learning task. The Continuous CRF model is defined as a conditional probability distribution over ranking scores of objects conditioned on the objects. It can naturally represent the content information of objects as well as the relation information between objects, necessary for global ranking. Taking two specific information retrieval tasks as examples, the paper shows how the Continuous CRF method can perform global ranking better than baselines.
3 0.16542496 190 nips-2008-Reconciling Real Scores with Binary Comparisons: A New Logistic Based Model for Ranking
Author: Nir Ailon
Abstract: The problem of ranking arises ubiquitously in almost every aspect of life, and in particular in Machine Learning/Information Retrieval. A statistical model for ranking predicts how humans rank subsets V of some universe U . In this work we define a statistical model for ranking that satisfies certain desirable properties. The model automatically gives rise to a logistic regression based approach to learning how to rank, for which the score and comparison based approaches are dual views. This offers a new generative approach to ranking which can be used for IR. There are two main contexts for this work. The first is the theory of econometrics and study of statistical models explaining human choice of alternatives. In this context, we will compare our model with other well known models. The second context is the problem of ranking in machine learning, usually arising in the context of information retrieval. Here, much work has been done in the discriminative setting, where different heuristics are used to define ranking risk functions. Our model is built rigorously and axiomatically based on very simple desirable properties defined locally for comparisons, and automatically implies the existence of a global score function serving as a natural model parameter which can be efficiently fitted to pairwise comparison judgment data by solving a convex optimization problem. 1
4 0.14173627 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Author: Shai Shalev-shwartz, Sham M. Kakade
Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1
5 0.13785914 10 nips-2008-A rational model of preference learning and choice prediction by children
Author: Christopher G. Lucas, Thomas L. Griffiths, Fei Xu, Christine Fawcett
Abstract: Young children demonstrate the ability to make inferences about the preferences of other agents based on their choices. However, there exists no overarching account of what children are doing when they learn about preferences or how they use that knowledge. We use a rational model of preference learning, drawing on ideas from economics and computer science, to explain the behavior of children in several recent experiments. Specifically, we show how a simple econometric model can be extended to capture two- to four-year-olds’ use of statistical information in inferring preferences, and their generalization of these preferences. 1
6 0.12300853 239 nips-2008-Tighter Bounds for Structured Estimation
7 0.10414447 72 nips-2008-Empirical performance maximization for linear rank statistics
8 0.089316294 78 nips-2008-Exact Convex Confidence-Weighted Learning
9 0.071507044 171 nips-2008-Online Prediction on Large Diameter Graphs
10 0.071282543 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
11 0.06383463 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
12 0.063303262 168 nips-2008-Online Metric Learning and Fast Similarity Search
13 0.06232835 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
14 0.060685195 73 nips-2008-Estimating Robust Query Models with Convex Optimization
15 0.06053894 214 nips-2008-Sparse Online Learning via Truncated Gradient
16 0.058213502 137 nips-2008-Modeling Short-term Noise Dependence of Spike Counts in Macaque Prefrontal Cortex
17 0.05760204 184 nips-2008-Predictive Indexing for Fast Search
18 0.056008652 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models
19 0.055793043 65 nips-2008-Domain Adaptation with Multiple Sources
20 0.055549454 26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference
topicId topicWeight
[(0, -0.183), (1, -0.044), (2, -0.047), (3, -0.002), (4, -0.048), (5, -0.065), (6, 0.209), (7, -0.032), (8, 0.074), (9, -0.001), (10, -0.179), (11, 0.014), (12, -0.075), (13, 0.088), (14, 0.15), (15, -0.047), (16, 0.056), (17, -0.075), (18, 0.071), (19, 0.257), (20, -0.035), (21, -0.053), (22, -0.015), (23, 0.049), (24, 0.04), (25, -0.069), (26, -0.198), (27, 0.054), (28, 0.078), (29, -0.032), (30, -0.006), (31, -0.131), (32, -0.113), (33, 0.128), (34, 0.08), (35, -0.002), (36, -0.091), (37, -0.041), (38, -0.007), (39, -0.022), (40, -0.011), (41, -0.037), (42, -0.053), (43, 0.016), (44, -0.046), (45, -0.069), (46, 0.073), (47, -0.003), (48, -0.017), (49, -0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.94645381 224 nips-2008-Structured ranking learning using cumulative distribution networks
Author: Jim C. Huang, Brendan J. Frey
Abstract: Ranking is at the heart of many information retrieval applications. Unlike standard regression or classification in which we predict outputs independently, in ranking we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank. 1
2 0.83246839 190 nips-2008-Reconciling Real Scores with Binary Comparisons: A New Logistic Based Model for Ranking
Author: Nir Ailon
Abstract: The problem of ranking arises ubiquitously in almost every aspect of life, and in particular in Machine Learning/Information Retrieval. A statistical model for ranking predicts how humans rank subsets V of some universe U . In this work we define a statistical model for ranking that satisfies certain desirable properties. The model automatically gives rise to a logistic regression based approach to learning how to rank, for which the score and comparison based approaches are dual views. This offers a new generative approach to ranking which can be used for IR. There are two main contexts for this work. The first is the theory of econometrics and study of statistical models explaining human choice of alternatives. In this context, we will compare our model with other well known models. The second context is the problem of ranking in machine learning, usually arising in the context of information retrieval. Here, much work has been done in the discriminative setting, where different heuristics are used to define ranking risk functions. Our model is built rigorously and axiomatically based on very simple desirable properties defined locally for comparisons, and automatically implies the existence of a global score function serving as a natural model parameter which can be efficiently fitted to pairwise comparison judgment data by solving a convex optimization problem. 1
3 0.80643076 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
Author: Tao Qin, Tie-yan Liu, Xu-dong Zhang, De-sheng Wang, Hang Li
Abstract: This paper studies global ranking problem by learning to rank methods. Conventional learning to rank methods are usually designed for ‘local ranking’, in the sense that the ranking model is defined on a single object, for example, a document in information retrieval. For many applications, this is a very loose approximation. Relations always exist between objects and it is better to define the ranking model as a function on all the objects to be ranked (i.e., the relations are also included). This paper refers to the problem as global ranking and proposes employing a Continuous Conditional Random Fields (CRF) for conducting the learning task. The Continuous CRF model is defined as a conditional probability distribution over ranking scores of objects conditioned on the objects. It can naturally represent the content information of objects as well as the relation information between objects, necessary for global ranking. Taking two specific information retrieval tasks as examples, the paper shows how the Continuous CRF method can perform global ranking better than baselines.
4 0.48725218 239 nips-2008-Tighter Bounds for Structured Estimation
Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola
Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1
5 0.42584723 10 nips-2008-A rational model of preference learning and choice prediction by children
Author: Christopher G. Lucas, Thomas L. Griffiths, Fei Xu, Christine Fawcett
Abstract: Young children demonstrate the ability to make inferences about the preferences of other agents based on their choices. However, there exists no overarching account of what children are doing when they learn about preferences or how they use that knowledge. We use a rational model of preference learning, drawing on ideas from economics and computer science, to explain the behavior of children in several recent experiments. Specifically, we show how a simple econometric model can be extended to capture two- to four-year-olds’ use of statistical information in inferring preferences, and their generalization of these preferences. 1
6 0.37737879 72 nips-2008-Empirical performance maximization for linear rank statistics
7 0.36107785 106 nips-2008-Inferring rankings under constrained sensing
8 0.35246646 134 nips-2008-Mixed Membership Stochastic Blockmodels
9 0.33049908 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
10 0.32242903 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
11 0.3106907 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
12 0.31064427 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
13 0.29446688 225 nips-2008-Supervised Bipartite Graph Inference
14 0.2921693 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
15 0.29144177 248 nips-2008-Using matrices to model symbolic relationship
16 0.27829283 169 nips-2008-Online Models for Content Optimization
17 0.27490288 69 nips-2008-Efficient Exact Inference in Planar Ising Models
18 0.26747516 73 nips-2008-Estimating Robust Query Models with Convex Optimization
19 0.26497626 236 nips-2008-The Mondrian Process
20 0.26390666 78 nips-2008-Exact Convex Confidence-Weighted Learning
topicId topicWeight
[(6, 0.072), (7, 0.058), (12, 0.031), (15, 0.304), (28, 0.229), (57, 0.046), (63, 0.034), (71, 0.011), (77, 0.029), (83, 0.068)]
simIndex simValue paperId paperTitle
1 0.97713232 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
Author: Quentin J. Huys, Joshua Vogelstein, Peter Dayan
Abstract: Decision making lies at the very heart of many psychiatric diseases. It is also a central theoretical concern in a wide variety of fields and has undergone detailed, in-depth, analyses. We take as an example Major Depressive Disorder (MDD), applying insights from a Bayesian reinforcement learning framework. We focus on anhedonia and helplessness. Helplessness—a core element in the conceptualizations of MDD that has lead to major advances in its treatment, pharmacological and neurobiological understanding—is formalized as a simple prior over the outcome entropy of actions in uncertain environments. Anhedonia, which is an equally fundamental aspect of the disease, is related to the effective reward size. These formulations allow for the design of specific tasks to measure anhedonia and helplessness behaviorally. We show that these behavioral measures capture explicit, questionnaire-based cognitions. We also provide evidence that these tasks may allow classification of subjects into healthy and MDD groups based purely on a behavioural measure and avoiding any verbal reports. There are strong ties between decision making and psychiatry, with maladaptive decisions and behaviors being very prominent in people with psychiatric disorders. Depression is classically seen as following life events such as divorces and job losses. Longitudinal studies, however, have revealed that a significant fraction of the stressors associated with depression do in fact follow MDD onset, and that they are likely due to maladaptive behaviors prominent in MDD (Kendler et al., 1999). Clinically effective ’talking’ therapies for MDD such as cognitive and dialectical behavior therapies (DeRubeis et al., 1999; Bortolotti et al., 2008; Gotlib and Hammen, 2002; Power, 2005) explicitly concentrate on altering patients’ maladaptive behaviors and decision making processes. Decision making is a promising avenue into psychiatry for at least two more reasons. First, it offers powerful analytical tools. Control problems related to decision making are prevalent in a huge diversity of fields, ranging from ecology to economics, computer science and engineering. These fields have produced well-founded and thoroughly characterized frameworks within which many issues in decision making can be framed. Here, we will focus on framing issues identified in psychiatric settings within a normative decision making framework. Its second major strength comes from its relationship to neurobiology, and particularly those neuromodulatory systems which are powerfully affected by all major clinically effective pharmacotherapies in psychiatry. The understanding of these systems has benefited significantly from theoretical accounts of optimal control such as reinforcement learning (Montague et al., 1996; Kapur and Remington, 1996; Smith et al., 1999; Yu and Dayan, 2005; Dayan and Yu, 2006). Such accounts may be useful to identify in more specific terms the roles of the neuromodulators in psychiatry (Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008; Dayan and Huys, 2008). ∗ qhuys@cantab.net, joshuav@jhu.edu, dayan@gatsby.ucl.ac.uk; www.gatsby.ucl.ac.uk/∼qhuys/pub.html 1 Master Yoked Control Figure 1: The learned helplessness (LH) paradigm. Three sets of rats are used in a sequence of two tasks. In the first task, rats are exposed to escapable or inescapable shocks. Shocks come on at random times. The master rat is given escapable shocks: it can switch off the shock by performing an action, usually turning a wheel mounted in front of it. The yoked rat is exposed to precisely the same shocks as the master rat, i.e its shocks are terminated when the master rat terminates the shock. Thus its shocks are inescapable, there is nothing it can do itself to terminate them. A third set of rats is not exposed to shocks. Then, all three sets of rats are exposed to a shuttlebox escape task. Shocks again come on at random times, and rats have to shuttle to the other side of the box to terminate the shock. Only yoked rats fail to acquire the escape response. Yoked rats generally fail to acquire a wide variety of instrumental behaviours, either determined by reward or, as here, by punishment contingencies. This paper represents an initial attempt at validating this approach experimentally. We will frame core notions of MDD in a reinforcement learning framework and use it to design behavioral decision making experiments. More specifically, we will concentrate on two concepts central to current thinking about MDD: anhedonia and learned helplessness (LH, Maier and Seligman 1976; Maier and Watkins 2005). We formulate helplessness parametrically as prior beliefs on aspects of decision trees, and anhedonia as the effective reward size. This allows us to use choice behavior to infer the degree to which subjects’ behavioral choices are characterized by either of these. For validation, we correlate the parameters inferred from subjects’ behavior with standard, questionnaire-based measures of hopelessness and anhedonia, and finally use the inferred parameters alone to attempt to recover the diagnostic classification. 1 Core concepts: helplessness and anhedonia The basic LH paradigm is explained in figure 1. Its importance is manifold: the effect of inescapable shock on subsequent learning is sensitive to most classes of clinically effective antidepressants; it has arguably been a motivation framework for the development of the main talking therapies for depression (cognitive behavioural therapy, Williams (1992), it has motivated the development of further, yet more specific animal models (Willner, 1997), and it has been the basis of very specific research into the cognitive basis of depression (Peterson et al., 1993). Behavioral control is the central concept in LH: yoked and master rat do not differ in terms of the amount of shock (stress) they have experienced, only in terms of the behavioural control over it. It is not a standard notion in reinforcement learning, and there are several ways one could translate the concept into RL terms. At a simple level, there is intuitively more behavioural control if, when repeating one action, the same outcome occurs again and again, than if this were not true. Thus, at a very first level, control might be related to the outcome entropy of actions (see Maier and Seligman 1976 for an early formulation). Of course, this is too simple. If all available actions deterministically led to the same outcome, the agent has very little control. Finally, if one were able to achieve all outcomes except for the one one cares about (in the rats’ case switching off or avoiding the shock), we would again not say that there is much control (see Huys (2007); Huys and Dayan (2007) for a more detailed discussion). Despite its obvious limitations, we will here concentrate on the simplest notion for reasons of mathematical expediency. 2 0.6 0.5 Exploration vs Exploitation Predictive Distributions Q(aknown)−Q(aunknown) P(reward a known ) 0.7 2 0 1 2 3 4 5 0.4 0.3 0.2 Choose blue slot machine 0.5 0 −0.5 0.1 0 1 1 2 3 4 5 Reward −1 Choose orange slot machine 1 High control Low control 2 3 4 5 Tree depth Figure 2: Effect of γ on predictions, Q-values and exploration behaviour. Assume a slot machine (blue) has been chosen five times, with possible rewards 1-5, and that reward 2 has been obtained twice, and reward 4 three times (inset in left panel). Left: Predictive distribution for a prior with negative γ (low control) in light gray, and large γ (extensive control) in dark gray. We see that, if the agent believes he has much control (and outcome distributions have low entropy), the predictive distribution puts all mass on the observations. Right: Assume now the agent gets up to 5 more pulls (tree depth 1-5) between the blue slot machine and a new, orange slot machine. The orange slot machine’s predictive distribution is flat as it has never been tried, and its expected value is therefore 3. The plot shows the difference between the values for the two slot machines. First consider the agent only has one more pull to take. In this case, independently of the priors about control, the agent will choose the blue machine, because it is just slightly better than average. Note though that the difference is more pronounced if the agent has a high control prior. But things change if the agent has two or more choices. Now, it is worth trying out the new machine if the agent has a high-control prior. For in that case, if the new machine turns out to yield a large reward on the first try, it is likely to do so again for the second and subsequent times. Thus, the prior about control determines the exploration bonus. The second central concept in current conceptions of MDD is that of reward sensitivity. Anhedonia, an inability to enjoy previously enjoyable things, is one of two symptoms necessary for the diagnosis of depression (American Psychiatric Association, 1994). A number of tasks in the literature have attempted to measure reward sensitivity behaviourally. While these generally concur in finding decreased reward sensitivity in subjects with MDD, these results need further clarification. Some studies show interactions between reward and punishment sensitivities with respect to MDD, but important aspects of the tasks are not clearly understood. For instance, Henriques et al. (1994); Henriques and Davidson (2000) show decreased resonsiveness of MDD subjects to rewards, but equally show decreased resonsiveness of healthy subjects to punishments. Pizzagalli et al. (2005) introduced an asymmetrically rewarded perceptual discrimination task and show that the rate of change of the response bias is anticorrelated with subjects’ anhedonic symptoms. Exactly how decreased reward responsivity can account for this is at pressent not clear. Great care has to be taken to disentangle these two concepts. Anhedonia and helplessness both provide good reasons for not taking an action: either because the reinforcements associated with the action are insufficient (anhedonia), or because the outcome is not judged a likely result of taking some particular action (if actions are thought to have large outcome entropy). 2 A Bayesian formulation of control We consider a scenario where subjects have no knowledge of the outcome distributions of actions, but rather learn about them. This means that their prior beliefs about the outcome distributions are not overwhelmed by the likelihood of observations, and may thus have measurable effects on their action choices. In terms of RL, this means that agents do not know the decision tree of the problem they face. Control is formulated as a prior distribution on the outcome distributions, and thereby as a prior distribution on the decision trees. The concentration parameter α of a Dirichlet process can very simply parametrise entropy, and, if used as a prior, allow for very efficient updates of the predictive distributions of actions. Let us assume we have actions A which have as outcomes rewards R, and keep count Nt (r, a) = 3 k:k < 0. Here, we included a regressor for the AGE as that was a confounding variable in our subject sample. Furthermore, if it is true that anhedonia, as expressed by the questionnaire, relates to reward sensitivity specifically, we should be able to write a similar regression for the learning rate ǫ (from equation 5) ǫ(BDIa, AGE) = θǫ BDIa + cǫ AGE + ζǫ but find that θǫ is not different from zero. Figure 4 shows the ML values for the parameters of interest (emphasized in blue in the equations) and confirms that people who express higher levels of anhedonia do indeed show less reward sensitivity, but do not differ in terms of learning rate. If it were the case that subjects with higher BDIa score were just less attentive to the task, one might also expect an effect of BDIa on learning rate. 3.2 Control Validation: The control task is new, and we first need to ascertain that subjects were indeed sensitive to main features of the task. We thus fit both a RW-learning rule (as in the previous section, but adjusted for the varying number of available actions), and the full control model. Importantly, both these models have two parameters, but only the full control model has a notion of outcome entropy, and evaluations a tree. The chance probability of subjects’ actions was 0.37, meaning that, on average, there were just under three machines on the screen. The probability of the actions under the RW-learning rule was better at 0.48, and that of the full control model 0.54. These differences are highly significant as the total number of choices is 29600. Thus, we conclude that subjects were indeed sensitive to the manipulation of outcome entropy, and that they did look ahead in a tree. Prior belief about control: Applying the procedure from the previous task to the main task, we write the main parameters of equations 2 and 4 as functions of the questionnaire measures and infer linear parameters: γ1 (BDIa, BHS, age) = χγ1 BHS + θγ1 BDIa + cγ1 AGE + ζγ1 γ2 (BDIa, BHS, age) = χγ2 BHS + θγ2 BDIa + cγ2 AGE + ζγ2 β(BDIa, BHS, age) = χβ BHS + θβ BDIa + cβ AGE + ζβ Importantly, because the BDIa scores and the BHS scores are correlated in our sample (they tend to be large for the subjects with MDD), we include the cross-terms (θγ1 , θγ2 , χγ ), as we are interested in the specific effects of BDIa on β, as before, and of BHS on γ. 6 3 control γ 2 Figure 6: Classification. Controls are shown as black dots, and depressed subjects as red crosses. The blue line is a linear classifier. Thus, the patients and controls can be approximately classified purely on the basis of behaviour. 1 0 83% correct 69% sensitivity 94% specificity −1 −2 2 4 6 8 10 12 14 16 reward sensitivity β We here infer and display two separate values γ1 and γ2 . These correspond to the level of control in the first and the second half of the experiment. In fact, to parallel the LH experiments better, the slot machines in the first 50 rooms were actually very noisy (low true γ), which means that subjects were here exposed to low levels of control just like the yoked rats in the original experiment. In the second half of the experiment on the other hand, slot machines tended to be quite reliable (high true γ). Figure 5 shows again the ML values for the parameters of interest (emphasized in blue in the equations). Again, we find that our parameter estimate are very significantly different from zero (> three standard deviations). The effect of the BHS score on the prior beliefs about control γ is much stronger in the second half than of the experiment in the first half, i.e. the effect of BHS on the prior belief about control is particularly prominent when subjects are in a high-control environment and have previously been exposed to a low-control environment. This is an interesting parallel to the learned helplessness experiments in animals. 3.3 Classification Finally we combine the two tasks. We integrate out the learning rate ǫ, which we had found not be related to the questionnaire measures (c.f. figure 4), and use the distribution over β from the first task as a prior distribution on β for the second task. We also put weak priors on γ and infer both β and γ for the second task on a subject-by-subject basis. Figure 6 shows the posterior values for γ and β for MDD and healthy subjects and the ability of a linear classifier to classify them. 4 Discussion In this paper, we have attempted to provide a specific formulation of core psychiatric concepts in reinforcement learning terms, i.e. hopelessness as a prior belief about controllability, and anhedonia as reward sensitivity. We have briefly explained how we expect these formulations to have effect in a behavioural situation, have presented a behavioral task explicitly designed to be sensitive to our formulations, and shown that people’s verbal expression of hopelessness and anhedonia do have specific behavioral impacts. Subjects who express anhedonia display insensitivity to rewards and those expressing hopelessness behave as if they had prior beliefs that outcome distributions of actions (slot machines) are very broad. Finally, we have shown that these purely behavioural measures are also predictive of their psychiatric status, in that we were able to classify patients and healthy controls purely on the basis of performance. Several aspects of this work are novel. There have been previous attempts to map aspects of psychiatric dysfunction onto specific parametrizations (Cohen et al., 1996; Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008), but we believe that our work represents the first attempt to a) apply it to MDD; b) make formal predictions about subject behavior c) present strong evidence linking anhedonia specifically to reward insensitivity across two tasks d) combine tasks to tease helplessness and anhedonia apart and e) to use the behavioral inferences for classification. The latter point is particularly important, as it will determine any potential clinical significance (Veiel, 1997). In the future, rather than cross-validating with respect to say DSM-IV criteria, it may also be important to validate measures such as ours in their own right in longitudinal studies. 7 Several important caveats do remain. First, the populations are not fully matched for age. We included age as an additional regressor and found all results to be robust. Secondly, only the healthy subjects were remunerated. However, repeating the analyses presented using only the MDD subjects yields the same results (data not shown). Thirdly, we have not yet fully mirrored the LH experiments. We have so far only tested the transfer from a low-control environment to a high-control environment. To make statements like those in animal learned helplessness experiments, the transfer from high-control to low-control environments will need to be examined, too. Fourth, the notion of control we have used is very simple, and more complex notions should certainly be tested (see Dayan and Huys 2008). Fifth, and maybe most importantly, we have so far only attempted to classify MDD and healthy subjects, and can thus not yet make any statements about the specificity of these effects with respect to MDD. Finally, it will be important to replicate these results independently, and possibly in a different modality. Nevertheless, we believe these results to be very encouraging. Acknowledgments: This work would not have been possible without the help of Sarah Hollingsworth Lisanby, Kenneth Miller and Ramin V. Parsey. We would also like to thank Nathaniel Daw and Hanneke EM Den Ouden and Ren´ Hen for invaluable discussions. Support for this work was provided by the Gatsby Charitable e Foundation (PD), a UCL Bogue Fellowship and the Swartz Foundation (QH) and a Columbia University startup grant to Kenneth Miller. References American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Association Press. Bortolotti, B., Menchetti, M., Bellini, F., Montaguti, M. B., and Berardi, D. (2008). Psychological interventions for major depression in primary care: a meta-analytic review of randomized controlled trials. Gen Hosp Psychiatry, 30(4):293–302. Cohen, J. D., Braver, T. S., and O’Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control and schizophrenia: recent developments and current challenges. Philos Trans R Soc Lond B Biol Sci, 351(1346):1515–1527. Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., and Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879. Dayan, P. and Huys, Q. J. M. (2008). Serotonin, inhibition, and negative mood. PLoS Comput Biol, 4(2):e4. Dayan, P. and Yu, A. J. (2006). Phasic norepinephrine: a neural interrupt signal for unexpected events. Network, 17(4):335– 350. DeRubeis, R. J., Gelfand, L. A., Tang, T. Z., and Simons, A. D. (1999). Medications versus cognitive behavior therapy for severely depressed outpatients: mega-analysis of four randomized comparisons. Am J Psychiatry, 156(7):1007–1013. First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002a). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Non-Patient Edition. (SCID-I/NP). Biometrics Research, New York State Psychiatric Institute. First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002b). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Patient Edition. (SCID-I/P). Biometrics Research, New York State Psychiatric Institute. Gotlib, I. H. and Hammen, C. L., editors (2002). Handbook of Depression. The Guilford Press. Henriques, J. B. and Davidson, R. J. (2000). Decreased responsiveness to reward in depression. Cognition and Emotion, 14(5):711–24. Henriques, J. B., Glowacki, J. M., and Davidson, R. J. (1994). Reward fails to alter response bias in depression. J Abnorm Psychol, 103(3):460–6. Huys, Q. J. M. (2007). Reinforcers and control. Towards a computational ætiology of depression. PhD thesis, Gatsby Computational Neuroscience Unit, UCL, University of London. Huys, Q. J. M. and Dayan, P. (2007). A bayesian formulation of behavioral control. Under Review, 0:00. Kapur, S. and Remington, G. (1996). Serotonin-dopamine interaction and its relevance to schizophrenia. Am J Psychiatry, 153(4):466–76. Kendler, K. S., Karkowski, L. M., and Prescott, C. A. (1999). Causal relationship between stressful life events and the onset of major depression. Am. J. Psychiatry, 156:837–41. Maier, S. and Seligman, M. (1976). Learned Helplessness: Theory and Evidence. Journal of Experimental Psychology: General, 105(1):3–46. Maier, S. F. and Watkins, L. R. (2005). Stressor controllability and learned helplessness: the roles of the dorsal raphe nucleus, serotonin, and corticotropin-releasing factor. Neurosci. Biobehav. Rev., 29(4-5):829–41. Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive hebbian learning. J. Neurosci., 16(5):1936–47. Moutoussis, M., Bentall, R. P., Williams, J., and Dayan, P. (2008). A temporal difference account of avoidance learning. Network, 19(2):137–160. Peterson, C., Maier, S. F., and Seligman, M. E. P. (1993). Learned Helplessness: A theory for the age of personal control. OUP, Oxford, UK. Pizzagalli, D. A., Jahn, A. L., and O’Shea, J. P. (2005). Toward an objective characterization of an anhedonic phenotype: a signal-detection approach. Biol Psychiatry, 57(4):319–327. Power, M., editor (2005). Mood Disorders: A Handbook of Science and Practice. John Wiley and Sons, paperback edition. Smith, A., Li, M., Becker, S., and Kapur, S. (2004). A model of antipsychotic action in conditioned avoidance: a computational approach. Neuropsychopharm., 29(6):1040–9. Smith, K. A., Morris, J. S., Friston, K. J., Cowen, P. J., and Dolan, R. J. (1999). Brain mechanisms associated with depressive relapse and associated cognitive impairment following acute tryptophan depletion. Br. J. Psychiatry, 174:525–9. Veiel, H. O. F. (1997). A preliminary profile of neuropsychological deficits associated with major depression. J. Clin. Exp. Neuropsychol., 19:587–603. Williams, J. and Dayan, P. (2005). Dopamine, learning, and impulsivity: a biological account of attentiondeficit/hyperactivity disorder. J Child Adolesc Psychopharmacol, 15(2):160–79; discussion 157–9. Williams, J. M. G. (1992). The psychological treatment of depression. Routledge. Willner, P. (1997). Validity, reliability and utility of the chronic mild stress model of depression: a 10-year review and evaluation. Psychopharm, 134:319–29. Yu, A. J. and Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692. 8
2 0.86925739 90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity
Author: Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani
Abstract: We consider the problem of extracting smooth, low-dimensional neural trajectories that summarize the activity recorded simultaneously from tens to hundreds of neurons on individual experimental trials. Current methods for extracting neural trajectories involve a two-stage process: the data are first “denoised” by smoothing over time, then a static dimensionality reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way, and account for spiking variability that may vary both across neurons and across time. We then present a novel method for extracting neural trajectories, Gaussian-process factor analysis (GPFA), which unifies the smoothing and dimensionality reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that GPFA provided a better characterization of the population activity than the two-stage methods. 1
3 0.83716899 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
Author: Laurent Jacob, Jean-philippe Vert, Francis R. Bach
Abstract: In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the IEDB MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non-convex methods dedicated to the same problem. 1
same-paper 4 0.8234629 224 nips-2008-Structured ranking learning using cumulative distribution networks
Author: Jim C. Huang, Brendan J. Frey
Abstract: Ranking is at the heart of many information retrieval applications. Unlike standard regression or classification in which we predict outputs independently, in ranking we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank. 1
5 0.68215537 231 nips-2008-Temporal Dynamics of Cognitive Control
Author: Jeremy Reynolds, Michael C. Mozer
Abstract: Cognitive control refers to the flexible deployment of memory and attention in response to task demands and current goals. Control is often studied experimentally by presenting sequences of stimuli, some demanding a response, and others modulating the stimulus-response mapping. In these tasks, participants must maintain information about the current stimulus-response mapping in working memory. Prominent theories of cognitive control use recurrent neural nets to implement working memory, and optimize memory utilization via reinforcement learning. We present a novel perspective on cognitive control in which working memory representations are intrinsically probabilistic, and control operations that maintain and update working memory are dynamically determined via probabilistic inference. We show that our model provides a parsimonious account of behavioral and neuroimaging data, and suggest that it offers an elegant conceptualization of control in which behavior can be cast as optimal, subject to limitations on learning and the rate of information processing. Moreover, our model provides insight into how task instructions can be directly translated into appropriate behavior and then efficiently refined with subsequent task experience. 1
6 0.67747819 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
7 0.67070556 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints
8 0.66337067 101 nips-2008-Human Active Learning
9 0.66279018 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
10 0.66193056 244 nips-2008-Unifying the Sensory and Motor Components of Sensorimotor Adaptation
11 0.66165078 159 nips-2008-On Bootstrapping the ROC Curve
12 0.66050667 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
13 0.65917706 10 nips-2008-A rational model of preference learning and choice prediction by children
14 0.65863264 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
15 0.65855008 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
16 0.65809751 106 nips-2008-Inferring rankings under constrained sensing
17 0.65794027 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning
18 0.65754312 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries
19 0.65730047 202 nips-2008-Robust Regression and Lasso
20 0.65629041 194 nips-2008-Regularized Learning with Networks of Features