nips nips2012 nips2012-252 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Claudio Gentile, Francesco Orabona
Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. [sent-6, score-0.332]
2 We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. [sent-8, score-0.206]
3 Typical feedback in such systems is the actual action of the user or, in particular, what books he has bought/preferred, if any. [sent-14, score-0.326]
4 The system cannot observe what the user's actions would have been had other books been recommended, or had the same book ads been placed in a different order within the webpage. [sent-15, score-0.162]
5 , the user’s action for each and every possible book recommendation placed in the largest banner ad), in the partial feedback setting, the system only observes the response to very limited options and, specifically, the option that was actually recommended. [sent-19, score-0.391]
6 Moreover, it is often plausible to interpret the user feedback as a preference (if any) restricted to the displayed alternatives. [sent-23, score-0.265]
7 We consider instantiations of this problem in the multilabel and learning-to-rank settings. [sent-24, score-0.206]
8 Learning proceeds in rounds: at each time step t the algorithm receives an instance xt and outputs an ordered subset Ŷt of labels from a finite set of possible labels [K] = {1, 2, . . . , K}. [sent-25, score-0.41]
9 The set Ŷt corresponds to the aforementioned recommendations, and is intended to approximate the true set of preferences associated with xt . [sent-32, score-0.243]
10 In its stead, the algorithm receives Ŷt ∩ Yt , where Yt ⊆ [K] is a noisy version of the true set of user preferences on xt . [sent-34, score-0.357]
11 When we are restricted to |Ŷt | = 1 for all t, this becomes a multiclass classification problem with bandit feedback – see below. [sent-35, score-0.369]
12 This paper lies at the intersection between online learning with partial feedback and multilabel classification/ranking. [sent-37, score-0.489]
13 A well-known and standard tool for facing the problem of partial feedback in online learning is to trade off exploration and exploitation through upper confidence bounds [16]. [sent-40, score-0.325]
14 In the so-called bandit setting with contextual information (sometimes called bandits with side information or bandits with covariates, e. [sent-41, score-0.23]
15 , the actual value of the loss on the chosen action, some information on more profitable directions on the action space, noisy versions thereof, etc. [sent-49, score-0.162]
16 The overall goal is to compete against classes of functions that map contexts to (expected) losses in a regret sense, that is, to obtain sublinear cumulative regret bounds. [sent-50, score-0.4]
17 Linear multiclass classification problems with bandit feedback are considered in, e. [sent-55, score-0.351]
18 , [4, 11, 13], where either T^{2/3} or T^{1/2} or even logarithmic regret bounds are proven, depending on the noise model and the underlying loss functions. [sent-57, score-0.262]
19 None of the above papers considers structured action spaces, where the learner is allowed to select sets of actions, which is more suitable for multilabel and ranking problems. [sent-58, score-0.403]
20 The general problem of online minimization of a submodular loss function under both full and bandit information without covariates is considered in [10], achieving a T^{2/3} regret in the bandit case. [sent-60, score-0.565]
21 Their problem has motivations similar to ours but, again, the bandit version of their algorithm does not explicitly take side information into account, and leads to a T^{2/3} regret bound. [sent-66, score-0.267]
22 Among other things, the authors prove a T^{1/2} regret bound in the bandit setting with a multiplicative weight updating scheme. [sent-68, score-0.248]
23 [19] uses a simple linear model for the hidden utility function of users interacting with a web system and providing partial feedback in any form that allows the system to make significant progress in learning this function. [sent-74, score-0.29]
24 It is experimentally argued that this feedback is typically made available by a user that clicks on relevant URLs out of a list presented by a search engine. [sent-76, score-0.297]
25 Finally, the recent paper [2] investigates classes of graphical models for contextual bandit settings that afford richer interaction between contexts and actions, leading again to a T^{2/3} regret bound. [sent-78, score-0.399]
26 The literature on multilabel learning and learning to rank is overwhelming. [sent-79, score-0.242]
27 We investigate the multilabel and learning-to-rank problems in a partial feedback scenario with contextual information, where we assume a probabilistic linear model over the labels, although the contexts can be chosen by an adaptive adversary. [sent-91, score-0.516]
28 We consider two families of loss functions: one is a cost-sensitive multilabel loss that generalizes the standard Hamming loss in several respects; the other is a kind of (unnormalized) ranking loss. [sent-92, score-0.708]
29 In both cases, the learning algorithm maintains a (generalized) linear predictor for the probability that a given label occurs, the ranking being produced by upper-confidence-corrected estimated probabilities. [sent-93, score-0.207]
30 In such settings, we prove T^{1/2} log T cumulative regret bounds — these bounds are optimal, up to log factors, when the label probabilities are fully linear in the contexts. [sent-94, score-0.231]
31 A distinguishing feature of our user feedback model is that, unlike previous papers (e. [sent-95, score-0.268]
32 Furthermore, when operating on structured action spaces this more traditional form of bandit model does not seem appropriate to capture the typical user preference feedback. [sent-99, score-0.232]
33 Our approach is based on having the loss decouple from the label generating model, the user feedback being a noisy version of the gradient of a surrogate convex loss associated with the model itself. [sent-100, score-0.525]
34 Though the emphasis is on theoretical results, we also validate our algorithms on two real-world multilabel datasets w. [sent-102, score-0.224]
35 The loss suffered by the algorithm may take into account several things: the distance between Yt and Ŷt (both viewed as sets), as well as the cost for playing Ŷt . [sent-107, score-0.192]
36 Given cost values c(s, s′) ≥ 0 for all s, s′ ∈ [K], we consider the loss function ℓa,c (Yt , Ŷt ) = a |Yt \ Ŷt | + (1 − a) Σ_{i ∈ Ŷt \ Yt} c(ji , |Ŷt |), where ji is the position of class i in Ŷt , and c(ji , ·) depends on Ŷt only through its size |Ŷt |. [sent-115, score-0.229]
37 The second term collects the loss contribution provided by all false positive classes, taking into account through the costs c(ji , |Ŷt |) the order in which labels occur in Ŷt . [sent-117, score-0.236]
38 E.g., the cost for mistakenly playing class 4 in the top slot of Ŷt is more damaging than mistakenly playing class 6 in the third slot. [sent-126, score-0.15]
39 In the special case when all costs are unitary, there is no longer any need to view Ŷt as an ordered collection, and the above loss reduces to a standard Hamming-like loss between the sets Yt and Ŷt , i.e., a |Yt \ Ŷt | + (1 − a) |Ŷt \ Yt |. [sent-127, score-0.352]
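To make the cost-sensitive loss concrete, the following is a minimal sketch (not taken from the paper) of how ℓa,c could be computed; the function name loss_ac and the cost argument are illustrative placeholders for c(·, ·).

```python
def loss_ac(Y_true, Y_hat, a, cost):
    """l_{a,c}: a * |Y_t \\ Yhat_t| + (1 - a) * sum of cost(j_i, |Yhat_t|)
    over false positives i in Yhat_t \\ Y_t, j_i being i's position in Yhat_t."""
    Y_true = set(Y_true)
    size = len(Y_hat)
    false_negatives = len(Y_true - set(Y_hat))      # relevant classes that were not played
    fp_cost = sum(cost(pos, size)                   # position-dependent cost of false positives
                  for pos, label in enumerate(Y_hat, start=1)
                  if label not in Y_true)
    return a * false_negatives + (1.0 - a) * fp_cost

# With unit costs the loss reduces to the Hamming-like case a*|Y \ Yhat| + (1-a)*|Yhat \ Y|.
print(loss_ac({1, 3, 5}, [3, 4, 6], a=0.5, cost=lambda j, s: 1.0))  # 0.5*2 + 0.5*2 = 2.0
```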
40 Notice that the partial feedback Ŷt ∩ Yt allows the algorithm to know which of the chosen classes in Ŷt are good or bad (and to what extent, because of the selected ordering within Ŷt ). [sent-130, score-0.358]
41 Working with the above loss function makes the algorithm’s output Ŷt become a ranked list of classes, where ranking is restricted to the deemed relevant classes only. [sent-132, score-0.417]
42 In our setting, only relevance feedback among the selected classes is observed (the set Ŷt ∩ Yt ), but no supervised ranking information (e. [sent-133, score-0.392]
43 Alternatively, we can think of a ranking framework where restrictions on the size of Ŷt are set by an exogenous (and possibly time-varying) parameter of the problem, and the algorithm is required to provide a ranking complying with these restrictions. [sent-136, score-0.291]
44 More on the connection to the ranking setting with partial feedback is in Section 4. [sent-137, score-0.384]
45 Let Pt (·) be a shorthand for the conditional probability Pt (· | xt ), where the side information vector xt can in principle be generated by an adaptive adversary as a function of the past. [sent-145, score-0.43]
46 , yK,t | xt ), where the marginals Pt (yi,t = 1) satisfy Pt (yi,t = 1) = g(−ui xt ) / ( g(ui xt ) + g(−ui xt ) ), i = 1, . . . [sent-152, score-0.86]
47 The function g above will be instantiated to the negative derivative of a suitable convex and nonincreasing loss function L which our algorithm will be based upon. [sent-166, score-0.177]
48 For instance, if L is the square loss L(∆) = (1 − ∆)²/2, then g(∆) = 1 − ∆, resulting in Pt (yi,t = 1) = (1 + ui xt )/2, under the assumption D = [−1, 1]. [sent-167, score-0.433]
49 If L is the logistic loss L(∆) = ln(1 + e^{−∆}), then g(∆) = (e^{∆} + 1)^{−1}, and Pt (yi,t = 1) = e^{ui xt}/(e^{ui xt} + 1), with domain D = R. [sent-168, score-0.717]
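As a quick illustration of the two instantiations above, the sketch below (not from the paper) evaluates the marginal probability g(−ui xt)/(g(ui xt) + g(−ui xt)) for the square-loss and logistic-loss choices of g; the variable margin stands for the inner product ui xt.

```python
import math

def prob_from_g(g, margin):
    # P_t(y_i = 1) = g(-u_i x_t) / (g(u_i x_t) + g(-u_i x_t))
    return g(-margin) / (g(margin) + g(-margin))

g_square = lambda d: 1.0 - d                      # from L(d) = (1 - d)^2 / 2, D = [-1, 1]
g_logistic = lambda d: 1.0 / (math.exp(d) + 1.0)  # from L(d) = ln(1 + e^-d), D = R

margin = 0.4                                      # plays the role of u_i x_t
print(prob_from_g(g_square, margin))              # (1 + 0.4)/2 = 0.7
print(prob_from_g(g_logistic, margin))            # e^0.4/(e^0.4 + 1) ≈ 0.599
```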
50 the generation of labels Yt , conditioned on both xt , and all previous x and Y . [sent-173, score-0.253]
51 In turn, Y*s,t can be computed just by sorting classes i ∈ [K] in decreasing order of pi,t , the sequence Y*s,t being given by the first s classes in this sorted list. [sent-186, score-0.236]
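A one-line illustration (an assumed list/index representation, not the paper's code) of the computation just described: sort the classes by their marginal probabilities and keep the first s.

```python
def bayes_optimal_prefix(p, s):
    # Y*_{s,t}: the s classes with the largest marginal probabilities p_{i,t}
    return sorted(range(len(p)), key=lambda i: -p[i])[:s]

print(bayes_optimal_prefix([0.2, 0.9, 0.6, 0.1], 2))  # [1, 2]
```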
52 It is Yt that we interpret as the true set of user preferences on xt . [sent-204, score-0.303]
53 We would like to compete against the above Yt∗ in a cumulative regret sense, i. [sent-205, score-0.172]
54 Observe that the expected loss function (3) is, generally speaking, nonconvex in the margins ∆i,t (consider, for instance, the logistic case g(∆) = (e^{∆} + 1)^{−1}). [sent-215, score-0.257]
55 For i ∈ [K], set Δ̂i,t = xt ŵi,t , where ŵi,t = wi,t if wi,t xt ∈ [−R, R], and ŵi,t = wi,t − ((wi,t xt − R sign(wi,t xt )) / (xt Ai,t−1^{−1} xt )) Ai,t−1^{−1} xt otherwise. [sent-233, score-1.075]
56 Output Ŷt = argmin over ordered subsets Y = (j1 , . . . , j|Y | ) ⊆ [K] of Σ_{i∈Y} [ c(ji , |Y |) − ( a/(1 − a) + c(ji , |Y |) ) p̂i,t ], where p̂i,t = g(−[Δ̂i,t + εi,t ]D ) / ( g([Δ̂i,t + εi,t ]D ) + g(−[Δ̂i,t + εi,t ]D ) ) and ε²i,t = xt Ai,t−1^{−1} xt ( U² + (d c″L/(c′L)²) ln(1 + (t−1)/d) + (12/c′L) L(−R) ln(K(t+4)/δ) ). [sent-237, score-0.976]
57 For i ∈ [K], update Ai,t = Ai,t−1 + |si,t | xt xt⊤ and wi,t+1 = wi,t − (1/c′L) Ai,t^{−1} ∇i,t , where si,t = 1 if i ∈ Ŷt ∩ Yt , si,t = −1 if i ∈ Ŷt \ Yt = Ŷt \ (Ŷt ∩ Yt ), si,t = 0 otherwise, and ∇i,t = ∇w L(si,t w xt )|w=ŵi,t = −g(si,t Δ̂i,t ) si,t xt . [sent-239, score-0.645]
58 Figure 1: The partial feedback algorithm in the (ordered) multiple label setting. [sent-240, score-0.301]
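The sketch below is a simplified, illustrative rendering of one round of a Figure-1 style update, assuming numpy arrays and A[i] initialized to the identity. The confidence width eps, the rule for choosing the size of Ŷt, and the clipping/projection onto the domain D are placeholders (alpha and c_prime are hypothetical tuning constants), not the paper's exact expressions.

```python
import numpy as np

def one_round(x, w, A, Y_true, g, alpha=1.0, c_prime=1.0):
    """One bandit round: w is a (K, d) weight matrix, A a list of K (d, d) PSD matrices."""
    K = w.shape[0]
    scores = np.empty(K)
    for i in range(K):
        Ainv_x = np.linalg.solve(A[i], x)
        eps = alpha * np.sqrt(x @ Ainv_x)           # placeholder upper-confidence width
        m = x @ w[i] + eps                          # optimistic margin estimate
        scores[i] = g(-m) / (g(m) + g(-m))          # upper-confidence-corrected p_hat
    # Play classes in decreasing score order; here the set size comes from a simple threshold.
    Y_hat = [int(i) for i in np.argsort(-scores) if scores[i] > 0.5] or [int(np.argmax(scores))]
    observed = set(Y_hat) & set(Y_true)             # partial feedback: only Y_hat ∩ Y_t is seen
    for i in Y_hat:
        s = 1.0 if i in observed else -1.0          # promote / demote the played classes
        A[i] += np.outer(x, x)                      # rank-one second-order update (|s| = 1 here)
        grad = -g(s * (x @ w[i])) * s * x           # gradient of the surrogate loss
        w[i] -= (1.0 / c_prime) * np.linalg.solve(A[i], grad)
    return Y_hat
```

Across rounds, a caller would keep w and A and could pass, e.g., the g_logistic helper from the earlier sketch.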
59 3 Algorithm and regret bounds Figure 1 shows our bandit algorithm for (ordered) multiple labels. [sent-241, score-0.291]
60 For the sake of brevity, we let Δ̂i,t = xt ŵi,t and ∆i,t = ui xt , i ∈ [K]. [sent-249, score-0.505]
61 Thus the algorithm is producing a ranked list of relevant classes based on upper-confidence-corrected scores pi,t . [sent-257, score-0.16]
62 At each time step t, upon receiving the d-dimensional instance vector xt the algorithm uses the weight vectors wi,t to compute the prediction vectors ŵi,t . [sent-260, score-0.259]
63 These vectors can easily be seen as the result of projecting wi,t onto the space of w where |w xt | ≤ R w. [sent-261, score-0.215]
64 , ŵi,t = argmin_{w ∈ Rd : w xt ∈ D} di,t−1 (w, wi,t ), i ∈ [K], where di,t (u, w) = (u − w)⊤ Ai,t (u − w). [sent-266, score-0.215]
65 Next, the feedback Ŷt ∩ Yt is observed, and the algorithm in Figure 1 promotes all classes i ∈ Ŷt ∩ Yt (sign si,t = 1), demotes all classes i ∈ Ŷt \ Yt (sign si,t = −1), and leaves all remaining classes i ∉ Ŷt unchanged (sign si,t = 0). [sent-268, score-0.413]
66 On the other hand, the update Ai,t−1 → Ai,t uses the rank-one matrix xt xt⊤. [sent-270, score-0.466]
67 Then the cumulative regret RT of the algorithm in Figure 1 satisfies, with probability at least 1 − δ, RT = O( (1 − a) c″L K [ sqrt( (c″L/(c′L)²) T C d ln(1 + T/d) ) + (L(−R)/c′L) sqrt(T d) ] sqrt( ln(KT/δ) ) ), where C = O( U² + (d c″L/(c′L)²) ln(1 + T/d) ). [sent-282, score-0.342]
68 δ It is easy to see that when L(·) is the square loss L(∆) = (1 − ∆)2 /2 and D = [−1, 1], we have cL = 1/2, cL = 4 and cL = 1; when L(·) is the logistic loss L(∆) = ln(1 + e−∆ ) and x −x 1 D = [−R, R], we have cL = 1/4, cL ≤ 1 and cL = 2(1+cosh(R)) , where cosh(x) = e +e . [sent-283, score-0.42]
69 In turn, because p(∆) = g(−∆)/(g(∆) + g(−∆)) is increasing in ∆, this ordering corresponds to sorting classes in decreasing values of ∆i,t . [sent-292, score-0.17]
70 Moreover, it does so by receiving full feedback on the relevant classes at time t (since Ŷt ∩ Yt = Yt ). [sent-294, score-0.309]
71 , [6]), one can view any multilabel assignment Y = (y1 , . [sent-297, score-0.206]
72 , yK ) ∈ {0, 1}^K as a ranking among the K classes in the most natural way: i precedes j if and only if yi > yj . [sent-300, score-0.205]
73 The (unnormalized) ranking loss function ℓrank (Y, f ) between the multilabel Y and a ranking function f : R^d → R^K , representing degrees of class relevance sorted in decreasing order fj1 (xt ) ≥ fj2 (xt ) ≥ . . . [sent-301, score-0.719]
74 Hence we can use this algorithm for tackling ranking problems derived from multilabel ones, when the measure of choice is ℓrank and the feedback is full. [sent-320, score-0.584]
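For reference, the sketch below implements the usual unnormalized ranking loss between a multilabel vector y ∈ {0,1}^K and real-valued scores, counting the relevant/irrelevant pairs that the scores order the wrong way (ties counted as half a mistake); the paper's exact ℓrank may differ in such details.

```python
def rank_loss(y, scores):
    """Unnormalized ranking loss: mis-ordered pairs (i, j) with y_i > y_j, ties counting 1/2."""
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                  # i relevant, j not relevant
                if scores[i] < scores[j]:
                    loss += 1.0              # mis-ordered pair
                elif scores[i] == scores[j]:
                    loss += 0.5              # tie broken arbitrarily
    return loss

print(rank_loss([1, 0, 1, 0], [0.2, 0.9, 0.95, 0.1]))  # 1.0: relevant class 0 is below class 1
```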
75 Suppose that at each time t, the environment discloses both xt and a maximal size St for the ordered subset Ŷt = (j1 , j2 , . [sent-322, score-0.28]
76 . . . , j|Ŷt| ) (both xt and St can be chosen adaptively by an adversary). [sent-325, score-0.215]
77 Here St might be the number of available slots in a webpage or the number of URLs returned by a search engine in response to query xt . [sent-326, score-0.267]
78 Then it is plausible to compete in a regret sense against the best time-t offline ranking of the form f (xt ) = (f1 (xt ), f2 (xt ), . [sent-327, score-0.275]
79 Further, the ranking loss could be reasonably restricted to count the number of class pairs disagreeing within Ŷt plus a quantity related to the number of false negative mistakes. [sent-334, score-0.173]
80 If we put on the argmin (Step 3 in Figure 1) the further constraint |Y | ≤ St (we are still sorting classes according to decreasing values of pi,t ), one can prove the following ranking version of Theorem 2. [sent-351, score-0.284]
81 This suggests that, to some extent, we are decoupling the label generating model from the loss function under consideration. [sent-385, score-0.156]
82 nonrestricted number of classes), loss measures (ℓa,c , ℓrank,t , and the Hamming loss) and model/parameter settings (L = square loss, L = logistic loss, with varying R). [sent-390, score-0.298]
83 We used the algorithm in Figure 1 with two different loss functions, the square loss and the logistic loss, and varied the parameter R for the latter. [sent-400, score-0.439]
84 In the decreasing c scenario, we evaluated the performance of the algorithm on the loss ℓa,c that the algorithm is minimizing, but also its ability to produce meaningful (partial) rankings. [sent-408, score-0.449]
85 [Figure residue from the Sony CSL Paris panels: final average Hamming loss with constant c, running average ℓa,c with decreasing c, and ranking loss, comparing OBR, Square, and Logistic (R = 1.0).] [sent-416, score-0.167]
86 [Figure residue from the Mediamill panels: final average Hamming loss with constant c, running average ℓa,c with decreasing c, and ranking loss, for the same three algorithms.] [sent-431, score-0.289]
87 As is typical of multilabel problems, the label density, i. [sent-444, score-0.24]
88 For the constant c and ranking loss experiments we tried out different values of S, and reported the final performance. [sent-450, score-0.258]
89 We call this algorithm OBR (Online Binary Relevance), because it is a natural online adaptation of the binary relevance algorithm, widely used as a baseline in the multilabel literature. [sent-453, score-0.279]
90 OBR was used to produce subsets (as in the Hamming loss case), and restricted rankings (as in the case of ℓrank,t ). [sent-455, score-0.183]
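As a point of comparison, here is a rough sketch of an OBR-style baseline: K independent online binary learners trained with full-information label vectors, here plain SGD on the logistic loss. The class name, learning rate, and base learner are assumptions for illustration; the paper does not specify this exact implementation.

```python
import numpy as np

class OBR:
    """Online Binary Relevance: one independent binary learner per label (full information)."""
    def __init__(self, K, d, lr=0.1):
        self.W = np.zeros((K, d))
        self.lr = lr

    def predict(self, x, S=None):
        scores = self.W @ x
        if S is not None:                                    # restricted ranking of size S
            return list(np.argsort(-scores)[:S])
        return [i for i, s in enumerate(scores) if s > 0]    # subset, Hamming-style

    def update(self, x, y):                                  # y is the full {0,1}^K label vector
        for i, yi in enumerate(y):
            s = 1.0 if yi else -1.0
            p = 1.0 / (1.0 + np.exp(-s * (self.W[i] @ x)))
            self.W[i] += self.lr * (1.0 - p) * s * x         # SGD step on ln(1 + e^{-s w.x})
```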
91 Similar plots are on the right with the final average ranking losses ℓrank,t divided by S. [sent-463, score-0.157]
92 In many experiments the square loss seems to give better results. [sent-466, score-0.163]
93 An exception is the ranking loss on the Mediamill dataset (Figure 3, right). [sent-467, score-0.258]
94 We have used generalized linear models to formalize the exploration-exploitation tradeoff in a multilabel/ranking setting with partial feedback, providing T^{1/2}-like regret bounds under semi-adversarial settings. [sent-469, score-0.226]
95 Thanks to the usage of calibrated score values p̂i,t , our algorithm is capable of automatically inferring where to split the ranking between relevant and nonrelevant classes [9], the split being clearly induced by the loss parameters in ℓa,c . [sent-471, score-0.374]
96 We are planning on using more general label models that explicitly capture label correlations to be applied to other loss functions (e. [sent-472, score-0.19]
97 We are also planning on carrying out a more thorough experimental comparison, especially to full information multilabel methods that take such correlations into account. [sent-476, score-0.206]
98 On label dependence and loss minimization in multi-label classification. [sent-510, score-0.156]
99 Newtron: an efficient bandit algorithm for online multiclass prediction. [sent-543, score-0.218]
100 Improving multilabel analysis of music titles: A large-scale validation of the correction approach. [sent-587, score-0.206]
wordName wordTfidf (topN-words)
[('yt', 0.708), ('cl', 0.319), ('xt', 0.215), ('multilabel', 0.206), ('feedback', 0.187), ('ranking', 0.136), ('logistic', 0.135), ('bandit', 0.132), ('loss', 0.122), ('obr', 0.12), ('regret', 0.116), ('ji', 0.107), ('mediamill', 0.105), ('pt', 0.075), ('hamming', 0.071), ('classes', 0.069), ('sony', 0.066), ('ordered', 0.065), ('partial', 0.061), ('user', 0.06), ('csl', 0.06), ('ln', 0.058), ('ui', 0.055), ('rt', 0.048), ('dence', 0.048), ('decreasing', 0.045), ('pist', 0.045), ('final', 0.043), ('rankings', 0.043), ('costs', 0.043), ('st', 0.042), ('square', 0.041), ('action', 0.04), ('contextual', 0.04), ('books', 0.039), ('labels', 0.038), ('nonincreasing', 0.036), ('rank', 0.036), ('online', 0.035), ('receives', 0.035), ('fj', 0.035), ('paris', 0.034), ('label', 0.034), ('sorting', 0.034), ('bayes', 0.034), ('false', 0.033), ('cumulative', 0.033), ('playing', 0.032), ('multiclass', 0.032), ('placed', 0.031), ('argminy', 0.03), ('banners', 0.03), ('cosh', 0.03), ('eui', 0.03), ('mistakingly', 0.03), ('pjst', 0.03), ('rdk', 0.03), ('slots', 0.03), ('bandits', 0.029), ('rd', 0.029), ('relevant', 0.028), ('covariates', 0.028), ('preferences', 0.028), ('book', 0.027), ('slot', 0.026), ('generalized', 0.025), ('receiving', 0.025), ('notice', 0.025), ('ads', 0.024), ('francesco', 0.024), ('urls', 0.024), ('bounds', 0.024), ('recommendation', 0.024), ('fi', 0.024), ('compete', 0.023), ('ordering', 0.022), ('conditionally', 0.022), ('ranked', 0.022), ('list', 0.022), ('thereof', 0.022), ('webpage', 0.022), ('contexts', 0.022), ('system', 0.021), ('papers', 0.021), ('losses', 0.021), ('actions', 0.02), ('uk', 0.02), ('sake', 0.02), ('meant', 0.019), ('relevance', 0.019), ('algorithm', 0.019), ('suffered', 0.019), ('sorted', 0.019), ('restricted', 0.018), ('adversarial', 0.018), ('mistakes', 0.018), ('emphasis', 0.018), ('sign', 0.018), ('upper', 0.018), ('things', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
Author: Claudio Gentile, Francesco Orabona
Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1
2 0.63836223 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
Author: Mathieu Sinn, Bei Chen
Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1
3 0.37940043 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun
Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.
4 0.33410904 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks
Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic
Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1
5 0.26958129 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
Author: James Scott, Jonathan W. Pillow
Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1
6 0.2677663 293 nips-2012-Relax and Randomize : From Value to Algorithms
7 0.2338025 292 nips-2012-Regularized Off-Policy TD-Learning
8 0.22349802 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
9 0.21688068 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
10 0.1897943 324 nips-2012-Stochastic Gradient Descent with Only One Projection
11 0.15386122 283 nips-2012-Putting Bayes to sleep
12 0.15354793 303 nips-2012-Searching for objects driven by context
13 0.13810095 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
14 0.11493658 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
15 0.11357685 72 nips-2012-Cocktail Party Processing via Structured Prediction
16 0.11090796 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
17 0.10445837 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
18 0.10419806 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
19 0.10293599 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
20 0.099594399 295 nips-2012-Risk-Aversion in Multi-armed Bandits
topicId topicWeight
[(0, 0.249), (1, -0.056), (2, 0.249), (3, 0.512), (4, 0.195), (5, -0.278), (6, -0.004), (7, 0.028), (8, 0.008), (9, 0.143), (10, 0.014), (11, 0.015), (12, 0.025), (13, 0.078), (14, 0.07), (15, 0.017), (16, 0.087), (17, 0.014), (18, 0.065), (19, -0.01), (20, -0.002), (21, 0.016), (22, 0.047), (23, 0.111), (24, 0.094), (25, -0.015), (26, 0.037), (27, -0.078), (28, -0.059), (29, 0.052), (30, 0.05), (31, 0.048), (32, -0.032), (33, 0.035), (34, -0.005), (35, -0.031), (36, 0.037), (37, 0.023), (38, -0.052), (39, -0.028), (40, 0.086), (41, 0.018), (42, 0.023), (43, 0.031), (44, -0.003), (45, -0.001), (46, 0.014), (47, 0.005), (48, 0.05), (49, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.97932535 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
Author: Claudio Gentile, Francesco Orabona
Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1
2 0.91410285 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
Author: Mathieu Sinn, Bei Chen
Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1
3 0.82152498 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
Author: Liva Ralaivola
Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1
4 0.79921418 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks
Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic
Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1
5 0.75313234 283 nips-2012-Putting Bayes to sleep
Author: Dmitry Adamskiy, Manfred K. Warmuth, Wouter M. Koolen
Abstract: We consider sequential prediction algorithms that are given the predictions from a set of models as inputs. If the nature of the data is changing over time in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (kind of like a weak restart). However, what if the favored models in each segment are from a small subset, i.e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such “sparse composite models” is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. Also it is important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors. We build atop the “specialist” framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new intriguing efficient update that achieves a significantly better bound. 1
6 0.70189697 293 nips-2012-Relax and Randomize : From Value to Algorithms
7 0.67387843 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
8 0.61631805 292 nips-2012-Regularized Off-Policy TD-Learning
9 0.60611421 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
10 0.59136671 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
11 0.55719906 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies
12 0.54822171 72 nips-2012-Cocktail Party Processing via Structured Prediction
13 0.44041055 303 nips-2012-Searching for objects driven by context
14 0.42855984 11 nips-2012-A Marginalized Particle Gaussian Process Regression
16 0.39561096 324 nips-2012-Stochastic Gradient Descent with Only One Projection
17 0.38372293 222 nips-2012-Multi-Task Averaging
18 0.38116136 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
19 0.37210372 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
20 0.36901438 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
topicId topicWeight
[(0, 0.028), (21, 0.029), (29, 0.124), (36, 0.028), (38, 0.127), (39, 0.012), (42, 0.03), (53, 0.019), (54, 0.031), (55, 0.025), (74, 0.05), (76, 0.14), (80, 0.174), (92, 0.086)]
simIndex simValue paperId paperTitle
1 0.94448102 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
Author: Pietro D. Lena, Ken Nagata, Pierre F. Baldi
Abstract: Residue-residue contact prediction is a fundamental problem in protein structure prediction. However, despite considerable research efforts, contact prediction methods are still largely unreliable. Here we introduce a novel deep machine-learning architecture which consists of a multidimensional stack of learning modules. For contact prediction, the idea is implemented as a three-dimensional stack of Neural Networks NNk_ij, where i and j index the spatial coordinates of the contact map and k indexes “time”. The temporal dimension is introduced to capture the fact that protein folding is not an instantaneous process, but rather a progressive refinement. Networks at level k in the stack can be trained in supervised fashion to refine the predictions produced by the previous level, hence addressing the problem of vanishing gradients, typical of deep architectures. Increased accuracy and generalization capabilities of this approach are established by rigorous comparison with other classical machine learning approaches for contact prediction. The deep approach leads to an accuracy for difficult long-range contacts of about 30%, roughly 10% above the state-of-the-art. Many variations in the architectures and the training algorithms are possible, leaving room for further improvements. Furthermore, the approach is applicable to other problems with strong underlying spatial and temporal components. 1
same-paper 2 0.90806574 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
Author: Claudio Gentile, Francesco Orabona
Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1
3 0.88913858 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
Author: Mathieu Sinn, Bei Chen
Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1
4 0.88712686 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
5 0.88337064 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1
6 0.88270146 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model
7 0.8795284 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
8 0.87769103 200 nips-2012-Local Supervised Learning through Space Partitioning
9 0.87551379 65 nips-2012-Cardinality Restricted Boltzmann Machines
10 0.86875743 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning
11 0.86802137 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
12 0.86728108 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
13 0.86720699 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
14 0.86544156 279 nips-2012-Projection Retrieval for Classification
15 0.86499012 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
16 0.8631162 168 nips-2012-Kernel Latent SVM for Visual Recognition
17 0.86275166 293 nips-2012-Relax and Randomize : From Value to Algorithms
18 0.86220443 100 nips-2012-Discriminative Learning of Sum-Product Networks
19 0.8610068 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
20 0.86052614 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)