nips nips2012 nips2012-227 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-Jacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent of the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. [sent-5, score-0.988]
2 In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. [sent-6, score-0.233]
3 1 Introduction As bigger and more complex datasets become available, multiclass learning is becoming increasingly important in machine learning. [sent-9, score-0.249]
4 While theory and algorithms for solving binary classification problems are well established, the problem of multicategory classification is much less understood. [sent-10, score-0.21]
5 Practical multiclass algorithms often reduce the problem to a collection of binary classification problems. [sent-11, score-0.329]
6 Binary classification algorithms are often based on a relaxation approach: classification is posed as a non-convex minimization problem and then relaxed to a convex one, defined by suitable convex loss functions. [sent-12, score-0.504]
7 In this context, results in statistical learning theory quantify the error incurred by relaxation and in particular derive comparison inequalities explicitly relating the excess misclassification risk to the excess expected loss. [sent-13, score-0.558]
8 In these works the error due to relaxation is studied asymptotically and under constraints on the function class to be considered. [sent-21, score-0.196]
9 In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, in which a relaxation error analysis can be developed avoiding constraints on the considered hypotheses class. [sent-25, score-0.972]
10 Interestingly, using the simplex coding, we can naturally generalize results, proof techniques and methods from the binary case, which is recovered as a special case of our theory. [sent-27, score-0.525]
11 Due to space restrictions, in this paper we focus on extensions of the least squares and SVM loss functions, but our analysis can be generalized to a large class of simplex loss functions, including extensions of the logistic and exponential loss functions (used in boosting). [sent-28, score-1.153]
12 In Section 3 we discuss the simplex coding framework which we analyze in Section 4. [sent-32, score-0.59]
13 A classification rule is a map b : X → Y, and its error is measured by the misclassification risk R(b) = P(b(X) ≠ Y) = E(1_{[b(X)≠Y]}(X, Y)). [sent-42, score-0.226]
14 Computing the Bayes rule by directly minimizing the risk R is not possible since the probability distribution is unknown. [sent-44, score-0.16]
15 One might think of minimizing the empirical risk (ERM) RS(b) = (1/n) Σ_{i=1}^n 1_{[b(x_i)≠y_i]}, which is an unbiased estimator of R, but the corresponding optimization problem is in general not feasible. [sent-45, score-0.148]
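For concreteness, the empirical misclassification risk of a candidate rule b is trivial to evaluate; the difficulty is that minimizing it over b is a non-convex, combinatorial problem, which is what motivates the relaxation approach described next. A minimal numpy sketch (illustrative, not from the paper):

```python
import numpy as np

def empirical_risk(b, X, y):
    """Empirical misclassification risk R_S(b) = (1/n) * sum_i 1[b(x_i) != y_i]."""
    predictions = np.array([b(xi) for xi in X])
    return float(np.mean(predictions != np.asarray(y)))
```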
16 In binary classification, one of the most common ways to obtain computationally efficient methods is based on a relaxation approach. [sent-46, score-0.249]
17 We recall this approach in the next section and describe its extension to multiclass in the rest of the paper. [sent-47, score-0.249]
18 Most modern machine learning algorithms for binary classification consider a convex relaxation of the ERM functional RS . [sent-50, score-0.297]
19 More precisely: 1) the indicator function in RS is replaced by a non-negative loss V : Y × R → R_+ which is convex in the second argument and is sometimes called a surrogate loss; 2) the classification rule b is replaced by a real-valued measurable function f : X → R. [sent-51, score-0.37]
20 It often suffices to consider a special class of loss functions, namely large margin loss functions V : R → R+ of the form V (−yf (x)). [sent-53, score-0.501]
21 1 Relaxation Error Analysis As we replace the misclassification loss with a convex surrogate loss, we are effectively changing the problem: the misclassification risk is replaced by the expected loss, E(f ) = E(V (−Y f (X))) . [sent-59, score-0.381]
22 The expected loss can be seen as a functional on a large space of functions F = FV,ρ , which depend on V and ρ. [sent-60, score-0.238]
23 The question arises of the price we pay by considering a relaxation approach: “What is the relationship between f_ρ and b_ρ? [sent-62, score-0.201]
24 ” More generally, “What is the price we incur by estimating the expected risk rather than the misclassification risk? [sent-63, score-0.139]
25 ” The relaxation error for a given loss function can be quantified by the following two requirements: 1) Fisher Consistency. [sent-64, score-0.39]
26 A loss function is Fisher consistent if sign(fρ (x)) = bρ (x) almost surely (this property is related to the notion of classification-calibration [2]). [sent-65, score-0.194]
27 The excess misclassification risk and the excess expected loss are related by a comparison inequality R(sign(f)) − R(b_ρ) ≤ ψ(E(f) − E(f_ρ)), for any function f ∈ F, where ψ = ψ_{V,ρ} is a suitable function that depends on V, and possibly on the data distribution. [sent-67, score-0.486]
28 If ψ is explicitly known, then bounds on the excess expected loss yield bounds on the excess misclassification risk. [sent-71, score-0.384]
29 The relaxation error in the binary case has been thoroughly studied in [2, 14]. [sent-72, score-0.276]
30 In particular, Theorem 2 in [2] shows that if a large margin surrogate loss is convex, differentiable and decreasing in a neighborhood of 0, then the loss is Fisher consistent. [sent-73, score-0.489]
31 In particular, for the hinge loss the target function is exactly the Bayes rule and ψ(t) = |t|. [sent-75, score-0.289]
32 The comparison inequality for the square loss can be improved for a suitable class of probability distributions satisfying the so-called Tsybakov noise condition [22], ρ_X({x ∈ X : |f_ρ(x)| ≤ s}) ≤ B_q s^q, s ∈ [0, 1], q > 0. [sent-77, score-0.368]
33 In this case the comparison inequality for the square loss is given by ψ(t) = c_q t^{(q+1)/(q+2)}, see [2, 27]. [sent-79, score-0.29]
34 From a practical perspective, over the years, several computational solutions to multiclass learning have been proposed. [sent-81, score-0.249]
35 Indeed, most of the above methods can be interpreted as a kind of relaxation of the original multiclass problem. [sent-83, score-0.418]
36 Interestingly, the study in [15] suggests that the simple one-vs-all scheme should be a practical benchmark for multiclass algorithms, as it seems experimentally to achieve performance similar to or better than that of more sophisticated methods. [sent-84, score-0.249]
37 As we previously mentioned, from a theoretical perspective a general account of a large class of multiclass methods has been given in [20], building on results in [2] and [28]. [sent-85, score-0.249]
38 These results, see also [11, 23], are developed in a setting where a classification rule is found by applying a suitable prediction/decoding map to a function f : X → R^T, where f is found by considering a loss function V : Y × R^T → R_+. [sent-87, score-0.331]
39 In all the above papers relaxation is studied in terms of Fisher and Bayes consistency and the explicit form of the function ψ is not given. [sent-91, score-0.259]
40 More quantitative results in terms of explicit comparison inequalities are given in [4] (see also [19]), but these also need to impose the ”sum to zero” constraint on the considered function class. [sent-92, score-0.136]
41 3 A Relaxation Approach to Multicategory Classification In this section we propose a natural extension of the relaxation approach that avoids constraining the class of functions to be considered, and allows us to derive explicit comparison inequalities. [sent-93, score-0.285]
42 Figure 1: Decoding with the simplex coding, T = 3. [sent-95, score-0.59]
43 A coding map turns a label y ∈ Y into a code vector. [sent-98, score-0.184]
44 Note that this is what we implicitly did when treating binary classification: we encoded the label space Y = {1, 2} using the coding ±1, so that the natural decoding strategy is simply sign(f(x)). [sent-100, score-0.294]
45 The simplex coding is a map C : Y → R^{T−1}, C(y) = c_y, where the code vectors C = {c_y | y ∈ Y} ⊂ R^{T−1} satisfy: 1) ||c_y||^2 = 1, for all y ∈ Y; 2) ⟨c_y, c_{y'}⟩ = −1/(T−1), for y ≠ y' with y, y' ∈ Y; and 3) Σ_{y∈Y} c_y = 0. [sent-103, score-3.009]
46 The simplex coding has been considered in [8],[26], and [16]. [sent-108, score-0.627]
47 It corresponds to T maximally separated vectors on the hypersphere S^{T−2} in R^{T−1}, which are the vertices of a simplex (see Figure 1). [sent-109, score-0.445]
48 For binary classification it reduces to the ±1 coding and the decoding map is equivalent to taking the sign of f . [sent-110, score-0.377]
49 The decoding map has a natural geometric interpretation: an input point is mapped to a vector f(x) by a function f : X → R^{T−1}, and hence assigned to the class having the closest code vector (for y, y' ∈ Y and α ∈ R^{T−1}, we have ||c_y − α||^2 ≥ ||c_{y'} − α||^2 ⇔ ⟨c_y, α⟩ ≤ ⟨c_{y'}, α⟩). [sent-111, score-2.012]
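As a rough illustration of the decoding step (a sketch, not the authors' code; it assumes the code vectors are stored as the rows of a (T, T-1) array called codes), nearest-code-vector decoding reduces to taking the class with maximal inner product, because all code vectors have unit norm:

```python
import numpy as np

def decode(codes, fx):
    """Simplex decoding D: map fx in R^(T-1) to the class with the closest code vector.
    Since ||c_y|| = 1 for every y, minimizing ||c_y - fx|| is equivalent to maximizing <c_y, fx>."""
    return int(np.argmax(codes @ fx))
```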
50 We use the simplex coding to propose an extension of binary classification. [sent-113, score-0.67]
51 Following the binary case, the relaxation can be described in two steps: 1. [sent-114, score-0.249]
52 using the simplex coding, the indicator function is upper bounded by a non-negative loss function V : Y × R^{T−1} → R_+, such that 1_{[b(x)≠y]}(x, y) ≤ V(y, C(b(x))), for all b : X → Y, and x ∈ X, y ∈ Y; 2. [sent-115, score-0.639]
53 In the next section we discuss several loss functions satisfying the above conditions and we study in particular the extension of the least squares and SVM loss functions. [sent-117, score-0.495]
54 Several loss functions for binary classification can be naturally extended to multiple classes using the simplex coding. [sent-119, score-0.763]
55 Due to space restrictions, in this paper we focus on extensions of the least squares and SVM loss functions, but our analysis can be generalized to a large class of loss functions, including extensions of the logistic and exponential loss functions (used in boosting). [sent-120, score-0.771]
56 The Simplex Least Square loss (S-LS) is given by V(y, f(x)) = ||c_y − f(x)||^2, and reduces to the usual least squares approach to binary classification for T = 2. [sent-121, score-0.789]
57 One natural extension of the SVM's hinge loss in this setting would be to consider the Simplex Half space SVM loss (SH-SVM), V(y, f(x)) = |1 − ⟨c_y, f(x)⟩|_+. [sent-122, score-0.906]
58 We will see in the following that, while this loss function would induce efficient algorithms, in general it is not Fisher consistent unless further constraints are assumed. [sent-123, score-0.194]
59 We then consider a second loss function, the Simplex Cone SVM (SC-SVM), which is defined as V(y, f(x)) = Σ_{y'≠y} |1/(T−1) + ⟨c_{y'}, f(x)⟩|_+. [sent-125, score-0.194]
60 The latter loss function is related to the one considered in the multiclass SVM proposed in [10]. [sent-126, score-0.99]
61 We will see that it is possible to quantify the relaxation error of the loss function without requiring further constraints. [sent-127, score-0.39]
62 Both of the above SVM loss functions reduce to the binary SVM hinge loss if T = 2. [sent-128, score-0.554]
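To make the three definitions concrete, here is a hedged numpy sketch (our own illustration; codes is assumed to be a (T, T-1) array whose rows are the code vectors and y an integer label):

```python
import numpy as np

def s_ls_loss(codes, y, fx):
    """Simplex least squares: V(y, f(x)) = ||c_y - f(x)||^2."""
    return float(np.sum((codes[y] - fx) ** 2))

def sc_svm_loss(codes, y, fx):
    """Simplex cone SVM: V(y, f(x)) = sum over y' != y of |1/(T-1) + <c_{y'}, f(x)>|_+."""
    T = codes.shape[0]
    hinge = np.maximum(0.0, 1.0 / (T - 1) + codes @ fx)
    hinge[y] = 0.0  # the sum excludes the true class
    return float(np.sum(hinge))

def sh_svm_loss(codes, y, fx):
    """Simplex half-space SVM: V(y, f(x)) = |1 - <c_y, f(x)>|_+."""
    return float(max(0.0, 1.0 - float(codes[y] @ fx)))
```

With T = 2 and codes = np.array([[1.0], [-1.0]]), both SVM losses evaluate to the usual hinge loss |1 - y f(x)|_+ with y in {-1, +1}, consistent with the reduction claimed above.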
63 An SVM loss is considered in [8] where V(y, f(x)) = Σ_{y'≠y} |ε − ⟨f(x), v_{y'}(y)⟩|_+ and v_{y'}(y) = (c_y − c_{y'})/||c_y − c_{y'}||, with ε = ⟨c_y, v_{y'}(y)⟩ = (1/√2)·√(T/(T−1)). [sent-130, score-1.812]
64 More recently [26] considered the loss function V(y, f(x)) = | ||c_y − f(x)|| − ε |_+, and a simplex multiclass boosting loss was introduced in [16], in our notation V(y, f(x)) = Σ_{j≠y} e^{−⟨c_y − c_j, f(x)⟩}. [sent-131, score-2.148]
65 A classification is correct if an input (x, y) is mapped to a point f (x) that lies in the neighborhood of the vertex cy . [sent-134, score-0.523]
66 4 Relaxation Error Analysis If we consider the simplex coding, a function f taking values in R^{T−1}, and the decoding operator D, the misclassification risk can also be written as R(D(f)) = ∫_X (1 − ρ_{D(f(x))}(x)) dρ_X(x). [sent-137, score-0.621]
67 Then, following a relaxation approach, we replace the misclassification loss by the expected risk induced by one of the loss functions V defined in the previous section. [sent-138, score-0.743]
68 As in the binary case, we consider the expected loss E(f) = ∫ V(y, f(x)) dρ(x, y). [sent-139, score-0.274]
69 The following theorem studies the relaxation error for SH-SVM, SC-SVM, and S-LS loss functions. [sent-141, score-0.39]
70 For the SH-SVM, SC-SVM, and S-LS loss functions, there exists a p such that E : L^p(X, ρ_X) → R_+ is convex and continuous. [sent-143, score-0.242]
71 Table 1: conv(C) is the convex hull of the set C defined in (1); for each loss it reports p, K, f_ρ, C_T, and α. SH-SVM: p = 1, K = conv(C), f_ρ = c_{b_ρ}, C_T = T − 1, α = 1. SC-SVM: p = 1, K = R^{T−1}, f_ρ = c_{b_ρ}, C_T = T − 1, α = 1. S-LS: p = 2, K = R^{T−1}, f_ρ = Σ_{y∈Y} ρ_y c_y, C_T = √(2(T−1)/T), α = 1/2. [sent-150, score-0.561]
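Combining Table 1 with the binary template above, the comparison inequalities of Theorems 1 and 2 presumably take the form below; the constants are our reading of Table 1 (an assumption, to be checked against the theorems), and setting T = 2 recovers the binary constants ψ(t) = t for the hinge-type losses and ψ(t) = √t for least squares.

```latex
% Assumed form of the multiclass comparison inequality, constants read off Table 1.
R(D(f)) - R(D(f_\rho)) \;\le\; C_T\,\bigl(\mathcal{E}(f) - \mathcal{E}(f_\rho)\bigr)^{\alpha},
\qquad
\begin{cases}
C_T = T-1,\ \alpha = 1 & \text{(SC-SVM, SH-SVM)}\\
C_T = \sqrt{2(T-1)/T},\ \alpha = 1/2 & \text{(S-LS)}
\end{cases}
```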
72 Fix q > 0; we say that the distribution ρ satisfies the multiclass noise condition with parameter B_q if ρ_X({x ∈ X | 0 ≤ min_{j≠D(f_ρ(x))} ((T−1)/T) ⟨c_{D(f_ρ(x))} − c_j, f_ρ(x)⟩ ≤ s}) ≤ B_q s^q, (2) where s ∈ [0, 1]. [sent-155, score-0.249]
73 If a distribution ρ is characterized by a very large q, then, for each x ∈ X, f_ρ(x) is arbitrarily close to one of the coding vectors. [sent-156, score-0.178]
74 The comparison inequalities given in Theorems 1 and 2 can be used to derive generalization bounds on the excess misclassification risk. [sent-166, score-0.16]
75 5 Computational Aspects and Regularization Algorithms The simplex coding framework allows us to extend batch and online kernel methods to the multiclass setting. [sent-171, score-0.684]
76 We begin by noting that the simplex coding can be easily computed via the recursion C_{[i+1]} = [[1, u], [v, √(1 − 1/i^2)·C_{[i]}]], C_{[2]} = [1 −1], where u = (−1/i, · · · , −1/i) is a row vector in R^i and v = (0, . . . , 0)^T is a column vector in R^{i−1}. [sent-173, score-0.59]
77 The above algorithm stems from the observation that the simplex in R^{T−1} can be obtained by projecting the simplex in R^T onto the hyperplane orthogonal to the element (1, . . . , 1). [sent-182, score-0.918]
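The recursion is easy to implement; below is a small numpy sketch (our own, not the authors' code) that returns the (T, T-1) matrix whose rows are the code vectors and checks the three defining properties. Equivalently, one could center and rescale the standard basis of R^T and express it in an orthonormal basis of the hyperplane orthogonal to (1, . . . , 1), as noted above.

```python
import numpy as np

def simplex_codes(T):
    """Simplex coding via the recursion C_[i+1] = [[1, u], [v, sqrt(1 - 1/i^2) * C_[i]]],
    starting from C_[2] = [1, -1]; the columns of C_[T] are the T code vectors in R^(T-1)."""
    C = np.array([[1.0, -1.0]])
    for i in range(2, T):
        u = np.full((1, i), -1.0 / i)           # row vector (-1/i, ..., -1/i)
        v = np.zeros((i - 1, 1))                # column vector of zeros
        C = np.block([[np.ones((1, 1)), u],
                      [v, np.sqrt(1.0 - 1.0 / i ** 2) * C]])
    return C.T                                   # rows: c_1, ..., c_T

codes = simplex_codes(4)
assert np.allclose(codes.sum(axis=0), 0)                          # code vectors sum to zero
assert np.allclose(np.linalg.norm(codes, axis=1), 1)              # unit norm
assert np.allclose(codes @ codes.T,
                   np.eye(4) - (np.ones((4, 4)) - np.eye(4)) / 3)  # <c_y, c_y'> = -1/(T-1)
```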
78 We consider regularized methods of the form (1), induced by simplex loss functions, where the hypothesis space is a vector-valued reproducing kernel Hilbert space H (VV-RKHS) and the regularizer is the corresponding norm ||f||_H^2. [sent-187, score-0.772]
79 S-RLS is obtained by substituting the simplex least square loss in the Tikhonov functional. [sent-201, score-0.678]
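For intuition, here is a minimal batch S-RLS sketch under the common assumption that the vector-valued kernel is a scalar kernel times the identity, so the representer coefficients solve one linear system shared by all T-1 output coordinates (variable names are ours):

```python
import numpy as np

def srls_fit(K, y, codes, lam):
    """Batch S-RLS: solve (K + n*lam*I) A = Y_c, where row i of Y_c is the code vector c_{y_i}.
    K is the (n, n) Gram matrix of a scalar kernel; the returned A has shape (n, T-1)."""
    n = K.shape[0]
    Yc = codes[np.asarray(y)]
    return np.linalg.solve(K + n * lam * np.eye(n), Yc)

def srls_predict(K_test, A, codes):
    """K_test: (m, n) kernel evaluations between test and training points."""
    F = K_test @ A                          # estimated f(x) in R^(T-1) for each test point
    return np.argmax(F @ codes.T, axis=1)   # nearest-code-vector decoding
```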
80 For the SC-SVM the coefficients in the representer theorem are given by a_i = −Σ_{y≠y_i} α_i^y c_y, i = 1, . . . , n, [sent-223, score-0.536]
81 where G_{y,y'} = ⟨c_y, c_{y'}⟩ for all y, y' ∈ Y, C_0 = 1/(2nλ), and α_i = (α_i^y)_{y∈Y} ∈ R^T, for i = 1, . . . , n. [sent-235, score-0.952]
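Assuming the dual variables are arranged as an (n, T) array alpha (an illustrative layout, not necessarily the one used in the paper), the representer coefficients a_i above can be assembled as follows:

```python
import numpy as np

def sc_svm_coefficients(alpha, y, codes):
    """Form a_i = - sum over y' != y_i of alpha_i^{y'} c_{y'}; returns an (n, T-1) array."""
    A = np.array(alpha, dtype=float, copy=True)
    A[np.arange(len(y)), np.asarray(y)] = 0.0    # drop the true-class terms
    return -A @ codes                            # codes: (T, T-1), rows are code vectors
```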
82 Here, we omit this derivation and observe instead that if we neglect the convex hull constraint from Theorem 1, which requires f(x) ∈ co(C) for almost all x ∈ X, then the SH-SVM has an especially simple formulation at the price of losing consistency guarantees. [sent-241, score-0.165]
83 In fact, in this case the coefficients are given by a_i = α_i c_{y_i}, i = 1, . . . , n. [sent-242, score-0.126]
84 The latter formulation can be solved at the same complexity as the binary SVM (worst case O(n^3)) but lacks consistency. [sent-255, score-0.182]
85 Online/Incremental Optimization The regularized estimators induced by the simplex loss functions can be computed by means of online/incremental first-order (sub)gradient methods. [sent-256, score-0.772]
86 The algorithm depends on the loss function used through the computation of the (point-wise) subgradient ∂(V). [sent-260, score-0.194]
87 The latter can be easily computed for all the loss functions previously discussed. [sent-261, score-0.272]
88 For the S-LS loss we have ∂(V(y_i, f_W(x_i))) = −2(c_{y_i} − W x_i)x_i^T, while for the SC-SVM loss we have ∂(V(y_i, f_W(x_i))) = (Σ_{k∈I_i} c_k)x_i^T, where I_i = {y ≠ y_i | ⟨c_y, W x_i⟩ > −1/(T−1)}. [sent-262, score-0.991]
89 For the SH-SVM loss we have ∂(V(y_i, f_W(x_i))) = −c_{y_i} x_i^T if ⟨c_{y_i}, W x_i⟩ < 1, and 0 otherwise. [sent-263, score-0.378]
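Putting the subgradients above together, a hedged sketch of one online/incremental step for a linear model f_W(x) = Wx (an illustrative update; the step-size schedule, averaging, and regularization handling of the actual algorithm may differ):

```python
import numpy as np

def online_step(W, x, y, codes, lam, eta, loss="s-ls"):
    """One (sub)gradient step for f_W(x) = W x with an added Tikhonov term lam * ||W||_F^2."""
    T = codes.shape[0]
    fx = W @ x
    if loss == "s-ls":                      # V = ||c_y - W x||^2
        grad = -2.0 * np.outer(codes[y] - fx, x)
    elif loss == "sc-svm":                  # V = sum over y' != y of |1/(T-1) + <c_{y'}, W x>|_+
        active = codes @ fx > -1.0 / (T - 1)
        active[y] = False
        grad = np.outer(codes[active].sum(axis=0), x)
    else:                                   # "sh-svm": V = |1 - <c_y, W x>|_+
        grad = -np.outer(codes[y], x) if codes[y] @ fx < 1.0 else np.zeros_like(W)
    return W - eta * (grad + 2.0 * lam * W)
```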
90 For explicit p-dimensional feature maps the cost is O(np^2), so that the cost of computing the regularization path for the simplex RLS algorithm is O(min(n^3, np^2)) and hence independent of T. [sent-267, score-0.521]
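The T-independence of the path cost can be seen in a short sketch (hypothetical helper, assuming a precomputed Gram matrix): one eigendecomposition of K is reused for every value of λ, and the number of classes only enters through the (n, T-1) right-hand side:

```python
import numpy as np

def srls_path(K, Yc, lambdas):
    """Solve (K + n*lam*I) A = Yc for several values of lam, reusing one eigendecomposition.
    K: (n, n) Gram matrix; Yc: (n, T-1) matrix whose i-th row is the code vector c_{y_i}."""
    n = K.shape[0]
    evals, U = np.linalg.eigh(K)             # O(n^3), computed once
    UtY = U.T @ Yc
    return [U @ (UtY / (evals + n * lam)[:, None]) for lam in lambdas]
```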
91 Simplex SVMs can be solved using solvers available for binary SVMs, which are considered to have complexity O(n^γ) with γ ∈ {2, 3} (the actual complexity scales with the number of support vectors). [sent-269, score-0.215]
92 The SH-SVM in which we omit the constraint can be trained with the same complexity as the binary SVM (worst case O(n^3)) but lacks consistency. [sent-271, score-0.148]
93 The online algorithms induced by the different simplex loss functions are essentially the same. [sent-273, score-0.758]
94 We compare the performance of our algorithms to one-versus-all SVM (libsvm), as well as simplex-based boosting [16]. [sent-276, score-0.244]
95 The methods compared are SC-SVM Online (ho), SH-SVM Online (ho), S-LS Online (ho), S-LS Batch (loo), S-LS rbf Batch (loo), SVM batch ova (ho), SVM rbf batch ova (ho), and Simplex boosting [16], evaluated on datasets including Landsat. [sent-282, score-0.439]
96 More generally, we see that rbf S-LS has the best performance amongst the simplex methods, including simplex boosting [16]. [sent-339, score-1.029]
97 Reducing multiclass to binary: a unifying approach for margin classifiers. [sent-344, score-0.318]
98 Consistency of multiclass empirical risk minimization methods based on convex loss. [sent-360, score-0.404]
99 On the algorithmic implementation of multiclass kernel-based vector machines. [sent-365, score-0.249]
100 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-493, score-0.203]
wordName wordTfidf (topN-words)
[('cy', 0.476), ('simplex', 0.445), ('multiclass', 0.249), ('rt', 0.227), ('loss', 0.194), ('relaxation', 0.169), ('misclassi', 0.15), ('coding', 0.145), ('svm', 0.139), ('multicategory', 0.13), ('risk', 0.107), ('cyi', 0.098), ('excess', 0.095), ('classi', 0.093), ('ho', 0.087), ('binary', 0.08), ('boosting', 0.077), ('tsybakov', 0.075), ('floo', 0.073), ('fsi', 0.073), ('decoding', 0.069), ('losses', 0.069), ('margin', 0.069), ('bq', 0.068), ('fisher', 0.067), ('yf', 0.065), ('ova', 0.065), ('cation', 0.064), ('squares', 0.063), ('rbf', 0.062), ('loo', 0.06), ('pub', 0.06), ('regularized', 0.054), ('batch', 0.054), ('rule', 0.053), ('fw', 0.051), ('vy', 0.051), ('wtmp', 0.049), ('consistency', 0.048), ('kij', 0.048), ('convex', 0.048), ('vertex', 0.047), ('suitable', 0.045), ('functions', 0.044), ('sign', 0.044), ('exponent', 0.044), ('xi', 0.043), ('valued', 0.043), ('explicit', 0.042), ('hinge', 0.042), ('tong', 0.042), ('bayes', 0.041), ('extensions', 0.041), ('yi', 0.041), ('online', 0.04), ('square', 0.039), ('map', 0.039), ('rs', 0.039), ('considered', 0.037), ('hull', 0.037), ('rosasco', 0.036), ('fn', 0.036), ('complexity', 0.035), ('inequalities', 0.035), ('induced', 0.035), ('theorems', 0.035), ('maxy', 0.034), ('cone', 0.034), ('nt', 0.034), ('ct', 0.034), ('latter', 0.034), ('regularization', 0.034), ('rn', 0.033), ('lp', 0.033), ('lacks', 0.033), ('tikhonov', 0.033), ('sq', 0.033), ('appendix', 0.033), ('surrogate', 0.032), ('remark', 0.032), ('price', 0.032), ('representer', 0.032), ('erm', 0.032), ('cb', 0.032), ('ni', 0.032), ('cients', 0.031), ('leave', 0.03), ('comparison', 0.03), ('conv', 0.03), ('jmlr', 0.029), ('yoram', 0.028), ('hyperplane', 0.028), ('ls', 0.028), ('qp', 0.028), ('support', 0.028), ('versus', 0.028), ('ai', 0.028), ('inequality', 0.027), ('error', 0.027), ('minimizer', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-Jacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent of the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
2 0.27137554 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
Author: Elad Hazan, Zohar Karnin
Abstract: We present a simplex algorithm for linear programming in a linear classification formulation. The paramount complexity parameter in linear classification problems is called the margin. We prove that for margin values of practical interest our simplex variant performs a polylogarithmic number of pivot steps in the worst case, and its overall running time is near linear. This is in contrast to general linear programming, for which no sub-polynomial pivot rule is known. 1
3 0.20642208 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
4 0.16309291 200 nips-2012-Local Supervised Learning through Space Partitioning
Author: Joseph Wang, Venkatesh Saligrama
Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.
5 0.14988825 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
Author: Mert Pilanci, Laurent E. Ghaoui, Venkat Chandrasekaran
Abstract: We consider the problem of cardinality penalized optimization of a convex function over the probability simplex with additional convex constraints. The classical 1 regularizer fails to promote sparsity on the probability simplex since 1 norm on the probability simplex is trivially constant. We propose a direct relaxation of the minimum cardinality problem and show that it can be efficiently solved using convex programming. As a first application we consider recovering a sparse probability measure given moment constraints, in which our formulation becomes linear programming, hence can be solved very efficiently. A sufficient condition for exact recovery of the minimum cardinality solution is derived for arbitrary affine constraints. We then develop a penalized version for the noisy setting which can be solved using second order cone programs. The proposed method outperforms known rescaling heuristics based on 1 norm. As a second application we consider convex clustering using a sparse Gaussian mixture and compare our results with the well known soft k-means algorithm. 1
6 0.14593209 197 nips-2012-Learning with Recursive Perceptual Representations
7 0.1417647 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
8 0.13471144 16 nips-2012-A Polynomial-time Form of Robust Regression
9 0.12736414 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
10 0.10682786 280 nips-2012-Proper losses for learning from partial labels
11 0.10245191 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
12 0.098646052 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
13 0.093555257 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
14 0.09243083 188 nips-2012-Learning from Distributions via Support Measure Machines
15 0.086514145 361 nips-2012-Volume Regularization for Binary Classification
16 0.085531034 242 nips-2012-Non-linear Metric Learning
17 0.082284778 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
18 0.079454176 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
19 0.078300752 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
20 0.077123843 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
topicId topicWeight
[(0, 0.225), (1, 0.028), (2, 0.074), (3, -0.068), (4, 0.164), (5, 0.031), (6, 0.005), (7, 0.21), (8, -0.065), (9, -0.028), (10, 0.105), (11, 0.093), (12, 0.076), (13, 0.079), (14, 0.054), (15, -0.099), (16, -0.105), (17, -0.012), (18, -0.014), (19, 0.123), (20, -0.079), (21, 0.073), (22, -0.097), (23, 0.118), (24, -0.109), (25, 0.03), (26, -0.133), (27, -0.034), (28, -0.031), (29, -0.0), (30, 0.0), (31, 0.042), (32, 0.0), (33, 0.044), (34, 0.099), (35, 0.049), (36, 0.113), (37, 0.042), (38, -0.044), (39, -0.079), (40, -0.014), (41, 0.125), (42, 0.069), (43, -0.146), (44, -0.061), (45, -0.099), (46, -0.006), (47, 0.005), (48, 0.026), (49, -0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.96072596 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-Jacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent of the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
2 0.85792232 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
Author: Elad Hazan, Zohar Karnin
Abstract: We present a simplex algorithm for linear programming in a linear classification formulation. The paramount complexity parameter in linear classification problems is called the margin. We prove that for margin values of practical interest our simplex variant performs a polylogarithmic number of pivot steps in the worst case, and its overall running time is near linear. This is in contrast to general linear programming, for which no sub-polynomial pivot rule is known. 1
3 0.81412238 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
4 0.71783066 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
Author: Aharon Birnbaum, Shai S. Shwartz
Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α)L*_γ + ϵ, where L*_γ is the optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in time exp(Õ(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model. 1
5 0.71307027 361 nips-2012-Volume Regularization for Binary Classification
Author: Koby Crammer, Tal Wagner
Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks for a box of large volume that contains “simple” weight vectors which most of are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets. 1
6 0.68190902 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
7 0.68170935 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
8 0.66665095 280 nips-2012-Proper losses for learning from partial labels
9 0.63474083 217 nips-2012-Mixability in Statistical Learning
10 0.62401998 16 nips-2012-A Polynomial-time Form of Robust Regression
11 0.62282187 200 nips-2012-Local Supervised Learning through Space Partitioning
12 0.58122677 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
13 0.57039589 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
14 0.5492692 197 nips-2012-Learning with Recursive Perceptual Representations
15 0.53444195 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
16 0.52602786 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
17 0.50297821 271 nips-2012-Pointwise Tracking the Optimal Regression Function
18 0.49686027 161 nips-2012-Interpreting prediction markets: a stochastic approach
19 0.4658848 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition
20 0.46454895 30 nips-2012-Accuracy at the Top
topicId topicWeight
[(0, 0.037), (17, 0.011), (21, 0.025), (38, 0.203), (39, 0.013), (42, 0.047), (54, 0.015), (55, 0.04), (60, 0.171), (68, 0.013), (74, 0.042), (76, 0.107), (80, 0.132), (92, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.88791895 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-Jacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent of the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
2 0.88754994 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference
Author: Xue-xin Wei, Alan Stocker
Abstract: A common challenge for Bayesian models of perception is the fact that the two fundamental Bayesian components, the prior distribution and the likelihood function, are formally unconstrained. Here we argue that a neural system that emulates Bayesian inference is naturally constrained by the way it represents sensory information in populations of neurons. More specifically, we show that an efficient coding principle creates a direct link between prior and likelihood based on the underlying stimulus distribution. The resulting Bayesian estimates can show biases away from the peaks of the prior distribution, a behavior seemingly at odds with the traditional view of Bayesian estimation, yet one that has been reported in human perception. We demonstrate that our framework correctly accounts for the repulsive biases previously reported for the perception of visual orientation, and show that the predicted tuning characteristics of the model neurons match the reported orientation tuning properties of neurons in primary visual cortex. Our results suggest that efficient coding is a promising hypothesis in constraining Bayesian models of perceptual inference. 1 Motivation Human perception is not perfect. Biases have been observed in a large number of perceptual tasks and modalities, of which the most salient ones constitute many well-known perceptual illusions. It has been suggested, however, that these biases do not reflect a failure of perception but rather an observer’s attempt to optimally combine the inherently noisy and ambiguous sensory information with appropriate prior knowledge about the world [13, 4, 14]. This hypothesis, which we will refer to as the Bayesian hypothesis, has indeed proven quite successful in providing a normative explanation of perception at a qualitative and, more recently, quantitative level (see e.g. [15]). A major challenge in forming models based on the Bayesian hypothesis is the correct selection of two main components: the prior distribution (belief) and the likelihood function. This has encouraged some to criticize the Bayesian hypothesis altogether, claiming that arbitrary choices for these components always allow for unjustified post-hoc explanations of the data [1]. We do not share this criticism, referring to a number of successful attempts to constrain prior beliefs and likelihood functions based on principled grounds. For example, prior beliefs have been defined as the relative distribution of the sensory variable in the environment in cases where these statistics are relatively easy to measure (e.g. local visual orientations [16]), or where it can be assumed that subjects have learned them over the course of the experiment (e.g. time perception [17]). Other studies have constrained the likelihood function according to known noise characteristics of neurons that are crucially involved in the specific perceptual process (e.g motion tuned neurons in visual cor∗ http://www.sas.upenn.edu/ astocker/lab 1 world neural representation efficient encoding percept Bayesian decoding Figure 1: Encoding-decoding framework. A stimulus representing a sensory variable θ elicits a firing rate response R = {r1 , r2 , ..., rN } in a population of N neurons. The perceptual task is to generate a ˆ good estimate θ(R) of the presented value of the sensory variable based on this population response. Our framework assumes that encoding is efficient, and decoding is Bayesian based on the likelihood p(R|θ), the prior p(θ), and a squared-error loss function. tex [18]). 
However, we agree that finding appropriate constraints is generally difficult and that prior beliefs and likelihood functions have been often selected on the basis of mathematical convenience. Here, we propose that the efficient coding hypothesis [19] offers a joint constraint on the prior and likelihood function in neural implementations of Bayesian inference. Efficient coding provides a normative description of how neurons encode sensory information, and suggests a direct link between measured perceptual discriminability, neural tuning characteristics, and environmental statistics [11]. We show how this link can be extended to a full Bayesian account of perception that includes perceptual biases. We validate our model framework against behavioral as well as neural data characterizing the perception of visual orientation. We demonstrate that we can account not only for the reported perceptual biases away from the cardinal orientations, but also for the specific response characteristics of orientation-tuned neurons in primary visual cortex. Our work is a novel proposal of how two important normative hypotheses in perception science, namely efficient (en)coding and Bayesian decoding, might be linked. 2 Encoding-decoding framework We consider perception as an inference process that takes place along the simplified neural encodingdecoding cascade illustrated in Fig. 11 . 2.1 Efficient encoding Efficient encoding proposes that the tuning characteristics of a neural population are adapted to the prior distribution p(θ) of the sensory variable such that the population optimally represents the sensory variable [19]. Different definitions of “optimally” are possible, and may lead to different results. Here, we assume an efficient representation that maximizes the mutual information between the sensory variable and the population response. With this definition and an upper limit on the total firing activity, the square-root of the Fisher Information must be proportional to the prior distribution [12, 21]. In order to constrain the tuning curves of individual neurons in the population we also impose a homogeneity constraint, requiring that there exists a one-to-one mapping F (θ) that transforms the ˜ physical space with units θ to a homogeneous space with units θ = F (θ) in which the stimulus distribution becomes uniform. This defines the mapping as θ F (θ) = p(χ)dχ , (1) −∞ which is the cumulative of the prior distribution p(θ). We then assume a neural population with identical tuning curves that evenly tiles the stimulus range in this homogeneous space. The population provides an efficient representation of the sensory variable θ according to the above constraints [11]. ˜ The tuning curves in the physical space are obtained by applying the inverse mapping F −1 (θ). Fig. 2 1 In the context of this paper, we consider ‘inferring’, ‘decoding’, and ‘estimating’ as synonymous. 2 stimulus distribution d samples # a Fisher information discriminability and average firing rates and b firing rate [ Hz] efficient encoding F likelihood function F -1 likelihood c symmetric asymmetric homogeneous space physical space Figure 2: Efficient encoding constrains the likelihood function. a) Prior distribution p(θ) derived from stimulus statistics. b) Efficient coding defines the shape of the tuning curves in the physical space by transforming a set of homogeneous neurons using a mapping F −1 that is the inverse of the cumulative of the prior p(θ) (see Eq. (1)). 
c) As a result, the likelihood shape is constrained by the prior distribution showing heavier tails on the side of lower prior density. d) Fisher information, discrimination threshold, and average firing rates are all uniform in the homogeneous space. illustrates the applied efficient encoding scheme, the mapping, and the concept of the homogeneous space for the example of a symmetric, exponentially decaying prior distribution p(θ). The key idea here is that by assuming efficient encoding, the prior (i.e. the stimulus distribution in the world) directly constrains the likelihood function. In particular, the shape of the likelihood is determined by the cumulative distribution of the prior. As a result, the likelihood is generally asymmetric, as shown in Fig. 2, exhibiting heavier tails on the side of the prior with lower density. 2.2 Bayesian decoding Let us consider a population of N sensory neurons that efficiently represents a stimulus variable θ as described above. A stimulus θ0 elicits a specific population response that is characterized by the vector R = [r1 , r2 , ..., rN ] where ri is the spike-count of the ith neuron over a given time-window τ . Under the assumption that the variability in the individual firing rates is governed by a Poisson process, we can write the likelihood function over θ as N p(R|θ) = (τ fi (θ))ri −τ fi (θ) e , ri ! i=1 (2) ˆ with fi (θ) describing the tuning curve of neuron i. We then define a Bayesian decoder θLSE as the estimator that minimizes the expected squared-error between the estimate and the true stimulus value, thus θp(R|θ)p(θ)dθ ˆ θLSE (R) = , (3) p(R|θ)p(θ)dθ where we use Bayes’ rule to appropriately combine the sensory evidence with the stimulus prior p(θ). 3 Bayesian estimates can be biased away from prior peaks Bayesian models of perception typically predict perceptual biases toward the peaks of the prior density, a characteristic often considered a hallmark of Bayesian inference. This originates from the 3 a b prior attraction prior prior attraction likelihood repulsion! likelihood c prior prior repulsive bias likelihood likelihood mean posterior mean posterior mean Figure 3: Bayesian estimates biased away from the prior. a) If the likelihood function is symmetric, then the estimate (posterior mean) is, on average, shifted away from the actual value of the sensory variable θ0 towards the prior peak. b) Efficient encoding typically leads to an asymmetric likelihood function whose normalized mean is away from the peak of the prior (relative to θ0 ). The estimate is determined by a combination of prior attraction and shifted likelihood mean, and can exhibit an overall repulsive bias. c) If p(θ0 ) < 0 and the likelihood is relatively narrow, then (1/p(θ)2 ) > 0 (blue line) and the estimate is biased away from the prior peak (see Eq. (6)). common approach of choosing a parametric description of the likelihood function that is computationally convenient (e.g. Gaussian). As a consequence, likelihood functions are typically assumed to be symmetric (but see [23, 24]), leaving the bias of the Bayesian estimator to be mainly determined by the shape of the prior density, i.e. leading to biases toward the peak of the prior (Fig. 3a). In our model framework, the shape of the likelihood function is constrained by the stimulus prior via efficient neural encoding, and is generally not symmetric for non-flat priors. It has a heavier tail on the side with lower prior density (Fig. 3b). 
The intuition is that due to the efficient allocation of neural resources, the side with smaller prior density will be encoded less accurately, leading to a broader likelihood function on that side. The likelihood asymmetry pulls the Bayes’ least-squares estimate away from the peak of the prior while at the same time the prior pulls it toward its peak. Thus, the resulting estimation bias is the combination of these two counter-acting forces - and both are determined by the prior! 3.1 General derivation of the estimation bias In the following, we will formally derive the mean estimation bias b(θ) of the proposed encodingdecoding framework. Specifically, we will study the conditions for which the bias is repulsive i.e. away from the peak of the prior density. ˆ We first re-write the estimator θLSE (3) by replacing θ with the inverse of its mapping to the homo−1 ˜ geneous space, i.e., θ = F (θ). The motivation for this is that the likelihood in the homogeneous space is symmetric (Fig. 2). Given a value θ0 and the elicited population response R, we can write the estimator as ˜ ˜ ˜ ˜ θp(R|θ)p(θ)dθ F −1 (θ)p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) ˆ θLSE (R) = = . ˜ ˜ ˜ p(R|θ)p(θ)dθ p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) Calculating the derivative of the inverse function and noting that F is the cumulative of the prior density, we get 1 1 1 ˜ ˜ ˜ ˜ ˜ ˜ dθ = dθ. dF −1 (θ) = (F −1 (θ)) dθ = dθ = −1 (θ)) ˜ F (θ) p(θ) p(F ˆ Hence, we can simplify θLSE (R) as ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)p(R|F −1 (θ))dθ . ˜ ˜ p(R|F −1 (θ))dθ With ˜ K(R, θ) = ˜ p(R|F −1 (θ)) ˜ ˜ p(R|F −1 (θ))dθ 4 we can further simplify the notation and get ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)K(R, θ)dθ . (4) ˆ ˜ In order to get the expected value of the estimate, θLSE (θ), we marginalize (4) over the population response space S, ˆ ˜ ˜ ˜ ˜ θLSE (θ) = p(R)F −1 (θ)K(R, θ)dθdR S = F −1 ˜ (θ)( ˜ ˜ p(R)K(R, θ)dR)dθ = ˜ ˜ ˜ F −1 (θ)L(θ)dθ, S where we define ˜ L(θ) = ˜ p(R)K(R, θ)dR. S ˜ ˜ ˜ It follows that L(θ)dθ = 1. Due to the symmetry in this space, it can be shown that L(θ) is ˜0 . Intuitively, L(θ) can be thought as the normalized ˜ symmetric around the true stimulus value θ average likelihood in the homogeneous space. We can then compute the expected bias at θ0 as b(θ0 ) = ˜ ˜ ˜ ˜ F −1 (θ)L(θ)dθ − F −1 (θ0 ) (5) ˜ This is expression is general where F −1 (θ) is defined as the inverse of the cumulative of an arbitrary ˜ prior density p(θ) (see Eq. (1)) and the dispersion of L(θ) is determined by the internal noise level. ˜ ˜ Assuming the prior density to be smooth, we expand F −1 in a neighborhood (θ0 − h, θ0 + h) that is larger than the support of the likelihood function. Using Taylor’s theorem with mean-value forms of the remainder, we get 1 ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ F −1 (θ) = F −1 (θ0 ) + F −1 (θ0 ) (θ − θ0 ) + F −1 (θx ) (θ − θ0 )2 , 2 ˜ ˜ ˜ with θx lying between θ0 and θ. By applying this expression to (5), we find ˜ θ0 +h b(θ0 ) = = 1 2 ˜ θ0 −h 1 −1 ˜ ˜ ˜ ˜ ˜ 1 F (θx )θ (θ − θ0 )2 L(θ)dθ = ˜ 2 2 ˜ θ0 +h −( ˜ θ0 −h p(θx )θ ˜ ˜ 2 ˜ ˜ 1 )(θ − θ0 ) L(θ)dθ = p(θx )3 4 ˜ θ0 +h 1 ˜ − θ0 )2 L(θ)dθ ˜ ˜ ˜ ( ) ˜(θ ˜ p(F −1 (θx )) θ ( 1 ˜ ˜ ˜ ˜ ) (θ − θ0 )2 L(θ)dθ. p(θx )2 θ ˜ θ0 −h ˜ θ0 +h ˜ θ0 −h In general, there is no simple rule to judge the sign of b(θ0 ). However, if the prior is monotonic ˜ ˜ on the interval F −1 ((θ0 − h, θ0 + h)), then the sign of ( p(θ1 )2 ) is always the same as the sign of x 1 1 ( p(θ0 )2 ) . 
Also, if the likelihood is sufficiently narrow we can approximate ( p(θ1 )2 ) by ( p(θ0 )2 ) , x and therefore approximate the bias as b(θ0 ) ≈ C( 1 ) , p(θ0 )2 (6) where C is a positive constant. The result is quite surprising because it states that as long as the prior is monotonic over the support of the likelihood function, the expected estimation bias is always away from the peaks of the prior! 3.2 Internal (neural) versus external (stimulus) noise The above derivation of estimation bias is based on the assumption that all uncertainty about the sensory variable is caused by neural response variability. This level of internal noise depends on the response magnitude, and thus can be modulated e.g. by changing stimulus contrast. This contrastcontrolled noise modulation is commonly exploited in perceptual studies (e.g. [18]). Internal noise will always lead to repulsive biases in our framework if the prior is monotonic. If internal noise is low, the likelihood is narrow and thus the bias is small. Increasing internal noise leads to increasingly 5 larger biases up to the point where the likelihood becomes wide enough such that monotonicity of the prior over the support of the likelihood is potentially violated. Stimulus noise is another way to modulate the noise level in perception (e.g. random-dot motion stimuli). Such external noise, however, has a different effect on the shape of the likelihood function as compared to internal noise. It modifies the likelihood function (2) by convolving it with the noise kernel. External noise is frequently chosen as additive and symmetric (e.g. zero-mean Gaussian). It is straightforward to prove that such symmetric external noise does not lead to a change in the mean of the likelihood, and thus does not alter the repulsive effect induced by its asymmetry. However, by increasing the overall width of the likelihood, the attractive influence of the prior increases, resulting in an estimate that is closer to the prior peak than without external noise2 . 4 Perception of visual orientation We tested our framework by modelling the perception of visual orientation. Our choice was based on the fact that i) we have pretty good estimates of the prior distribution of local orientations in natural images, ii) tuning characteristics of orientation selective neurons in visual cortex are wellstudied (monkey/cat), and iii) biases in perceived stimulus orientation have been well characterized. We start by creating an efficient neural population based on measured prior distributions of local visual orientation, and then compare the resulting tuning characteristics of the population and the predicted perceptual biases with reported data in the literature. 4.1 Efficient neural model population for visual orientation Previous studies measured the statistics of the local orientation in large sets of natural images and consistently found that the orientation distribution is multimodal, peaking at the two cardinal orientations as shown in Fig. 4a [16, 20]. We assumed that the visual system’s prior belief over orientation p(θ) follows this distribution and approximate it formally as p(θ) ∝ 2 − | sin(θ)| (black line in Fig. 4b) . (7) Based on this prior distribution we defined an efficient neural representation for orientation. We assumed a population of model neurons (N = 30) with tuning curves that follow a von-Mises distribution in the homogeneous space on top of a constant spontaneous firing rate (5 Hz). 
We then ˜ applied the inverse transformation F −1 (θ) to all these tuning curves to get the corresponding tuning curves in the physical space (Fig. 4b - red curves), where F (θ) is the cumulative of the prior (7). The concentration parameter for the von-Mises tuning curves was set to κ ≈ 1.6 in the homogeneous space in order to match the measured average tuning width (∼ 32 deg) of neurons in area V1 of the macaque [9]. 4.2 Predicted tuning characteristics of neurons in primary visual cortex The orientation tuning characteristics of our model population well match neurophysiological data of neurons in primary visual cortex (V1). Efficient encoding predicts that the distribution of neurons’ preferred orientation follows the prior, with more neurons tuned to cardinal than oblique orientations by a factor of approximately 1.5. A similar ratio has been found for neurons in area V1 of monkey/cat [9, 10]. Also, the tuning widths of the model neurons vary between 25-42 deg depending on their preferred tuning (see Fig. 4c), matching the measured tuning width ratio of 0.6 between neurons tuned to the cardinal versus oblique orientations [9]. An important prediction of our model is that most of the tuning curves should be asymmetric. Such asymmetries have indeed been reported for the orientation tuning of neurons in area V1 [6, 7, 8]. We computed the asymmetry index for our model population as defined in previous studies [6, 7], and plotted it as a function of the preferred tuning of each neuron (Fig. 4d). The overall asymmetry index in our model population is 1.24 ± 0.11, which approximately matches the measured values for neurons in area V1 of the cat (1.26 ± 0.06) [6]. It also predicts that neurons tuned to the cardinal and oblique orientations should show less symmetry than those tuned to orientations in between. Finally, 2 Note, that these predictions are likely to change if the external noise is not symmetric. 6 a b 25 firing rate(Hz) 0 orientation(deg) asymmetry vs. tuning width 1.0 2.0 90 2.0 e asymmetry 1.0 0 asymmetry index 50 30 width (deg) 10 90 preferred tuning(deg) -90 0 d 0 0 90 asymmetry index 0 orientation(deg) tuning width -90 0 0 probability 0 -90 c efficient representation 0.01 0.01 image statistics -90 0 90 preferred tuning(deg) 25 30 35 40 tuning width (deg) Figure 4: Tuning characteristics of model neurons. a) Distribution of local orientations in natural images, replotted from [16]. b) Prior used in the model (black) and predicted tuning curves according to efficient coding (red). c) Tuning width as a function of preferred orientation. d) Tuning curves of cardinal and oblique neurons are more symmetric than those tuned to orientations in between. e) Both narrowly and broadly tuned neurons neurons show less asymmetry than neurons with tuning widths in between. neurons with tuning widths at the lower and upper end of the range are predicted to exhibit less asymmetry than those neurons whose widths lie in between these extremes (illustrated in Fig. 4e). These last two predictions have not been tested yet. 4.3 Predicted perceptual biases Our model framework also provides specific predictions for the expected perceptual biases. Humans show systematic biases in perceived orientation of visual stimuli such as e.g. arrays of Gabor patches (Fig. 5a,d). Two types of biases can be distinguished: First, perceived orientations show an absolute bias away from the cardinal orientations, thus away from the peaks of the orientation prior [2, 3]. 
We refer to these biases as absolute because they are typically measured by adjusting a noise-free reference until it matched the orientation of the test stimulus. Interestingly, these repulsive absolute biases are the larger the smaller the external stimulus noise is (see Fig. 5b). Second, the relative bias between the perceived overall orientations of a high-noise and a low-noise stimulus is toward the cardinal orientations as shown in Fig. 5c, and thus toward the peak of the prior distribution [3, 16]. The predicted perceptual biases of our model are shown Fig. 5e,f. We computed the likelihood function according to (2) and used the prior in (7). External noise was modeled by convolving the stimulus likelihood function with a Gaussian (different widths for different noise levels). The predictions well match both, the reported absolute bias away as well as the relative biases toward the cardinal orientations. Note, that our model framework correctly accounts for the fact that less external noise leads to larger absolute biases (see also discussion in section 3.2). 5 Discussion We have presented a modeling framework for perception that combines efficient (en)coding and Bayesian decoding. Efficient coding imposes constraints on the tuning characteristics of a population of neurons according to the stimulus distribution (prior). It thus establishes a direct link between prior and likelihood, and provides clear constraints on the latter for a Bayesian observer model of perception. We have shown that the resulting likelihoods are in general asymmetric, with 7 absolute bias (data) b c relative bias (data) -4 0 bias(deg) 4 a low-noise stimulus -90 e 90 absolute bias (model) low external noise high external noise 3 high-noise stimulus -90 f 0 90 relative bias (model) 0 bias(deg) d 0 attraction -3 repulsion -90 0 orientation (deg) 90 -90 0 orientation (deg) 90 Figure 5: Biases in perceived orientation: Human data vs. Model prediction. a,d) Low- and highnoise orientation stimuli of the type used in [3, 16]. b) Humans show absolute biases in perceived orientation that are away from the cardinal orientations. Data replotted from [2] (pink squares) and [3] (green (black) triangles: bias for low (high) external noise). c) Relative bias between stimuli with different external noise level (high minus low). Data replotted from [3] (blue triangles) and [16] (red circles). e,f) Model predictions for absolute and relative bias. heavier tails away from the prior peaks. We demonstrated that such asymmetric likelihoods can lead to the counter-intuitive prediction that a Bayesian estimator is biased away from the peaks of the prior distribution. Interestingly, such repulsive biases have been reported for human perception of visual orientation, yet a principled and consistent explanation of their existence has been missing so far. Here, we suggest that these counter-intuitive biases directly follow from the asymmetries in the likelihood function induced by efficient neural encoding of the stimulus. The good match between our model predictions and the measured perceptual biases and orientation tuning characteristics of neurons in primary visual cortex provides further support of our framework. Previous work has suggested that there might be a link between stimulus statistics, neuronal tuning characteristics, and perceptual behavior based on efficient coding principles, yet none of these studies has recognized the importance of the resulting likelihood asymmetries [16, 11]. 
5 Discussion

We have presented a modeling framework for perception that combines efficient (en)coding and Bayesian decoding. Efficient coding imposes constraints on the tuning characteristics of a population of neurons according to the stimulus distribution (prior). It thus establishes a direct link between prior and likelihood, and provides clear constraints on the latter for a Bayesian observer model of perception. We have shown that the resulting likelihoods are in general asymmetric, with heavier tails away from the prior peaks. We demonstrated that such asymmetric likelihoods can lead to the counter-intuitive prediction that a Bayesian estimator is biased away from the peaks of the prior distribution. Interestingly, such repulsive biases have been reported for human perception of visual orientation, yet a principled and consistent explanation of their existence has been missing so far. Here, we suggest that these counter-intuitive biases follow directly from the asymmetries in the likelihood function induced by efficient neural encoding of the stimulus. The good match between our model predictions and the measured perceptual biases and orientation tuning characteristics of neurons in primary visual cortex provides further support for our framework. Previous work has suggested that there might be a link between stimulus statistics, neuronal tuning characteristics, and perceptual behavior based on efficient coding principles, yet none of these studies has recognized the importance of the resulting likelihood asymmetries [16, 11].

We have demonstrated here that such asymmetries can be crucial in explaining perceptual data, even though the resulting estimates appear “anti-Bayesian” at first sight (see also models of sensory adaptation [23]). Note that we do not provide a neural implementation of the Bayesian inference step. However, we and others have proposed various neural decoding schemes that can approximate Bayes’ least-squares estimation using efficient coding [26, 25, 22]. It is also worth pointing out that our estimator is set to minimize the total squared error, and that other choices of loss function (e.g., a MAP estimator) could lead to different predictions. Our framework is general and should be directly applicable to other modalities. In particular, it might provide a new explanation for perceptual biases that are hard to reconcile with traditional Bayesian approaches [5].

Acknowledgments
We thank M. Jogan and A. Tank for helpful comments on the manuscript. This work was partially supported by grant ONR N000141110744.

References
[1] M. Jones and B. C. Love. Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34:169–231, 2011.
[2] D. P. Andrews. Perception of contours in the central fovea. Nature, 205:1218–1220, 1965.
[3] A. Tomassini, M. J. Morgan, and J. A. Solomon. Orientation uncertainty reduces perceived obliquity. Vision Res., 50:541–547, 2010.
[4] W. S. Geisler and D. Kersten. Illusions, perception and Bayes. Nature Neuroscience, 5(6):508–510, 2002.
[5] M. O. Ernst. Perceptual learning: inverting the size-weight illusion. Current Biology, 19:R23–R25, 2009.
[6] G. H. Henry, B. Dreher, and P. O. Bishop. Orientation specificity of cells in cat striate cortex. J. Neurophysiol., 37(6):1394–1409, 1974.
[7] D. Rose and C. Blakemore. An analysis of orientation selectivity in the cat’s visual cortex. Exp. Brain Res., 20(1):1–17, 1974.
[8] N. V. Swindale. Orientation tuning curves: empirical description and estimation of parameters. Biol. Cybern., 78(1):45–56, 1998.
[9] R. L. De Valois, E. W. Yund, and N. Hepler. The orientation and direction selectivity of cells in macaque visual cortex. Vision Res., 22:531–544, 1982.
[10] B. Li, M. R. Peterson, and R. D. Freeman. The oblique effect: a neural basis in the visual cortex. J. Neurophysiol., 90:204–217, 2003.
[11] D. Ganguli and E. P. Simoncelli. Implicit encoding of prior probabilities in optimal neural populations. In Advances in Neural Information Processing Systems 23, pages 658–666, 2011.
[12] M. D. McDonnell and N. G. Stocks. Maximally informative stimuli and tuning curves for sigmoidal rate-coding neurons and populations. Phys. Rev. Lett., 101(5):058103, 2008.
[13] H. Helmholtz. Treatise on Physiological Optics (transl.). Thoemmes Press, Bristol, U.K., 2000. Original publication 1867.
[14] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, June 2002.
[15] D. C. Knill and W. Richards, editors. Perception as Bayesian Inference. Cambridge University Press, 1996.
[16] A. R. Girshick, M. S. Landy, and E. P. Simoncelli. Cardinal rules: visual orientation perception reflects knowledge of environmental statistics. Nat. Neurosci., 14(7):926–932, July 2011.
[17] M. Jazayeri and M. N. Shadlen. Temporal context calibrates interval timing. Nature Neuroscience, 13(8):914–916, 2010.
[18] A. A. Stocker and E. P. Simoncelli. Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience, pages 578–585, April 2006.
[19] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA, 1961.
[20] D. M. Coppola, H. R. Purves, A. N. McCoy, and D. Purves. The distribution of oriented contours in the real world. Proc. Natl. Acad. Sci. U.S.A., 95(7):4002–4006, 1998.
[21] N. Brunel and J.-P. Nadal. Mutual information, Fisher information and population coding. Neural Computation, 10(7):1731–1757, 1998.
[22] X.-X. Wei and A. A. Stocker. Bayesian inference with efficient neural population codes. In Artificial Neural Networks and Machine Learning - ICANN 2012, Lecture Notes in Computer Science, volume 7552, pages 523–530, Lausanne, Switzerland, 2012.
[23] A. A. Stocker and E. P. Simoncelli. Sensory adaptation within a Bayesian framework for perception. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1291–1298. MIT Press, Cambridge, MA, 2006. Oral presentation.
[24] D. C. Knill. Robust cue integration: A Bayesian model and evidence from cue-conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):1–24, 2007.
[25] D. Ganguli. Efficient coding and Bayesian inference with neural populations. PhD thesis, Center for Neural Science, New York University, New York, NY, September 2012.
[26] B. Fischer. Bayesian estimates from heterogeneous population codes. In Proc. IEEE Intl. Joint Conf. on Neural Networks. IEEE, 2010.
3 0.86299336 217 nips-2012-Mixability in Statistical Learning
Author: Tim V. Erven, Peter Grünwald, Mark D. Reid, Robert C. Williamson
Abstract: Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability. 1
4 0.86154205 118 nips-2012-Entangled Monte Carlo
Author: Seong-hwan Jun, Liangliang Wang, Alexandre Bouchard-Côté
Abstract: We propose a novel method for scalable parallelization of SMC algorithms, Entangled Monte Carlo simulation (EMC). EMC avoids the transmission of particles between nodes, and instead reconstructs them from the particle genealogy. In particular, we show that we can reduce the communication to the particle weights for each machine while efficiently maintaining implicit global coherence of the parallel simulation. We explain methods to efficiently maintain a genealogy of particles from which any particle can be reconstructed. We demonstrate using examples from Bayesian phylogenetics that the computational gain from parallelization using EMC significantly outweighs the cost of particle reconstruction. The timing experiments show that reconstruction of particles is indeed much more efficient as compared to transmission of particles. 1
5 0.83759153 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
Author: Nicolò Cesa-Bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz
Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1
6 0.83127159 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
7 0.8283335 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
8 0.82805634 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
9 0.8247112 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
10 0.8245734 65 nips-2012-Cardinality Restricted Boltzmann Machines
11 0.82376397 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
12 0.82322371 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
13 0.82040149 324 nips-2012-Stochastic Gradient Descent with Only One Projection
14 0.81979179 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
15 0.8184908 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
16 0.81843019 293 nips-2012-Relax and Randomize : From Value to Algorithms
17 0.81789851 358 nips-2012-Value Pursuit Iteration
18 0.81715971 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
19 0.81715709 187 nips-2012-Learning curves for multi-task Gaussian process regression
20 0.81713617 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration