nips nips2012 nips2012-217 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tim V. Erven, Peter Grünwald, Mark D. Reid, Robert C. Williamson
Abstract: Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. [sent-12, score-0.085]
2 We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. [sent-14, score-0.999]
3 Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. [sent-15, score-2.138]
4 We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. [sent-16, score-0.95]
5 In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability. [sent-17, score-0.215]
6 The goal is to select a function f̂ that maps X to a prediction f̂(X) of Y for a new pair (X, Y) from the same P*. [sent-22, score-0.092]
7 The quality of f̂ is measured by its excess risk, which is the expectation of its loss ℓ(Y, f̂(X)) minus the expected loss of the best prediction function f* in a given class of functions F. [sent-23, score-0.154]
8 In contrast, the setting of sequential prediction (also called online learning) [2] makes no probabilistic assumptions about the source of the data. [sent-25, score-0.085]
9 . . . , f̂_n, where f̂_t is chosen just before round t and maps x_t to a prediction of y_t. [sent-32, score-0.265]
10 . . . , ℓ(y_n, f̂_n(x_n)) on the actual observations minus the total loss of the best fixed prediction function f* in a class of functions F. [sent-36, score-0.312]
11 In sequential prediction the usual analysis involves giving guarantees about the performance of f̂_1, . . . [sent-37, score-0.085]
12 . . . , f̂_n in the worst case over all possible realisations of the data. [sent-40, score-0.206]
13 When stating rates of convergence, we will divide the worst-case regret by n, which makes the rates comparable to rates in the statistical learning setting. [sent-41, score-0.236]
14 Mapping out the relations between statistical learning and sequential prediction is an active area of investigation, and several connections are known. [sent-42, score-0.108]
15 For example, using any of a variety of online-to-batch conversion techniques [3], any sequential predictions f̂_1, . . . [sent-43, score-0.06]
16 . . . , f̂_n may be converted into a single statistical prediction f̂, and the statistical performance of f̂ is bounded by the sequential prediction performance of f̂_1, . . . [sent-46, score-0.337]
17 Moreover, a deep understanding of the relation between worst-case rates in both settings is provided by Abernethy, Agarwal, Bartlett and Rakhlin [4]. [sent-50, score-0.076]
18 Amongst others, their results imply that for many loss functions the worst-case rate in sequential prediction exceeds the worst-case rate in statistical learning. [sent-51, score-0.156]
19 Fast Rates In sequential prediction with a finite class F, it is known that the worst-case regret can be bounded by a constant if and only if the loss ℓ has the property of being mixable [5, 6] (subject to mild regularity conditions on the loss). [sent-52, score-0.465]
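As a concrete illustration of this claim (our own sketch, not taken from the paper): for an η-mixable loss, Vovk's aggregating algorithm guarantees regret at most (ln |F|)/η against a finite class. The toy below runs it for log-loss, which is 1-mixable, with a handful of constant-probability experts; the expert probabilities and the data-generating distribution are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy setting: binary outcomes, K "experts" that each always predict a fixed
# probability for y = 1; log-loss is ell(y, p) = -ln p(y).
K, n = 5, 2000
expert_probs = np.linspace(0.1, 0.9, K)   # expert k predicts P(y = 1) = expert_probs[k]
y = rng.binomial(1, 0.7, size=n)          # outcomes; the learner never sees this distribution

eta = 1.0                                 # log-loss is 1-mixable
log_w = np.zeros(K)                       # log-weights of the aggregating algorithm
alg_loss = 0.0
expert_loss = np.zeros(K)

for t in range(n):
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    p1 = float(np.dot(w, expert_probs))   # for log-loss and eta = 1 the weighted mixture of the
                                          # experts' distributions satisfies (1) with equality
    alg_loss += -np.log(p1 if y[t] == 1 else 1.0 - p1)
    ell = -np.log(np.where(y[t] == 1, expert_probs, 1.0 - expert_probs))
    expert_loss += ell
    log_w -= eta * ell                    # exponential-weights update

regret = alg_loss - expert_loss.min()
print(f"regret after {n} rounds: {regret:.3f}   (mixability bound: ln K = {np.log(K):.3f})")

Dividing the printed regret by n, as suggested above, gives an O(1/n) rate, the sequential analogue of the fast statistical rates discussed next.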
20 First, for 0/1-loss, fast rates (faster than O(1/√n)) are associated with Mammen and Tsybakov's margin condition [7, 8], which depends on a parameter κ. [sent-55, score-0.208]
21 In the nicest case, κ = 1, and then O(1/n) rates are possible. [sent-56, score-0.063]
22 Second, for log(arithmic) loss there is a single supermartingale condition that is essential to obtain fast rates in all convergence proofs of two-part minimum description length (MDL) estimators, and in many convergence proofs of Bayesian estimators. [sent-57, score-0.297]
23 Finally, (d) if stochastic mixability does not hold, then in general O(log |Fn |/n)-statistical learning rates cannot be achieved, at least not for 0/1-loss or for log-loss. [sent-65, score-1.0]
24 it contains all prediction functions for the given loss, then stochastic mixability turns out to be formally equivalent to ordinary mixability (if F is not full, then either condition may hold without the other). [sent-68, score-2.017]
25 Our contributions are to generalize these results, and to relate them to each other, to the notion of mixability from sequential prediction, and to the interpretation in terms of convexity of a set of pseudo-likelihoods. [sent-71, score-0.979]
26 This leads to our central conclusion: the concept of stochastic mixability is closely related to mixability and plays a fundamental role in achieving fast rates in the statistical learning setting. [sent-72, score-1.911]
27 Outline In §2 we define both ordinary mixability and stochastic mixability. [sent-73, score-1.014]
28 We show that two of the standard ways to express mixability have natural analogues that express stochastic mixability (leading to (f)). [sent-74, score-1.825]
29 A third interpretation of stochastic mixability and standard mixability in terms of sets (g) is described in §3. [sent-76, score-1.757]
30 The equivalence between mixability and stochastic mixability if F is full is presented in §4, where we also show that the equivalence need not hold if F is not full (e). [sent-77, score-1.845]
31 In §5, we turn our attention to a version of the margin condition that does not assume that F contains the Bayes optimal predictor and we show that (a slightly relaxed version of) stochastic mixability is equivalent to the margin condition, taking care of (a). [sent-78, score-1.167]
32 We show (§6) that if stochastic mixability holds, O(log |Fn |/n)-rates can always be achieved (c), and that in some cases in which it does not hold, O(log |Fn |/n)-rates cannot be achieved (d). [sent-79, score-0.937]
33 2 Mixability and Stochastic Mixability We now introduce the notions of mixability and stochastic mixability, showing two equivalent formulations of the latter. [sent-82, score-0.954]
34 For η > 0, a loss ℓ is called η-mixable if for any distribution π on A there exists a single prediction a_π such that ℓ(y, a_π) ≤ −(1/η) ln ∫ e^{−ηℓ(y,a)} π(da) for all y. (1) [sent-88, score-0.147]
35 It is called mixable if there exists an η > 0 such that it is η-mixable. [sent-89, score-0.324]
36 For example, if A = Y = {0, 1} and the loss is the 0/1-loss, ℓ_{0/1}(y, a) = 1{y ≠ a}, then the predictors are classifiers. [sent-95, score-0.084]
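To make inequality (1) concrete, here is a small numerical check (ours, purely illustrative) for log-loss on a binary outcome space: with η = 1 the π-mixture of the predicted distributions can serve as a_π and (1) holds with equality, while for η = 2 a grid search finds no single prediction satisfying (1) for this π, consistent with log-loss being η-mixable only for η ≤ 1. The two-point π and the grid are arbitrary choices.

import numpy as np

# Log-loss on Y = {0, 1}: a prediction is q = P(y = 1), with ell(y, q) = -ln q if y = 1 and -ln(1 - q) if y = 0.
def loss(y, q):
    return -np.log(q) if y == 1 else -np.log(1.0 - q)

preds = np.array([0.1, 0.9])      # support of a two-point distribution pi over predictions
pi = np.array([0.5, 0.5])

def rhs(y, eta):
    # Right-hand side of (1): -(1/eta) ln( sum_a pi(a) exp(-eta * ell(y, a)) )
    return -np.log(np.dot(pi, np.exp(-eta * np.array([loss(y, q) for q in preds])))) / eta

# eta = 1: the pi-mixture of the predicted distributions satisfies (1) with equality.
q_mix = float(np.dot(pi, preds))
for y in (0, 1):
    print(y, loss(y, q_mix), rhs(y, 1.0))

# eta = 2: no prediction on a fine grid satisfies (1) for both outcomes at once,
# consistent with log-loss being eta-mixable only for eta <= 1.
grid = np.linspace(1e-3, 1 - 1e-3, 2001)
ok = [q for q in grid if all(loss(y, q) <= rhs(y, 2.0) + 1e-12 for y in (0, 1))]
print("grid predictions satisfying (1) for eta = 2:", len(ok))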
37 For any η ≥ 0, we say that (ℓ, F, P*) is η-stochastically mixable if there exists an f* ∈ F such that E[e^{−ηℓ(Y,f(X))} / e^{−ηℓ(Y,f*(X))}] ≤ 1 for all f ∈ F. (3) [sent-100, score-0.324]
38 We call (ℓ, F, P*) stochastically mixable if there exists an η > 0 such that it is η-stochastically mixable. [sent-101, score-0.39]
39 By Jensen's inequality, we see that (3) implies 1 ≥ E[e^{−ηℓ(Y,f(X))} / e^{−ηℓ(Y,f*(X))}] ≥ e^{E[−η(ℓ(Y,f(X)) − ℓ(Y,f*(X)))]}, so that E[ℓ(Y, f*(X))] ≤ E[ℓ(Y, f(X))] for all f ∈ F, and hence the definition of stochastic mixability presumes that f* minimizes E[ℓ(Y, f(X))] over all f ∈ F. [sent-102, score-0.97]
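Condition (3) is easy to probe numerically. The following rough Monte Carlo sketch (our own toy, not from the paper) uses squared loss with a small grid of constant predictors and Gaussian outcomes, takes the empirical risk minimizer as a stand-in for f*, and reports the largest η on a grid for which the estimated left-hand side of (3) stays below 1 for every f; the sample size, grid, and tolerance are arbitrary.

import numpy as np

rng = np.random.default_rng(1)

# Toy statistical setting with squared loss ell(y, a) = (y - a)^2:
# Y = 0.5 + sigma * N(0, 1) and F is a small grid of constant predictors.
sigma = 0.5
y = 0.5 + sigma * rng.standard_normal(200_000)
F = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
f_star = F[np.argmin([np.mean((y - c) ** 2) for c in F])]   # empirical proxy for the risk minimizer

def condition_3_holds(eta):
    # Estimate E[ exp(-eta * (ell(Y, f) - ell(Y, f*))) ] for every f in F and check it is <= 1.
    excess = (y[:, None] - F[None, :]) ** 2 - (y[:, None] - f_star) ** 2
    return bool(np.all(np.exp(-eta * excess).mean(axis=0) <= 1.0 + 1e-3))

etas = np.round(np.arange(0.1, 4.01, 0.1), 2)
feasible = [eta for eta in etas if condition_3_holds(eta)]
print("largest eta on the grid for which (3) appears to hold:", max(feasible) if feasible else None)
# For this well-specified Gaussian toy, (3) can be shown to hold exactly iff eta <= 1/(2*sigma^2) = 2.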
40 it contains the true conditional density p*(y | x), then, because the log-loss is a proper loss [17], we must have f* = p* and then, for η = 1, trivially A_η(f ‖ f*) = 1 for all f ∈ F. [sent-118, score-0.093]
41 Thus if the model F is correct, then the log-loss is η-stochastically mixable for η = 1. [sent-119, score-0.308]
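A quick sanity check (ours) of the identity behind this claim: for log-loss with η = 1 and any conditional density f, the ratio in (3) has P*-expectation E[f(Y|X)/p*(Y|X)] = E_X[∑_y f(y|X)] = 1. The discrete spaces and random densities below are only for illustration.

import numpy as np

rng = np.random.default_rng(4)

# Discrete X with 4 values and Y with 3 values; p_star(y|x) is the true conditional density
# and f(y|x) is any other conditional density. Under P*, E[f(Y|X)/p_star(Y|X)] = E_X[sum_y f(y|X)] = 1.
p_star = rng.dirichlet(np.ones(3), size=4)
f = rng.dirichlet(np.ones(3), size=4)
p_x = np.full(4, 0.25)

ratio = sum(p_x[i] * p_star[i, y] * f[i, y] / p_star[i, y] for i in range(4) for y in range(3))
print(ratio)   # prints 1.0 up to floating-point error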
42 Equation 4 — which just expresses 1-stochastic mixability for log-loss — is used in all previous convergence theorems for 2-part MDL density estimation [10, 12, 11, 18], and, more implicitly, in various convergence theorems for Bayesian procedures, including the pioneering paper by Doob [9]. [sent-121, score-0.904]
43 For example, as first noted by [12], if F is a convex set of densities, then (4) also holds for η = 1, even if the model is incorrect, and, indeed, two-part MDL converges at fast rates in such cases (see [14] for a precise definition of what this means, as well as a more general treatment of (4)). [sent-123, score-0.115]
44 Kleijn and Van der Vaart [13], in their extensive analysis of Bayesian nonparametric inference when the model is wrong, also use the fact that (4) holds with η = 1 for convex models to show that fast posterior concentration rates hold for such models even if they do not contain the true p*. [sent-124, score-0.133]
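The convexity observation can also be checked numerically. The sketch below (our construction) takes a one-parameter convex family of densities on a three-point outcome space that does not contain the true distribution, approximates f* by minimizing the expected log-loss over a fine grid of mixture weights, and verifies that E[f(Y)/f*(Y)] ≤ 1 for every f in the family; dropping the conditioning on X is a simplification, and the particular densities are arbitrary.

import numpy as np

# Log-loss over a three-point outcome space; conditioning on X is dropped for simplicity.
p_true = np.array([0.6, 0.3, 0.1])        # true distribution, not in the model
q1 = np.array([0.2, 0.5, 0.3])
q2 = np.array([0.5, 0.2, 0.3])

lambdas = np.linspace(0.0, 1.0, 10_001)
model = lambdas[:, None] * q1 + (1 - lambdas[:, None]) * q2   # a convex model F (discretized)

# f* = the density in F minimizing the expected log-loss E[-ln f(Y)] under p_true.
f_star = model[np.argmin(-np.log(model) @ p_true)]

# Check (4), i.e. 1-stochastic mixability for log-loss: E[f(Y)/f_star(Y)] <= 1 for all f in F.
ratios = (model / f_star) @ p_true
print("max_f E[f(Y)/f*(Y)] =", ratios.max())   # <= 1 up to the lambda discretization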
45 The definition of stochastic mixability looks similar to (2), but whereas π is a distribution on predictions, P* is a distribution on outcomes (X, Y). [sent-125, score-0.966]
46 It is therefore quite surprising that stochastic mixability can also be expressed in a way that looks like (1), which provides a first hint that the relation goes deeper. [sent-127, score-0.979]
47 Then (ℓ, F, P*) is η-stochastically mixable if and only if for any distribution π on F there exists a single predictor f* ∈ F such that E[ℓ(Y, f*(X))] ≤ E[−(1/η) ln ∫ e^{−ηℓ(Y,f(X))} π(df)]. [sent-130, score-0.39]
48 3 The Convexity Interpretation There is a third way to express mixability, as the convexity of a set of so-called pseudo-likelihoods. [sent-133, score-0.059]
49 We will now show that stochastic mixability can also be interpreted as convexity of the corresponding set in the statistical learning setting. [sent-134, score-1.019]
50 [15], we first note that the essential feature of a loss ℓ with corresponding set of predictions A is the set of achievable losses they induce, L = {ℓ(·, a) : a ∈ A}, viewed as functions on Y. [sent-136, score-0.135]
51 If we were to reparametrize the loss by a different set of predictions A′, while keeping L the same, then essentially nothing would change. [sent-138, score-0.069]
52 For example, for 0/1-loss standard ways to parametrize predictions are by A = {0, 1}, by A = {−1, +1}, or by A = R with the interpretation that predicting a ≥ 0 maps to the prediction 1 and a < 0 maps to the prediction 0. [sent-139, score-0.168]
53 And like for the first two expressions of mixability, there is an analogous convexity interpretation for stochastic mixability. [sent-151, score-0.149]
54 In order to define pseudo-likelihoods in the statistical setting, we need to take into account that the predictions f (X) of the predictors in F are not deterministic, but depend on X. [sent-152, score-0.08]
55 There is no need to introduce a conditional analogue of the super-prediction set. [sent-155, score-0.106]
56 Then η-stochastic mixability of (ℓ, F, P*) is equivalent to the requirement that min_{p ∈ P_F(η)} E[−(1/η) ln p(Y|X)] = min_{p ∈ co(P_F(η))} E[−(1/η) ln p(Y|X)]. (6) [sent-161, score-0.974]
57 Equation 6 expresses that the convex hull operator has no effect, which means that P_F(η) looks convex from the perspective of P*. [sent-165, score-0.069]
58 Thus we obtain an interpretation of η-stochastic mixability as effective convexity of the set of pseudo-likelihoods P_F(η) with respect to P*. [sent-167, score-0.94]
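A toy numerical reading of this equivalence (ours; the conditioning on X is again dropped for simplicity): for log-loss with η = 1 the pseudo-likelihoods are just the densities, and mixtures of Bernoulli densities are Bernoulli densities with the mixed parameter, so both sides of (6) are easy to compute. For a well-specified class the two minima coincide, while for a badly misspecified two-point class the convex-hull minimum is strictly smaller, so effective convexity, and with it 1-stochastic mixability, fails.

import numpy as np

# Both sides of (6) for log-loss with eta = 1, where a pseudo-likelihood is just a density.
# Outcomes are Bernoulli, so a density is a single number q = P(Y = 1), and a mixture of
# Bernoulli densities is the Bernoulli density with the mixed parameter.
def expected_logloss(q, p_true):
    return -(p_true * np.log(q) + (1 - p_true) * np.log(1 - q))

def both_minima(params, p_true):
    min_over_F = min(expected_logloss(q, p_true) for q in params)
    hull = np.linspace(min(params), max(params), 10_001)      # co(P_F(1)) in parameter form
    min_over_hull = float(expected_logloss(hull, p_true).min())
    return min_over_F, min_over_hull

# Well-specified class: the two minima coincide, as (6) requires.
print(both_minima([0.3, 0.5, 0.7], p_true=0.5))
# Misspecified two-point class: the hull minimum is strictly smaller, so P_F(1) is not
# effectively convex with respect to this P*, and 1-stochastic mixability fails here.
print(both_minima([0.1, 0.9], p_true=0.5))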
59 Figure 1 suggests that f* should be unique if the loss is stochastically mixable, which is almost right. [sent-168, score-0.114]
60 If (ℓ, F, P*) is stochastically mixable and there exist f*, g* ∈ F such that E[ℓ(Y, f*(X))] = E[ℓ(Y, g*(X))] = min_{f∈F} E[ℓ(Y, f(X))], then ℓ(Y, f*(X)) = ℓ(Y, g*(X)) almost surely. [sent-170, score-0.374]
61 Then, by Theorem 2 and (strict) convexity of −ln, min_{f∈F} E[ℓ(Y, f(X))] ≤ E[−(1/η) ln((1/2) e^{−ηℓ(Y,f*(X))} + (1/2) e^{−ηℓ(Y,g*(X))})] ≤ E[(1/2) ℓ(Y, f*(X)) + (1/2) ℓ(Y, g*(X))] = min_{f∈F} E[ℓ(Y, f(X))]. [sent-173, score-0.084]
62 4 When Mixability and Stochastic Mixability Are the Same Having observed that mixability and stochastic mixability of a loss share several common features, we now show that in specific cases the two concepts even coincide. [sent-176, score-1.849]
63 More specifically, Theorem 5 below shows that a loss ℓ (meeting two requirements) is η-mixable if and only if it is η-stochastically mixable relative to F_full, the set of all functions from X to A, and all distributions P*. [sent-177, score-0.369]
64 We say (ℓ, F) is η-stochastically mixable if (ℓ, F, P*) is η-stochastically mixable for all distributions P* on X × Y. [sent-186, score-0.616]
65 Let η > 0 and suppose ℓ is a loss such that its pseudo-likelihood set e^{−ηS} is closed and pre-supportable. [sent-189, score-0.062]
66 Then (ℓ, F_full) is η-stochastically mixable if and only if ℓ is η-mixable. [sent-190, score-0.308]
67 Their conditions also imply (by their Lemma 10) that the loss ℓ is proper, which implies that e^{−ηS} is closed and pre-supportable. [sent-194, score-0.067]
68 The first establishes conditions for when mixability implies stochastic mixability, borrowing from a similar result for log-loss by Li [12]. [sent-197, score-0.968]
69 The second lemma shows that stochastic mixability implies mixability. [sent-202, score-0.971]
70 The above two lemmata are sufficient to prove the equivalence of stochastic and ordinary mixability. [sent-206, score-0.163]
71 In order to show that η-mixability of ℓ implies η-stochastic mixability of (ℓ, F_full), we note that the Bayes-optimal predictor f*_B for any ℓ and P* must be in F_full, and so Lemma 6 implies (ℓ, F_full, P*) is η-stochastically mixable for any distribution P*. [sent-208, score-1.239]
72 Conversely, that η-stochastic mixability of (ℓ, F_full) implies the η-mixability of ℓ follows immediately from Lemma 7. [sent-209, score-0.883]
73 In this case, we can have either stochastic mixability without ordinary mixability or the converse. [sent-211, score-1.878]
74 Consider a loss function ℓ that is not mixable in the ordinary sense, e.g. [sent-212, score-0.433]
75 Then clearly ℓ is stochastically mixable relative to F. [sent-215, score-0.387]
76 We do not know whether we can have stochastic mixability without ordinary mixability in nontrivial cases, and plan to investigate this in future work. [sent-217, score-1.878]
77 Because 0/1-loss is not standard mixable, by Theorem 5, 0/1-loss is not stochastically mixable relative to Θ. [sent-224, score-0.387]
78 But then we must also have that log-loss is not stochastically mixable relative to F. [sent-225, score-0.387]
79 5 Stochastic Mixability and the Margin Condition The excess risk of any f compared to f* is the mean of the excess loss ℓ(Y, f(X)) − ℓ(Y, f*(X)): d(f, f*) = E[ℓ(Y, f(X)) − ℓ(Y, f*(X))]. [sent-226, score-0.13]
80 The margin condition, introduced by Mammen and Tsybakov [7, 8] for 0/1-loss, is satisfied with constants κ ≥ 1 and c0 > 0 if c0 V(f, f*)^κ ≤ d(f, f*) for all f ∈ F. (7) [sent-229, score-0.063]
81 In some practical cases, the margin condition only holds for a subset of the model such that V(f, f*) ≤ ε_0 for some ε_0 > 0 [8]. [sent-233, score-0.136]
82 Stochastic mixability, as we have defined it, is directly related to the margin condition for the case κ = 1. [sent-235, score-0.121]
83 In order to relate it to other values of κ, we need a little more flexibility: for given ε ≥ 0 and (ℓ, F, P*), we define F_ε = {f*} ∪ {f ∈ F | d(f, f*) ≥ ε}, (8) which excludes a band of predictors that approximate the best predictor in the model to within excess risk ε. [sent-236, score-0.114]
84 Then the margin condition (7) is satisfied if and only if there exists a constant C > 0 such that, for all ε > 0, (ℓ, F_ε, P*) is η-stochastically mixable for η = Cε^{(κ−1)/κ}. [sent-240, score-0.445]
85 In particular, if the margin condition is satisfied with constants κ and c0, we can take C to be an explicit constant depending only on κ, c0, and V. [sent-241, score-0.121]
86 This theorem gives a new interpretation of the margin condition as the rate at which η has to go to 0 when the model F is approximated by η-stochastically mixable models F_ε. [sent-242, score-0.483]
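The following sketch (our own toy, with all numerical choices arbitrary) illustrates this reading of the theorem. For 0/1-loss with X uniform on [0,1], P(Y=1|X=x) = x and threshold classifiers, a direct calculation gives d(f, f*) = P(f(X) ≠ f*(X))², a margin-type relation with exponent κ = 2, and the largest η for which F_ε is η-stochastically mixable (found by bisection) indeed shrinks at roughly the ε^{(κ−1)/κ} = √ε scale.

import numpy as np

# Toy 0/1-loss problem: X ~ U[0,1], P(Y=1 | X=x) = x, Bayes classifier f*(x) = 1{x >= 1/2},
# model = threshold classifiers f_t(x) = 1{x >= t}. Expectations over X are approximated
# by a fine grid; expectations over Y given X are computed exactly.
xs = np.linspace(0.0, 1.0, 20_001)
thresholds = np.linspace(0.05, 0.95, 181)
f_star = (xs >= 0.5).astype(float)
preds = (xs[:, None] >= thresholds[None, :]).astype(float)

exc1 = (preds != 1.0).astype(float) - (f_star[:, None] != 1.0).astype(float)  # excess loss when Y = 1
exc0 = (preds != 0.0).astype(float) - (f_star[:, None] != 0.0).astype(float)  # excess loss when Y = 0
d = (xs[:, None] * exc1 + (1 - xs[:, None]) * exc0).mean(axis=0)              # excess risks d(f_t, f*)

def eta_is_feasible(eta, keep):
    e = xs[:, None] * np.exp(-eta * exc1[:, keep]) + (1 - xs[:, None]) * np.exp(-eta * exc0[:, keep])
    return bool(np.all(e.mean(axis=0) <= 1.0 + 1e-5))

def eta_max(eps):
    keep = d >= eps                      # F_eps without f*, whose own term in (3) is trivially 1
    lo, hi = 0.0, 5.0
    for _ in range(40):                  # bisection for the largest feasible eta
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if eta_is_feasible(mid, keep) else (lo, mid)
    return lo

for eps in (0.04, 0.01, 0.0025):
    print(f"eps = {eps:<7} eta_max ~ {eta_max(eps):.3f}    2*sqrt(eps) = {2 * np.sqrt(eps):.3f}")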
87 By the following corollary, proved in the additional material, stochastic mixability of the whole model F is equivalent to the best case of the margin condition. [sent-243, score-1.017]
88 Then (ℓ, F, P*) is stochastically mixable if and only if there exists a constant c0 > 0 such that the margin condition (7) is satisfied with κ = 1. [sent-246, score-0.511]
89 Let, for all n, P_n be any set of distributions on X × Y such that for all P* ∈ P_n, the generalized margin condition (7) holds for κ = 1 and uniform constant c0 not depending on n, with model F_n. [sent-249, score-0.157]
90 Tsybakov [8] suggest that there exist estimators f̂_n : (X × Y)^n → . . . [sent-252, score-0.221]
91 of Zhang [21] and with f̂_n set to Zhang's information-risk-minimization estimator (to see this, at sample size n apply Zhang's result with α set to 0 and a prior π that is uniform on F_n, so that − log π(f) = log |F_n| for any f ∈ F_n). [sent-256, score-0.271]
92 By Theorem 8, this means that, for any bounded loss function ℓ, if, for some η > 0 and all n, we have that (ℓ, F_n, P*) is η-stochastically mixable for all P* ∈ P_n, then Zhang's estimator satisfies (9). [sent-257, score-0.356]
93 Hence, for bounded loss functions, stochastic mixability implies a uniform O(log |Fn |/n) rate. [sent-258, score-1.025]
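For intuition about the rate itself, here is a small simulation sketch (ours; it uses plain empirical risk minimization rather than Zhang's information-risk-minimization estimator, and all constants are arbitrary): a well-specified threshold class under 0/1-loss with a hard margin, so that the margin condition holds with κ = 1, with the average excess risk of ERM printed next to log |F_n|/n for a few sample sizes.

import numpy as np

rng = np.random.default_rng(3)

# Toy fast-rate setting for 0/1-loss: X ~ U[0,1] and P(Y=1 | X=x) = 0.5 + 0.2*sign(x - 0.5),
# a hard margin: the excess risk of any threshold is proportional to its disagreement with the
# Bayes classifier, a kappa = 1 margin/Bernstein relation. F_n is a fixed grid of thresholds.
thresholds = np.linspace(0.0, 1.0, 101)

def true_excess_risk(t):
    # Excess 0/1-risk of threshold t over the Bayes threshold 0.5 equals 0.4 * |t - 0.5| here.
    return 0.4 * abs(t - 0.5)

def average_erm_excess(n, trials=300):
    total = 0.0
    for _ in range(trials):
        x = rng.random(n)
        y = (rng.random(n) < 0.5 + 0.2 * np.sign(x - 0.5)).astype(int)
        emp = ((x[:, None] >= thresholds[None, :]) != y[:, None]).mean(axis=0)
        total += true_excess_risk(thresholds[np.argmin(emp)])
    return total / trials

for n in (100, 400, 1600, 6400):
    print(f"n = {n:<6} ERM excess risk ~ {average_erm_excess(n):.4f}    log|F|/n = {np.log(len(thresholds)) / n:.4f}")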
94 We just explained that, if ℓ is stochastically mixable relative to F_n, then uniform O(log |F_n|/n) rates can be achieved. [sent-262, score-0.471]
95 We now illustrate that if this is not the case, then, at least if ℓ is 0/1-loss or if ℓ is log-loss, uniform O(log |F_n|/n) rates cannot be achieved in general. [sent-263, score-0.084]
96 This establishes that if stochastic mixability does not hold, then uniform rates of O(log |Fn |/n) are not achievable in general for 0/1-loss. [sent-282, score-1.053]
97 Equation 3 looks completely different from the margin condition, yet results connecting the two, somewhat similar to (a), albeit very implicitly, already appear in [23] and [24]. [sent-286, score-0.092]
98 Also, the paper by Grünwald [14] contains a connection with the margin condition somewhat similar to Theorem 8, but involving a significantly weaker version of stochastic mixability in which the inequality (3) only holds with some slack. [sent-287, score-1.094]
99 [25], who showed the role of convexity of F for fast rates in the regression setting with squared loss. [sent-297, score-0.134]
100 A stochastic view of optimal regret through minimax duality. [sent-330, score-0.097]
wordName wordTfidf (topN-words)
[('mixability', 0.864), ('mixable', 0.308), ('fn', 0.206), ('pf', 0.087), ('ffull', 0.083), ('ordinary', 0.077), ('stochastic', 0.073), ('stochastically', 0.066), ('rates', 0.063), ('margin', 0.063), ('chernov', 0.063), ('supermartingale', 0.059), ('condition', 0.058), ('mammen', 0.052), ('mdl', 0.052), ('loss', 0.048), ('convexity', 0.047), ('prediction', 0.046), ('tsybakov', 0.045), ('sequential', 0.039), ('ln', 0.037), ('predictors', 0.036), ('kalnishkan', 0.036), ('losses', 0.033), ('super', 0.033), ('excess', 0.033), ('interpretation', 0.029), ('looks', 0.029), ('predictor', 0.029), ('gr', 0.028), ('pn', 0.027), ('theorem', 0.025), ('fb', 0.025), ('regret', 0.024), ('fast', 0.024), ('nicta', 0.024), ('copf', 0.024), ('epy', 0.024), ('erven', 0.024), ('kleijn', 0.024), ('zhdanov', 0.024), ('statistical', 0.023), ('log', 0.022), ('paris', 0.022), ('hellinger', 0.021), ('anu', 0.021), ('nwald', 0.021), ('uniform', 0.021), ('predictions', 0.021), ('achievable', 0.02), ('zhang', 0.019), ('arlot', 0.019), ('implies', 0.019), ('requirement', 0.019), ('hold', 0.018), ('probabilit', 0.018), ('corollary', 0.018), ('bartlett', 0.018), ('universit', 0.017), ('colt', 0.017), ('bayes', 0.017), ('equivalent', 0.017), ('kf', 0.016), ('abernethy', 0.016), ('risk', 0.016), ('af', 0.016), ('proofs', 0.016), ('proper', 0.016), ('exists', 0.016), ('tim', 0.016), ('slack', 0.015), ('holds', 0.015), ('lemma', 0.015), ('estimators', 0.015), ('suppose', 0.014), ('australia', 0.014), ('py', 0.014), ('reid', 0.014), ('expresses', 0.014), ('analogue', 0.014), ('ee', 0.014), ('nition', 0.014), ('densities', 0.014), ('equivalence', 0.013), ('essential', 0.013), ('williamson', 0.013), ('bousquet', 0.013), ('conditional', 0.013), ('convex', 0.013), ('relation', 0.013), ('annals', 0.013), ('relative', 0.013), ('theorems', 0.013), ('maps', 0.013), ('express', 0.012), ('establishes', 0.012), ('van', 0.012), ('interpreted', 0.012), ('surely', 0.012), ('minus', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 217 nips-2012-Mixability in Statistical Learning
2 0.058796067 254 nips-2012-On the Sample Complexity of Robust PCA
Author: Matthew Coudron, Gilad Lerman
Abstract: We estimate the rate of convergence and sample complexity of a recent robust estimator for a generalized version of the inverse covariance matrix. This estimator is used in a convex algorithm for robust subspace recovery (i.e., robust PCA). Our model assumes a sub-Gaussian underlying distribution and an i.i.d. sample from it. Our main result shows with high probability that the norm of the difference between the generalized inverse covariance of the underlying distribution and its estimator from an i.i.d. sample of size N is of order O(N^{−0.5+ε}) for arbitrarily small ε > 0 (affecting the probabilistic estimate); this rate of convergence is close to the one of direct covariance estimation, i.e., O(N^{−0.5}). Our precise probabilistic estimate implies for some natural settings that the sample complexity of the generalized inverse covariance estimation when using the Frobenius norm is O(D^{2+δ}) for arbitrarily small δ > 0 (whereas the sample complexity of direct covariance estimation with Frobenius norm is O(D^2)). These results provide similar rates of convergence and sample complexity for the corresponding robust subspace recovery algorithm. To the best of our knowledge, this is the only work analyzing the sample complexity of any robust PCA algorithm. 1
3 0.050542355 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright
Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T ) convergence rate for strongly convex objectives in d dimensions and O( s(log d)/T ) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T ), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.
4 0.048991106 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
5 0.048089128 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
Author: Chi Jin, Liwei Wang
Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.
6 0.047550857 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation
7 0.047502942 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
8 0.04333052 324 nips-2012-Stochastic Gradient Descent with Only One Projection
9 0.040617414 37 nips-2012-Affine Independent Variational Inference
10 0.039872326 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
11 0.039803818 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
12 0.037923101 16 nips-2012-A Polynomial-time Form of Robust Regression
13 0.037570775 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
14 0.036114447 280 nips-2012-Proper losses for learning from partial labels
15 0.034845136 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
16 0.0333842 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
17 0.03161978 179 nips-2012-Learning Manifolds with K-Means and K-Flats
18 0.029378645 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
19 0.029103529 275 nips-2012-Privacy Aware Learning
20 0.028142193 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
topicId topicWeight
[(0, 0.083), (1, 0.003), (2, 0.052), (3, 0.002), (4, 0.037), (5, 0.004), (6, -0.008), (7, 0.033), (8, -0.015), (9, 0.004), (10, 0.018), (11, 0.012), (12, -0.014), (13, -0.018), (14, -0.007), (15, -0.028), (16, -0.001), (17, -0.014), (18, -0.02), (19, 0.056), (20, -0.007), (21, 0.007), (22, 0.03), (23, 0.024), (24, -0.035), (25, 0.011), (26, -0.038), (27, 0.008), (28, -0.061), (29, -0.02), (30, 0.071), (31, 0.001), (32, 0.027), (33, 0.01), (34, 0.043), (35, -0.006), (36, 0.003), (37, 0.014), (38, -0.001), (39, -0.034), (40, 0.045), (41, 0.002), (42, -0.006), (43, -0.017), (44, -0.013), (45, -0.016), (46, 0.037), (47, 0.013), (48, 0.1), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.8643136 217 nips-2012-Mixability in Statistical Learning
2 0.66576368 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
Author: Aharon Birnbaum, Shai Shalev-Shwartz
Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α)L*_γ + ϵ, where L*_γ is the optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in time exp(Õ(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model. 1
3 0.61825043 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
4 0.58830142 280 nips-2012-Proper losses for learning from partial labels
Author: Jesús Cid-sueiro
Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1
5 0.5646432 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
6 0.54591393 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
7 0.53199279 16 nips-2012-A Polynomial-time Form of Robust Regression
8 0.53147453 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation
9 0.51890081 285 nips-2012-Query Complexity of Derivative-Free Optimization
10 0.51793516 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
11 0.5110724 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
12 0.50418591 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
13 0.46849674 161 nips-2012-Interpreting prediction markets: a stochastic approach
14 0.4572514 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
15 0.4553712 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
16 0.45469528 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability
17 0.44620275 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
18 0.44563642 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
19 0.44355273 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
20 0.43858403 179 nips-2012-Learning Manifolds with K-Means and K-Flats
topicId topicWeight
[(0, 0.021), (21, 0.03), (27, 0.015), (38, 0.142), (39, 0.012), (42, 0.055), (54, 0.035), (55, 0.02), (60, 0.248), (74, 0.032), (76, 0.148), (80, 0.074), (92, 0.037)]
simIndex simValue paperId paperTitle
1 0.81493276 118 nips-2012-Entangled Monte Carlo
Author: Seong-hwan Jun, Liangliang Wang, Alexandre Bouchard-côté
Abstract: We propose a novel method for scalable parallelization of SMC algorithms, Entangled Monte Carlo simulation (EMC). EMC avoids the transmission of particles between nodes, and instead reconstructs them from the particle genealogy. In particular, we show that we can reduce the communication to the particle weights for each machine while efficiently maintaining implicit global coherence of the parallel simulation. We explain methods to efficiently maintain a genealogy of particles from which any particle can be reconstructed. We demonstrate using examples from Bayesian phylogenetic that the computational gain from parallelization using EMC significantly outweighs the cost of particle reconstruction. The timing experiments show that reconstruction of particles is indeed much more efficient as compared to transmission of particles. 1
same-paper 2 0.78474075 217 nips-2012-Mixability in Statistical Learning
3 0.77190703 201 nips-2012-Localizing 3D cuboids in single-view images
Author: Jianxiong Xiao, Bryan Russell, Antonio Torralba
Abstract: In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes. In contrast to recent approaches that rely on detecting vanishing points of the scene and grouping line segments to form cuboids, we build a discriminative parts-based detector that models the appearance of the cuboid corners and internal edges while enforcing consistency to a 3D cuboid model. Our model copes with different 3D viewpoints and aspect ratios and is able to detect cuboids across many different object categories. We introduce a database of images with cuboid annotations that spans a variety of indoor and outdoor scenes and show qualitative and quantitative results on our collected database. Our model out-performs baseline detectors that use 2D constraints alone on the task of localizing cuboid corners. 1
4 0.77014601 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference
Author: Xue-xin Wei, Alan Stocker
Abstract: A common challenge for Bayesian models of perception is the fact that the two fundamental Bayesian components, the prior distribution and the likelihood function, are formally unconstrained. Here we argue that a neural system that emulates Bayesian inference is naturally constrained by the way it represents sensory information in populations of neurons. More specifically, we show that an efficient coding principle creates a direct link between prior and likelihood based on the underlying stimulus distribution. The resulting Bayesian estimates can show biases away from the peaks of the prior distribution, a behavior seemingly at odds with the traditional view of Bayesian estimation, yet one that has been reported in human perception. We demonstrate that our framework correctly accounts for the repulsive biases previously reported for the perception of visual orientation, and show that the predicted tuning characteristics of the model neurons match the reported orientation tuning properties of neurons in primary visual cortex. Our results suggest that efficient coding is a promising hypothesis in constraining Bayesian models of perceptual inference. 1 Motivation Human perception is not perfect. Biases have been observed in a large number of perceptual tasks and modalities, of which the most salient ones constitute many well-known perceptual illusions. It has been suggested, however, that these biases do not reflect a failure of perception but rather an observer’s attempt to optimally combine the inherently noisy and ambiguous sensory information with appropriate prior knowledge about the world [13, 4, 14]. This hypothesis, which we will refer to as the Bayesian hypothesis, has indeed proven quite successful in providing a normative explanation of perception at a qualitative and, more recently, quantitative level (see e.g. [15]). A major challenge in forming models based on the Bayesian hypothesis is the correct selection of two main components: the prior distribution (belief) and the likelihood function. This has encouraged some to criticize the Bayesian hypothesis altogether, claiming that arbitrary choices for these components always allow for unjustified post-hoc explanations of the data [1]. We do not share this criticism, referring to a number of successful attempts to constrain prior beliefs and likelihood functions based on principled grounds. For example, prior beliefs have been defined as the relative distribution of the sensory variable in the environment in cases where these statistics are relatively easy to measure (e.g. local visual orientations [16]), or where it can be assumed that subjects have learned them over the course of the experiment (e.g. time perception [17]). Other studies have constrained the likelihood function according to known noise characteristics of neurons that are crucially involved in the specific perceptual process (e.g motion tuned neurons in visual cor∗ http://www.sas.upenn.edu/ astocker/lab 1 world neural representation efficient encoding percept Bayesian decoding Figure 1: Encoding-decoding framework. A stimulus representing a sensory variable θ elicits a firing rate response R = {r1 , r2 , ..., rN } in a population of N neurons. The perceptual task is to generate a ˆ good estimate θ(R) of the presented value of the sensory variable based on this population response. Our framework assumes that encoding is efficient, and decoding is Bayesian based on the likelihood p(R|θ), the prior p(θ), and a squared-error loss function. tex [18]). 
However, we agree that finding appropriate constraints is generally difficult and that prior beliefs and likelihood functions have been often selected on the basis of mathematical convenience. Here, we propose that the efficient coding hypothesis [19] offers a joint constraint on the prior and likelihood function in neural implementations of Bayesian inference. Efficient coding provides a normative description of how neurons encode sensory information, and suggests a direct link between measured perceptual discriminability, neural tuning characteristics, and environmental statistics [11]. We show how this link can be extended to a full Bayesian account of perception that includes perceptual biases. We validate our model framework against behavioral as well as neural data characterizing the perception of visual orientation. We demonstrate that we can account not only for the reported perceptual biases away from the cardinal orientations, but also for the specific response characteristics of orientation-tuned neurons in primary visual cortex. Our work is a novel proposal of how two important normative hypotheses in perception science, namely efficient (en)coding and Bayesian decoding, might be linked. 2 Encoding-decoding framework We consider perception as an inference process that takes place along the simplified neural encodingdecoding cascade illustrated in Fig. 11 . 2.1 Efficient encoding Efficient encoding proposes that the tuning characteristics of a neural population are adapted to the prior distribution p(θ) of the sensory variable such that the population optimally represents the sensory variable [19]. Different definitions of “optimally” are possible, and may lead to different results. Here, we assume an efficient representation that maximizes the mutual information between the sensory variable and the population response. With this definition and an upper limit on the total firing activity, the square-root of the Fisher Information must be proportional to the prior distribution [12, 21]. In order to constrain the tuning curves of individual neurons in the population we also impose a homogeneity constraint, requiring that there exists a one-to-one mapping F (θ) that transforms the ˜ physical space with units θ to a homogeneous space with units θ = F (θ) in which the stimulus distribution becomes uniform. This defines the mapping as θ F (θ) = p(χ)dχ , (1) −∞ which is the cumulative of the prior distribution p(θ). We then assume a neural population with identical tuning curves that evenly tiles the stimulus range in this homogeneous space. The population provides an efficient representation of the sensory variable θ according to the above constraints [11]. ˜ The tuning curves in the physical space are obtained by applying the inverse mapping F −1 (θ). Fig. 2 1 In the context of this paper, we consider ‘inferring’, ‘decoding’, and ‘estimating’ as synonymous. 2 stimulus distribution d samples # a Fisher information discriminability and average firing rates and b firing rate [ Hz] efficient encoding F likelihood function F -1 likelihood c symmetric asymmetric homogeneous space physical space Figure 2: Efficient encoding constrains the likelihood function. a) Prior distribution p(θ) derived from stimulus statistics. b) Efficient coding defines the shape of the tuning curves in the physical space by transforming a set of homogeneous neurons using a mapping F −1 that is the inverse of the cumulative of the prior p(θ) (see Eq. (1)). 
c) As a result, the likelihood shape is constrained by the prior distribution showing heavier tails on the side of lower prior density. d) Fisher information, discrimination threshold, and average firing rates are all uniform in the homogeneous space. illustrates the applied efficient encoding scheme, the mapping, and the concept of the homogeneous space for the example of a symmetric, exponentially decaying prior distribution p(θ). The key idea here is that by assuming efficient encoding, the prior (i.e. the stimulus distribution in the world) directly constrains the likelihood function. In particular, the shape of the likelihood is determined by the cumulative distribution of the prior. As a result, the likelihood is generally asymmetric, as shown in Fig. 2, exhibiting heavier tails on the side of the prior with lower density. 2.2 Bayesian decoding Let us consider a population of N sensory neurons that efficiently represents a stimulus variable θ as described above. A stimulus θ0 elicits a specific population response that is characterized by the vector R = [r1 , r2 , ..., rN ] where ri is the spike-count of the ith neuron over a given time-window τ . Under the assumption that the variability in the individual firing rates is governed by a Poisson process, we can write the likelihood function over θ as N p(R|θ) = (τ fi (θ))ri −τ fi (θ) e , ri ! i=1 (2) ˆ with fi (θ) describing the tuning curve of neuron i. We then define a Bayesian decoder θLSE as the estimator that minimizes the expected squared-error between the estimate and the true stimulus value, thus θp(R|θ)p(θ)dθ ˆ θLSE (R) = , (3) p(R|θ)p(θ)dθ where we use Bayes’ rule to appropriately combine the sensory evidence with the stimulus prior p(θ). 3 Bayesian estimates can be biased away from prior peaks Bayesian models of perception typically predict perceptual biases toward the peaks of the prior density, a characteristic often considered a hallmark of Bayesian inference. This originates from the 3 a b prior attraction prior prior attraction likelihood repulsion! likelihood c prior prior repulsive bias likelihood likelihood mean posterior mean posterior mean Figure 3: Bayesian estimates biased away from the prior. a) If the likelihood function is symmetric, then the estimate (posterior mean) is, on average, shifted away from the actual value of the sensory variable θ0 towards the prior peak. b) Efficient encoding typically leads to an asymmetric likelihood function whose normalized mean is away from the peak of the prior (relative to θ0 ). The estimate is determined by a combination of prior attraction and shifted likelihood mean, and can exhibit an overall repulsive bias. c) If p(θ0 ) < 0 and the likelihood is relatively narrow, then (1/p(θ)2 ) > 0 (blue line) and the estimate is biased away from the prior peak (see Eq. (6)). common approach of choosing a parametric description of the likelihood function that is computationally convenient (e.g. Gaussian). As a consequence, likelihood functions are typically assumed to be symmetric (but see [23, 24]), leaving the bias of the Bayesian estimator to be mainly determined by the shape of the prior density, i.e. leading to biases toward the peak of the prior (Fig. 3a). In our model framework, the shape of the likelihood function is constrained by the stimulus prior via efficient neural encoding, and is generally not symmetric for non-flat priors. It has a heavier tail on the side with lower prior density (Fig. 3b). 
The intuition is that due to the efficient allocation of neural resources, the side with smaller prior density will be encoded less accurately, leading to a broader likelihood function on that side. The likelihood asymmetry pulls the Bayes’ least-squares estimate away from the peak of the prior while at the same time the prior pulls it toward its peak. Thus, the resulting estimation bias is the combination of these two counter-acting forces - and both are determined by the prior! 3.1 General derivation of the estimation bias In the following, we will formally derive the mean estimation bias b(θ) of the proposed encodingdecoding framework. Specifically, we will study the conditions for which the bias is repulsive i.e. away from the peak of the prior density. ˆ We first re-write the estimator θLSE (3) by replacing θ with the inverse of its mapping to the homo−1 ˜ geneous space, i.e., θ = F (θ). The motivation for this is that the likelihood in the homogeneous space is symmetric (Fig. 2). Given a value θ0 and the elicited population response R, we can write the estimator as ˜ ˜ ˜ ˜ θp(R|θ)p(θ)dθ F −1 (θ)p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) ˆ θLSE (R) = = . ˜ ˜ ˜ p(R|θ)p(θ)dθ p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) Calculating the derivative of the inverse function and noting that F is the cumulative of the prior density, we get 1 1 1 ˜ ˜ ˜ ˜ ˜ ˜ dθ = dθ. dF −1 (θ) = (F −1 (θ)) dθ = dθ = −1 (θ)) ˜ F (θ) p(θ) p(F ˆ Hence, we can simplify θLSE (R) as ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)p(R|F −1 (θ))dθ . ˜ ˜ p(R|F −1 (θ))dθ With ˜ K(R, θ) = ˜ p(R|F −1 (θ)) ˜ ˜ p(R|F −1 (θ))dθ 4 we can further simplify the notation and get ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)K(R, θ)dθ . (4) ˆ ˜ In order to get the expected value of the estimate, θLSE (θ), we marginalize (4) over the population response space S, ˆ ˜ ˜ ˜ ˜ θLSE (θ) = p(R)F −1 (θ)K(R, θ)dθdR S = F −1 ˜ (θ)( ˜ ˜ p(R)K(R, θ)dR)dθ = ˜ ˜ ˜ F −1 (θ)L(θ)dθ, S where we define ˜ L(θ) = ˜ p(R)K(R, θ)dR. S ˜ ˜ ˜ It follows that L(θ)dθ = 1. Due to the symmetry in this space, it can be shown that L(θ) is ˜0 . Intuitively, L(θ) can be thought as the normalized ˜ symmetric around the true stimulus value θ average likelihood in the homogeneous space. We can then compute the expected bias at θ0 as b(θ0 ) = ˜ ˜ ˜ ˜ F −1 (θ)L(θ)dθ − F −1 (θ0 ) (5) ˜ This is expression is general where F −1 (θ) is defined as the inverse of the cumulative of an arbitrary ˜ prior density p(θ) (see Eq. (1)) and the dispersion of L(θ) is determined by the internal noise level. ˜ ˜ Assuming the prior density to be smooth, we expand F −1 in a neighborhood (θ0 − h, θ0 + h) that is larger than the support of the likelihood function. Using Taylor’s theorem with mean-value forms of the remainder, we get 1 ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ F −1 (θ) = F −1 (θ0 ) + F −1 (θ0 ) (θ − θ0 ) + F −1 (θx ) (θ − θ0 )2 , 2 ˜ ˜ ˜ with θx lying between θ0 and θ. By applying this expression to (5), we find ˜ θ0 +h b(θ0 ) = = 1 2 ˜ θ0 −h 1 −1 ˜ ˜ ˜ ˜ ˜ 1 F (θx )θ (θ − θ0 )2 L(θ)dθ = ˜ 2 2 ˜ θ0 +h −( ˜ θ0 −h p(θx )θ ˜ ˜ 2 ˜ ˜ 1 )(θ − θ0 ) L(θ)dθ = p(θx )3 4 ˜ θ0 +h 1 ˜ − θ0 )2 L(θ)dθ ˜ ˜ ˜ ( ) ˜(θ ˜ p(F −1 (θx )) θ ( 1 ˜ ˜ ˜ ˜ ) (θ − θ0 )2 L(θ)dθ. p(θx )2 θ ˜ θ0 −h ˜ θ0 +h ˜ θ0 −h In general, there is no simple rule to judge the sign of b(θ0 ). However, if the prior is monotonic ˜ ˜ on the interval F −1 ((θ0 − h, θ0 + h)), then the sign of ( p(θ1 )2 ) is always the same as the sign of x 1 1 ( p(θ0 )2 ) . 
Also, if the likelihood is sufficiently narrow we can approximate ( p(θ1 )2 ) by ( p(θ0 )2 ) , x and therefore approximate the bias as b(θ0 ) ≈ C( 1 ) , p(θ0 )2 (6) where C is a positive constant. The result is quite surprising because it states that as long as the prior is monotonic over the support of the likelihood function, the expected estimation bias is always away from the peaks of the prior! 3.2 Internal (neural) versus external (stimulus) noise The above derivation of estimation bias is based on the assumption that all uncertainty about the sensory variable is caused by neural response variability. This level of internal noise depends on the response magnitude, and thus can be modulated e.g. by changing stimulus contrast. This contrastcontrolled noise modulation is commonly exploited in perceptual studies (e.g. [18]). Internal noise will always lead to repulsive biases in our framework if the prior is monotonic. If internal noise is low, the likelihood is narrow and thus the bias is small. Increasing internal noise leads to increasingly 5 larger biases up to the point where the likelihood becomes wide enough such that monotonicity of the prior over the support of the likelihood is potentially violated. Stimulus noise is another way to modulate the noise level in perception (e.g. random-dot motion stimuli). Such external noise, however, has a different effect on the shape of the likelihood function as compared to internal noise. It modifies the likelihood function (2) by convolving it with the noise kernel. External noise is frequently chosen as additive and symmetric (e.g. zero-mean Gaussian). It is straightforward to prove that such symmetric external noise does not lead to a change in the mean of the likelihood, and thus does not alter the repulsive effect induced by its asymmetry. However, by increasing the overall width of the likelihood, the attractive influence of the prior increases, resulting in an estimate that is closer to the prior peak than without external noise2 . 4 Perception of visual orientation We tested our framework by modelling the perception of visual orientation. Our choice was based on the fact that i) we have pretty good estimates of the prior distribution of local orientations in natural images, ii) tuning characteristics of orientation selective neurons in visual cortex are wellstudied (monkey/cat), and iii) biases in perceived stimulus orientation have been well characterized. We start by creating an efficient neural population based on measured prior distributions of local visual orientation, and then compare the resulting tuning characteristics of the population and the predicted perceptual biases with reported data in the literature. 4.1 Efficient neural model population for visual orientation Previous studies measured the statistics of the local orientation in large sets of natural images and consistently found that the orientation distribution is multimodal, peaking at the two cardinal orientations as shown in Fig. 4a [16, 20]. We assumed that the visual system’s prior belief over orientation p(θ) follows this distribution and approximate it formally as p(θ) ∝ 2 − | sin(θ)| (black line in Fig. 4b) . (7) Based on this prior distribution we defined an efficient neural representation for orientation. We assumed a population of model neurons (N = 30) with tuning curves that follow a von-Mises distribution in the homogeneous space on top of a constant spontaneous firing rate (5 Hz). 
4.2 Predicted tuning characteristics of neurons in primary visual cortex

The orientation tuning characteristics of our model population closely match neurophysiological data from neurons in primary visual cortex (V1). Efficient encoding predicts that the distribution of the neurons' preferred orientations follows the prior, with more neurons tuned to cardinal than to oblique orientations by a factor of approximately 1.5. A similar ratio has been found for neurons in area V1 of monkey/cat [9, 10]. Also, the tuning widths of the model neurons vary between 25 and 42 deg depending on their preferred tuning (see Fig. 4c), matching the measured tuning width ratio of 0.6 between neurons tuned to the cardinal versus the oblique orientations [9].

An important prediction of our model is that most of the tuning curves should be asymmetric. Such asymmetries have indeed been reported for the orientation tuning of neurons in area V1 [6, 7, 8]. We computed the asymmetry index for our model population as defined in previous studies [6, 7], and plotted it as a function of the preferred tuning of each neuron (Fig. 4d). The overall asymmetry index in our model population is 1.24 ± 0.11, which approximately matches the measured values for neurons in area V1 of the cat (1.26 ± 0.06) [6]. The model also predicts that the tuning curves of neurons tuned to the cardinal and oblique orientations should be more symmetric than those of neurons tuned to orientations in between (Fig. 4d). Finally, neurons with tuning widths at the lower and upper end of the range are predicted to exhibit less asymmetry than those neurons whose widths lie in between these extremes (illustrated in Fig. 4e). These last two predictions have not been tested yet.

Figure 4: Tuning characteristics of model neurons. a) Distribution of local orientations in natural images, replotted from [16]. b) Prior used in the model (black) and predicted tuning curves according to efficient coding (red). c) Tuning width as a function of preferred orientation. d) Tuning curves of cardinal and oblique neurons are more symmetric than those tuned to orientations in between. e) Both narrowly and broadly tuned neurons show less asymmetry than neurons with tuning widths in between.
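The width and asymmetry trends can be probed with a short numerical sketch that rebuilds the population from the previous sketch. The half-width-based asymmetry measure used below is only a stand-in for the index of [6, 7], whose exact definition we do not reproduce, and gain and grid are again assumptions; the sketch is meant to show the qualitative trend of narrower, more symmetric tuning near the extrema of the prior and broader, more skewed tuning in between.

```python
# Tuning width (full width at half height above baseline) and a simple left/right
# half-width asymmetry measure for a few model neurons.  The asymmetry measure is a
# stand-in for the index of [6, 7]; absolute width values depend on the assumed gain
# and width convention, so only the trend across preferred orientations is meaningful.
import numpy as np

N, kappa, base, gain = 30, 1.6, 5.0, 25.0
deg2rad = np.pi / 180.0
theta = np.linspace(-90.0, 90.0, 18001)
dtheta = theta[1] - theta[0]
p = 2.0 - np.abs(np.sin(theta * deg2rad))
p /= np.sum(p * dtheta)
F = np.cumsum(p) * dtheta
F /= F[-1]

centers = (np.arange(N) + 0.5) / N
curves = base + gain * np.exp(kappa * (np.cos(2 * np.pi * (F[None, :] - centers[:, None])) - 1.0))

for r in curves[5:25:5]:                        # a few neurons away from the domain edges
    k = np.argmax(r)
    half = base + 0.5 * (r[k] - base)           # half height above the spontaneous rate
    left = np.argmax(r[:k + 1][::-1] < half)    # samples from peak to half height, left flank
    right = np.argmax(r[k:] < half)             # samples from peak to half height, right flank
    width = (left + right) * dtheta
    asym = max(left, right) / max(min(left, right), 1)   # 1 = symmetric, > 1 = skewed
    print(round(theta[k], 1), round(width, 1), round(asym, 2))
```

Neurons preferring orientations near the prior peak come out narrow and nearly symmetric, while those preferring intermediate orientations are broader and skewed toward the side of lower prior density.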
4.3 Predicted perceptual biases

Our model framework also provides specific predictions for the expected perceptual biases. Humans show systematic biases in the perceived orientation of visual stimuli such as, e.g., arrays of Gabor patches (Fig. 5a,d). Two types of biases can be distinguished. First, perceived orientations show an absolute bias away from the cardinal orientations, and thus away from the peaks of the orientation prior [2, 3]. We refer to these biases as absolute because they are typically measured by adjusting a noise-free reference until it matches the orientation of the test stimulus. Interestingly, these repulsive absolute biases grow larger as the external stimulus noise gets smaller (see Fig. 5b). Second, the relative bias between the perceived overall orientations of a high-noise and a low-noise stimulus is toward the cardinal orientations, as shown in Fig. 5c, and thus toward the peak of the prior distribution [3, 16].

The predicted perceptual biases of our model are shown in Fig. 5e,f. We computed the likelihood function according to (2) and used the prior in (7). External noise was modeled by convolving the stimulus likelihood function with a Gaussian (different widths for different noise levels). The predictions match both the reported absolute biases away from, and the relative biases toward, the cardinal orientations. Note that our model framework correctly accounts for the fact that less external noise leads to larger absolute biases (see also the discussion in Section 3.2).

Figure 5: Biases in perceived orientation: human data vs. model prediction. a,d) Low- and high-noise orientation stimuli of the type used in [3, 16]. b) Humans show absolute biases in perceived orientation that are away from the cardinal orientations. Data replotted from [2] (pink squares) and [3] (green (black) triangles: bias for low (high) external noise). c) Relative bias between stimuli with different external noise levels (high minus low). Data replotted from [3] (blue triangles) and [16] (red circles). e,f) Model predictions for absolute and relative bias.
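A qualitative version of these predictions can be sketched as follows. The internal- and external-noise widths, the grid, and the neglect of circular wrap-around are our assumptions; the likelihood is taken to be symmetric in the homogeneous space, external noise is a symmetric Gaussian convolution in orientation space as described above, and the prior is Eq. (7) taken literally on (-90, 90) deg, so that 0 deg plays the role of the cardinal peak.

```python
# Qualitative sketch of the predicted absolute and relative biases (cf. Fig. 5e,f).
# Internal noise: symmetric Gaussian likelihood in the homogeneous space; external
# noise: convolution with a symmetric Gaussian in orientation space; readout: posterior
# mean.  Noise widths and grid are assumptions; wrap-around at +/-90 deg is ignored.
import numpy as np

deg2rad = np.pi / 180.0
theta = np.linspace(-90.0, 90.0, 3601)
dtheta = theta[1] - theta[0]
p = 2.0 - np.abs(np.sin(theta * deg2rad))      # prior of Eq. (7)
p /= np.sum(p * dtheta)
F = np.cumsum(p) * dtheta
F /= F[-1]

def bias(theta0, sigma_int=0.03, sigma_ext=0.0):
    """Posterior-mean bias (deg) at theta0 for given internal/external noise levels."""
    t0 = np.interp(theta0, theta, F)
    lik = np.exp(-0.5 * ((F - t0) / sigma_int) ** 2)     # asymmetric as a function of theta
    if sigma_ext > 0:                                    # symmetric external noise
        kernel = np.exp(-0.5 * (theta / sigma_ext) ** 2)
        lik = np.convolve(lik, kernel / kernel.sum(), mode='same')
    post = lik * p
    post /= np.sum(post * dtheta)
    return np.sum(theta * post * dtheta) - theta0

for theta0 in (20.0, 40.0, 60.0):
    b_low, b_high = bias(theta0, sigma_ext=2.0), bias(theta0, sigma_ext=10.0)
    # repulsion away from the prior peak at 0 deg is strongest for low external noise;
    # the relative bias (high minus low) points back toward the cardinal orientation
    print(theta0, round(b_low, 2), round(b_high, 2), round(b_high - b_low, 2))
```

With these settings the low-noise bias comes out repulsive, and adding external noise shrinks it, so the relative bias points back toward the cardinal orientation, in line with Fig. 5e,f.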
5 Discussion

We have presented a modeling framework for perception that combines efficient (en)coding and Bayesian decoding. Efficient coding imposes constraints on the tuning characteristics of a population of neurons according to the stimulus distribution (prior). It thus establishes a direct link between prior and likelihood, and provides clear constraints on the latter for a Bayesian observer model of perception. We have shown that the resulting likelihoods are in general asymmetric, with heavier tails away from the prior peaks. We demonstrated that such asymmetric likelihoods can lead to the counter-intuitive prediction that a Bayesian estimator is biased away from the peaks of the prior distribution. Interestingly, such repulsive biases have been reported for human perception of visual orientation, yet a principled and consistent explanation of their existence has been missing so far. Here, we suggest that these counter-intuitive biases directly follow from the asymmetries in the likelihood function induced by efficient neural encoding of the stimulus. The good match between our model predictions and the measured perceptual biases and orientation tuning characteristics of neurons in primary visual cortex provides further support for our framework. Previous work has suggested that there might be a link between stimulus statistics, neuronal tuning characteristics, and perceptual behavior based on efficient coding principles, yet none of these studies has recognized the importance of the resulting likelihood asymmetries [16, 11]. We have demonstrated here that such asymmetries can be crucial in explaining perceptual data, even though the resulting estimates appear "anti-Bayesian" at first sight (see also models of sensory adaptation [23]).

Note that we do not provide a neural implementation of the Bayesian inference step. However, we and others have proposed various neural decoding schemes that can approximate Bayes' least-squares estimation using efficient coding [26, 25, 22]. It is also worth pointing out that our estimator is set to minimize the total squared error, and that other choices of the loss function (e.g. a MAP estimator) could lead to different predictions. Our framework is general and should be directly applicable to other modalities. In particular, it might provide a new explanation for perceptual biases that are hard to reconcile with traditional Bayesian approaches [5].

Acknowledgments

We thank M. Jogan and A. Tank for helpful comments on the manuscript. This work was partially supported by grant ONR N000141110744.

References

[1] M. Jones and B. C. Love. Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34, 169–231, 2011.
[2] D. P. Andrews. Perception of contours in the central fovea. Nature, 205:1218–1220, 1965.
[3] A. Tomassini, M. J. Morgan, and J. A. Solomon. Orientation uncertainty reduces perceived obliquity. Vision Res, 50, 541–547, 2010.
[4] W. S. Geisler and D. Kersten. Illusions, perception and Bayes. Nature Neuroscience, 5(6):508–510, 2002.
[5] M. O. Ernst. Perceptual learning: inverting the size-weight illusion. Current Biology, 19:R23–R25, 2009.
[6] G. H. Henry, B. Dreher, and P. O. Bishop. Orientation specificity of cells in cat striate cortex. J Neurophysiol, 37(6):1394–1409, 1974.
[7] D. Rose and C. Blakemore. An analysis of orientation selectivity in the cat's visual cortex. Exp Brain Res, 20(1):1–17, 1974.
[8] N. V. Swindale. Orientation tuning curves: empirical description and estimation of parameters. Biol Cybern, 78(1):45–56, 1998.
[9] R. L. De Valois, E. W. Yund, and N. Hepler. The orientation and direction selectivity of cells in macaque visual cortex. Vision Res, 22, 531–544, 1982.
[10] B. Li, M. R. Peterson, and R. D. Freeman. The oblique effect: a neural basis in the visual cortex. J. Neurophysiol., 90, 204–217, 2003.
[11] D. Ganguli and E. P. Simoncelli. Implicit encoding of prior probabilities in optimal neural populations. In Adv. Neural Information Processing Systems (NIPS 23), vol. 23:658–666, 2011.
[12] M. D. McDonnell and N. G. Stocks. Maximally Informative Stimuli and Tuning Curves for Sigmoidal Rate-Coding Neurons and Populations. Phys Rev Lett, 101(5):058103, 2008.
[13] H. Helmholtz. Treatise on Physiological Optics (transl.). Thoemmes Press, Bristol, U.K., 2000. Original publication 1867.
[14] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, June 2002.
[15] D. C. Knill and W. Richards, editors. Perception as Bayesian Inference. Cambridge University Press, 1996.
[16] A. R. Girshick, M. S. Landy, and E. P. Simoncelli. Cardinal rules: visual orientation perception reflects knowledge of environmental statistics. Nat Neurosci, 14(7):926–932, Jul 2011.
[17] M. Jazayeri and M. N. Shadlen. Temporal context calibrates interval timing. Nature Neuroscience, 13(8):914–916, 2010.
[18] A. A. Stocker and E. P. Simoncelli. Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience, pages 578–585, April 2006.
[19] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA, 1961.
[20] D. M. Coppola, H. R. Purves, A. N. McCoy, and D. Purves. The distribution of oriented contours in the real world. Proc Natl Acad Sci U S A, 95(7):4002–4006, 1998.
[21] N. Brunel and J.-P. Nadal. Mutual information, Fisher information and population coding. Neural Computation, 10, 7, 1731–1757, 1998.
[22] X.-X. Wei and A. A. Stocker. Bayesian inference with efficient neural population codes. In Lecture Notes in Computer Science, Artificial Neural Networks and Machine Learning - ICANN 2012, Lausanne, Switzerland, volume 7552, pages 523–530, 2012.
[23] A. A. Stocker and E. P. Simoncelli. Sensory adaptation within a Bayesian framework for perception. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1291–1298. MIT Press, Cambridge, MA, 2006. Oral presentation.
[24] D. C. Knill. Robust cue integration: A Bayesian model and evidence from cue-conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):1–24, 2007.
[25] D. Ganguli. Efficient coding and Bayesian inference with neural populations. PhD thesis, Center for Neural Science, New York University, New York, NY, September 2012.
[26] B. Fischer. Bayesian estimates from heterogeneous population codes. In Proc. IEEE Intl. Joint Conf. on Neural Networks. IEEE, 2010.
5 0.7308057 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
6 0.67457473 38 nips-2012-Algorithms for Learning Markov Field Policies
7 0.67390877 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
8 0.67305595 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
9 0.67275429 275 nips-2012-Privacy Aware Learning
10 0.67258435 292 nips-2012-Regularized Off-Policy TD-Learning
11 0.6723932 179 nips-2012-Learning Manifolds with K-Means and K-Flats
12 0.67171353 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
13 0.67085004 34 nips-2012-Active Learning of Multi-Index Function Models
14 0.67067993 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
15 0.67032701 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
16 0.67029357 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
17 0.67004567 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
18 0.67003024 348 nips-2012-Tractable Objectives for Robust Policy Optimization
19 0.66975731 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
20 0.66947883 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation