nips nips2005 nips2005-47 knowledge-graph by maker-knowledge-mining

47 nips-2005-Consistency of one-class SVM and related algorithms


Source: pdf

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Consistency of one-class SVM and related algorithms Régis Vert, Laboratoire de Recherche en Informatique, Université Paris-Sud, 91405 Orsay Cedex, France; Masagroup, 24 Bd de l'Hôpital, 75005 Paris, France. Regis. [sent-1, score-0.039]

2 fr Jean-Philippe Vert, Geostatistics Center, Ecole des Mines de Paris - ParisTech, 77300 Fontainebleau, France. Jean-Philippe. [sent-3, score-0.057]

3 Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. [sent-6, score-0.471]

4 These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. [sent-7, score-0.494]

5 (2) This framework encompasses in particular the classical support vector machine (SVM) [1] when φ(u) = max(1 − u, 0). [sent-15, score-0.066]

6 Recent years have witnessed important theoretical advances aimed at understanding the behavior of such regularized algorithms when n tends to infinity and λ decreases to 0. [sent-16, score-0.323]

7 In particular the consistency and convergence rates of the two-class SVM (see, e. [sent-17, score-0.391]

8 , [2, 3, 4] and references therein) have been studied in detail, as well as the shape of the asymptotic decision function [5, 6]. [sent-19, score-0.062]

9 All results published so far study the case where λ decreases as the number of points tends to infinity (or, equivalently, where λσ −d converges to 0 if one uses the classical non-normalized version of the Gaussian kernel instead of (2)). [sent-20, score-0.338]

10 Although it seems natural to reduce regularization as more and more training data are available –even more than natural, it is the spirit of regularization [7, 8]–, there is at least one important situation where λ is typically held fixed: the one-class SVM [9]. [sent-21, score-0.41]

11 The estimation is performed by thresholding the function output by the one-class SVM, that is, the SVM (1) with only positive examples; in that case λ is supposed to determine the quantile level. [sent-23, score-0.153]

12 Although it is known that the fraction of examples in the selected region converges to the desired quantile level α [9], it is still an open question whether the region converges towards a quantile, that is, a region of minimum volume. [sent-24, score-0.396]

13 The main contribution of this paper is to show that Bayes consistency can be obtained for algorithms that solve (1) without decreasing λ, if instead the bandwidth σ of the Gaussian kernel decreases at a suitable rate. [sent-26, score-0.577]

14 We prove upper bounds on the convergence rate of the classification error towards the Bayes risk for a variety of functions φ and of distributions P , in particular for SVM (Theorem 6). [sent-27, score-0.409]

15 Moreover, we provide an explicit description of the function asymptotically output by the algorithms, and establish convergence rates towards this limit in the L2 norm (Theorem 7). [sent-28, score-0.383]

16 Due to lack of space we limit ourselves in this extended abstract to the statement of the main results (Section 2) and sketch the proof of the main theorem (Theorem 3) that underlies all other results in Section 3. [sent-30, score-0.549]
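
To make the fixed-regularization regime described above concrete, here is a minimal sketch (not code from the paper) that fits a one-class SVM with a Gaussian kernel whose bandwidth shrinks with the sample size, then reads off its decision function. It uses scikit-learn's OneClassSVM, which parameterizes the regularization through ν rather than λ; the bandwidth schedule n^(-1/6) is purely illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

def fit_level_set_estimator(n, nu=0.5):
    """Fit a one-class SVM on n Gaussian samples with a bandwidth shrinking in n.

    The regularization (controlled here through nu) is held fixed while the
    kernel bandwidth sigma decreases with n, mirroring the regime studied in
    the paper; thresholding the decision function gives a level set estimate.
    """
    X = rng.standard_normal((n, 2))        # sample from a standard 2-D Gaussian
    sigma = n ** (-1.0 / 6.0)              # illustrative shrinking bandwidth
    gamma = 1.0 / (2.0 * sigma ** 2)       # sklearn's RBF parameterization
    return OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X)

for n in (100, 1000, 5000):
    clf = fit_level_set_estimator(n)
    probe = np.array([[0.0, 0.0], [3.0, 3.0]])   # a high- and a low-density point
    print(n, clf.decision_function(probe))       # positive inside the estimated set
```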

17 2 Notations and main results Let (X, Y ) be a pair of random variables taking values in Rd × {−1, 1}, with distribution P . [sent-32, score-0.05]

18 We assume throughout this paper that the marginal distribution of X is absolutely continuous with respect to Lebesgue measure, with density ρ : Rd → R, and that it has support included in a compact set X ⊂ Rd. [sent-33, score-0.268]

19 We denote by η : Rd → [0, 1] a measurable version of the conditional probability of Y = 1 given X. [sent-34, score-0.088]

20 The normalized Gaussian radial basis function (RBF) kernel kσ with bandwidth parameter σ > 0 is defined for any (x, x′) ∈ Rd × Rd by kσ(x, x′) = (1 / (√(2π) σ)^d) exp( −‖x − x′‖² / (2σ²) ), and the corresponding reproducing kernel Hilbert space (RKHS) is denoted by Hσ. [sent-35, score-0.412]

21 We write κσ = (√(2π) σ)^(−d) for the normalizing constant that ensures that the kernel integrates to 1. [sent-36, score-0.096]
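
As a sanity check on this normalization, the following sketch (my own illustration; the grid and bandwidth are arbitrary choices) evaluates kσ and verifies numerically that it integrates to about 1 in dimension d = 1.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma):
    """Normalized Gaussian RBF kernel k_sigma(x, x') that integrates to 1 over R^d."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    d = x.shape[-1]
    kappa = (np.sqrt(2.0 * np.pi) * sigma) ** (-d)   # normalizing constant kappa_sigma
    sq_dist = np.sum((x - xp) ** 2, axis=-1)
    return kappa * np.exp(-sq_dist / (2.0 * sigma ** 2))

xs = np.linspace(-10.0, 10.0, 20001)[:, None]        # fine grid in d = 1
vals = gaussian_kernel(xs, np.zeros((1, 1)), sigma=0.5)
print(vals.sum() * (xs[1, 0] - xs[0, 0]))            # ~1.0
```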

22 For example, for the hinge loss φ(u) = max(0, 1 − u) one can take L(r) = 1, and for the squared hinge loss φ(u) = max(0, 1 − u)² one can take L(r) = 2(r + 1). [sent-39, score-0.422]

23 • Finally, the L2-norm regularized φ-risk is, for any f ∈ M: Rφ,0(f) = EP[φ(Y f(X))] + λ ‖f‖²_{L2}, where ‖f‖²_{L2} = ∫_{Rd} f(x)² dx ∈ [0, +∞]. [sent-40, score-0.107]

24 The minima of the three risk functionals defined above over their respective domains are denoted by R∗, R∗φ,σ and R∗φ,0, respectively. [sent-41, score-0.219]

25 Each of these risks has an empirical counterpart where the expectation with respect to P is replaced by an average over an i.i.d. sample. [sent-42, score-0.114]

26 In particular, the following empirical version of Rφ,σ will be used: for all σ > 0 and f ∈ Hσ, Rφ,σ(f) = (1/n) Σ_{i=1}^{n} φ(Yi f(Xi)) + λ ‖f‖²_{Hσ}. [sent-49, score-0.057]
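
A direct way to read this objective: with f = Σ_i α_i kσ(x_i, ·), the values f(x_j) and the RKHS norm both come from the Gram matrix K. The sketch below (my own illustration, not the paper's solver) simply evaluates the empirical regularized φ-risk for the hinge loss at a given coefficient vector.

```python
import numpy as np

def hinge(u):
    """Hinge loss phi(u) = max(0, 1 - u), applied elementwise."""
    return np.maximum(0.0, 1.0 - u)

def empirical_reg_risk(alpha, K, y, lam):
    """Empirical regularized phi-risk of f = sum_i alpha_i k_sigma(x_i, .).

    K is the n x n Gram matrix of the normalized Gaussian kernel, so that
    f(x_j) = (K @ alpha)[j] and ||f||_{H_sigma}^2 = alpha^T K alpha.
    """
    f_vals = K @ alpha
    return hinge(y * f_vals).mean() + lam * alpha @ K @ alpha

# toy usage: three points, all labeled +1 (the one-class setting)
K = np.array([[1.0, 0.5, 0.1], [0.5, 1.0, 0.5], [0.1, 0.5, 1.0]])
print(empirical_reg_risk(np.array([0.2, 0.2, 0.2]), K, np.ones(3), lam=0.5))
```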

27 The main focus of this paper is the analysis of learning algorithms that minimize the empirical φ-risk regularized by the RKHS norm, Rφ,σ, and their limit as the number of points n tends to infinity and the kernel width σ decreases to 0 at a suitable rate, λ being kept fixed. [sent-50, score-0.806]

28 Roughly speaking, our main result shows that in this situation, if φ is a convex loss function, the minimization of Rφ,σ asymptotically amounts to minimizing Rφ,0 . [sent-51, score-0.321]

29 This stems from the fact that the empirical average term in the definition of Rφ,σ converges to its corresponding expectation, while the norm in Hσ of a function f decreases to its L2 norm when σ decreases to zero. [sent-52, score-0.592]

30 To turn this intuition into a rigorous statement, we need a few more assumptions about the minimizer of Rφ,0 and about P . [sent-53, score-0.154]

31 First, we observe that the minimizer of Rφ,0 is indeed well-defined and can often be explicitly computed: Lemma 1 For any x ∈ Rd, let fφ,0(x) = arg min_{α ∈ R} { ρ(x)[η(x)φ(α) + (1 − η(x))φ(−α)] + λα² }. [sent-54, score-0.19]

32 Then fφ,0 is measurable and satisfies Rφ,0(fφ,0) = inf_{f ∈ M} Rφ,0(f). Second, we provide below a general result that shows how to control the excess Rφ,0-risk of the empirical minimizer of the Rφ,σ-risk, for which we need to recall the notion of modulus of continuity [12]. [sent-55, score-0.829]

33 Definition 2 (Modulus of continuity) Let f be a Lebesgue measurable function from Rd to R. [sent-56, score-0.088]

34 Then its modulus of continuity in the L1-norm is defined for any δ ≥ 0 as follows: ω(f, δ) = sup_{0 ≤ ‖t‖ ≤ δ} ‖f(· + t) − f(·)‖_{L1},  (3) [sent-57, score-0.281]

35 where ‖t‖ is the Euclidean norm of t ∈ Rd. [sent-59, score-0.109]
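
The L1 modulus of continuity can be approximated numerically on a grid; the sketch below (an illustration with my own grid and shift choices) does so for a 1-D indicator function, for which ω(f, δ) ≈ 2δ, i.e. β = 1 in the Hölder-type condition ω(fφ,0, h) ≤ c1 h^β used later.

```python
import numpy as np

def modulus_L1(f, delta, lo=-5.0, hi=5.0, m=20001, n_shifts=50):
    """Grid approximation of the L1 modulus of continuity of a 1-D function f."""
    xs = np.linspace(lo, hi, m)
    dx = xs[1] - xs[0]
    best = 0.0
    for t in np.linspace(0.0, delta, n_shifts):
        best = max(best, np.sum(np.abs(f(xs + t) - f(xs))) * dx)
    return best

indicator = lambda x: ((x >= 0.0) & (x <= 1.0)).astype(float)
for d in (0.01, 0.1, 0.5):
    print(d, modulus_L1(indicator, d))    # roughly 2 * delta
```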

36 Our main result can now be stated as follows: Theorem 3 (Main Result) Let σ1 > σ > 0, 0 < p < 2, δ > 0, and let f̂φ,σ denote a minimizer of the Rφ,σ risk over Hσ. [sent-60, score-0.455]

37 Assume that the marginal density ρ is bounded, and let M = sup_{x ∈ Rd} ρ(x). [sent-61, score-0.238]

38 of (4) bound the estimation error associated with the Gaussian RKHS, which naturally tends to be small when the amount of training data increases and when the RKHS is 'small', i. [sent-69, score-0.218]

39 As is usually the case in such variance/bias splittings, the variance term here depends on the dimension d of the input space. [sent-72, score-0.061]

40 The third term measures the error due to penalizing the L2 -norm of a fixed function in Hσ1 by its . [sent-74, score-0.061]

41 This is a price to pay to get a small estimation error. [sent-76, score-0.052]

42 As for the fourth term, it is a bound on the approximation error of the Gaussian RKHS. [sent-77, score-0.035]

43 In order to highlight the type of convergence rates one can obtain from Theorem 3, let us assume that the φ loss function is Lipschitz on R (e. [sent-79, score-0.281]

44 , take the hinge loss), and suppose that for some 0 ≤ β ≤ 1, c1 > 0, and for any h ≥ 0, the function fφ,0 satisfies the following inequality: ω(fφ,0, h) ≤ c1 h^β. [sent-81, score-0.114]

45 Theorem 3 shows that minimizing the Rφ,σ risk for a well-chosen width σ yields an algorithm that is consistent for the Rφ,0-risk. [sent-88, score-0.211]

46 In order to state a refined version in the particular case of the support vector machine algorithm, we first need the following definition: Definition 5 We say that a distribution P with ρ as marginal density of X w. [sent-90, score-0.268]

47 Lebesgue measure has a low density exponent γ ≥ 0 if there exists (c2, ǫ0) ∈ (0, +∞)² such that, for all ǫ ∈ [0, ǫ0], P({x ∈ Rd : ρ(x) ≤ ǫ}) ≤ c2 ǫ^γ. [sent-93, score-0.206]

48 We are now in position to state a quantitative relationship between the excess Rφ,0-risk and the excess R-risk in the case of support vector machines: Theorem 6 Let φ1(α) = max(1 − α, 0) be the hinge loss function, and φ2(α) = max(1 − α, 0)² be the squared hinge loss function. [sent-94, score-1.13]

49 Hence we have proved the consistency of SVM, together with upper bounds on the convergence rates, in a situation where the effect of regularization does not vanish asymptotically. [sent-96, score-0.718]

50 Another consequence of the Rφ,0-consistency of an algorithm is the L2-convergence of the function output by the algorithm to the minimizer of the Rφ,0-risk: Lemma 7 For any f ∈ M, the following holds: ‖f − fφ,0‖²_{L2} ≤ (1/λ) ( Rφ,0(f) − R∗φ,0 ). [sent-97, score-0.154]

51 This result is particularly relevant for studying algorithms whose objective is not binary classification. [sent-98, score-0.039]

52 Then we claim the following: Theorem 8 Let ρλ denote the density truncated as follows: ρλ(x) = ρ(x)/(2λ) if ρ(x) ≤ 2λ, and ρλ(x) = 1 otherwise. [sent-100, score-0.204]

53 Then, under the general conditions of Theorem 3, for σ chosen as in Equation (8), lim_{n→+∞} ‖f̂σ − ρλ‖_{L2} = 0. [sent-105, score-0.142]
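
To make Theorem 8's limit concrete, the sketch below (an illustration; the optimizer and search bounds are my own choices) computes the truncated density ρλ and checks numerically that it matches the pointwise minimizer of Lemma 1 for the hinge loss when only positive examples are present (η ≡ 1), i.e. the one-class setting.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def truncated_density(rho, lam):
    """rho_lambda from Theorem 8: rho/(2*lambda) where rho <= 2*lambda, and 1 elsewhere."""
    rho = np.asarray(rho, dtype=float)
    return np.where(rho <= 2.0 * lam, rho / (2.0 * lam), 1.0)

def pointwise_minimizer(rho_x, lam):
    """f_{phi,0}(x) of Lemma 1 with eta(x) = 1 and the hinge loss, by 1-D minimization."""
    hinge = lambda a: max(0.0, 1.0 - a)
    obj = lambda a: rho_x * hinge(a) + lam * a ** 2
    return minimize_scalar(obj, bounds=(0.0, 2.0), method="bounded").x

lam = 0.5
for rho in (0.1, 0.5, 1.5):
    print(rho, truncated_density(rho, lam), pointwise_minimizer(rho, lam))
```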

54 Then, under the general assumptions of Theorem 3, we have lim_{n→+∞} ( HP(Ĉµ) − HP(Cµ) ) = 0,  (11) for σ chosen as in Equation (8). [sent-108, score-0.142]

55 The excess-mass functional was first introduced in [10] to assess the quality of density level set estimators. [sent-109, score-0.274]

56 It is maximized by the true density level set Cµ and acts as a risk functional in the one-class framework. [sent-110, score-0.451]

57 The proof of Theorem 9 is based on the following result: if ρ̂ is a density estimator converging to the true density ρ in the L2 sense, then for any fixed 0 < µ < sup {ρ}, the excess mass of the level set of ρ̂ at level µ converges to the excess mass of Cµ. [sent-111, score-1.274]

58 In other words, as is the case in the classification framework, plug-in estimators built on L2 -consistent density estimators are consistent with respect to the excess mass. [sent-112, score-0.517]
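
The sketch below (an illustration; the kernel density estimator, grid, and level µ are my own choices, not the paper's one-class SVM estimator) builds a plug-in level set from an L2-consistent density estimate and compares its excess mass HP(C) = P(C) − µ Leb(C) with that of the true level set, in dimension one.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
X = rng.standard_normal(2000)          # sample whose density rho is standard normal
kde = gaussian_kde(X)                  # a plug-in density estimate, standing in for rho_hat

def excess_mass(in_level_set, mu, sample, lo=-5.0, hi=5.0, m=2001):
    """Approximate H_P(C) = P(C) - mu * Leb(C) for a 1-D level set C."""
    xs = np.linspace(lo, hi, m)
    leb_C = in_level_set(xs).mean() * (hi - lo)      # grid approximation of Leb(C)
    p_C = in_level_set(sample).mean()                # empirical measure of C
    return p_C - mu * leb_C

mu = 0.1
true_density = lambda x: np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
plug_in_C = lambda x: kde(x) >= mu                   # level set of the estimated density
true_C = lambda x: true_density(x) >= mu             # true level set C_mu
print(excess_mass(plug_in_C, mu, X), excess_mass(true_C, mu, X))
```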

59 By studying the relationship between the Gaussian RKHS norm and the L2 norm, it can be shown that Rφ,0(f̂φ,σ) − Rφ,σ(f̂φ,σ) = λ ( ‖f̂φ,σ‖²_{L2} − ‖f̂φ,σ‖²_{Hσ} ) ≤ 0, while the following stems from the definition of R∗φ,σ: R∗φ,σ − Rφ,σ(kσ1 ∗ fφ,0) ≤ 0. [sent-118, score-0.158]

60 Hence, controlling Rφ,0(f̂φ,σ) − R∗φ,0 boils down to controlling each of the remaining three terms in (12). [sent-119, score-0.07]

61 • The second term in (12) is usually referred to as the sample error or estimation error. [sent-120, score-0.113]

62 Using estimates of local Rademacher complexities through covering numbers for the Gaussian RKHS due to [4], the following result can be shown: Lemma 10 For any σ > 0 small enough, let f̂φ,σ be the minimizer of the Rφ,σ risk on a sample of size n, where φ is a convex loss function. [sent-122, score-0.558]

63 • In order to upper bound the fourth term in (12), the analysis of the convergence of the Gaussian RKHS norm towards the L2 norm when the bandwidth of the kernel tends to 0 leads to: Rφ,σ(kσ1 ∗ fφ,0) − Rφ,0(kσ1 ∗ fφ,0) = λ ( ‖kσ1 ∗ fφ,0‖²_{Hσ} − ‖kσ1 ∗ fφ,0‖²_{L2} ), with ‖kσ1 ∗ fφ,0‖²_{Hσ} − ‖kσ1 ∗ fφ,0‖²_{L2} ≤ (σ² / (2σ1²)) ‖fφ,0‖²_{L2} ≤ φ(0) σ² / (2λσ1²). [sent-124, score-0.776]

64 • The fifth term in (12) corresponds to the approximation error. [sent-125, score-0.061]

65 It can be shown that for any bounded function f in L1(Rd) and all σ > 0, the following holds: ‖kσ ∗ f − f‖_{L1} ≤ (1 + √d) ω(f, σ),  (13) [sent-126, score-0.043]

66 where ω(f, ·) denotes the modulus of continuity of f in the L1 norm. [sent-127, score-0.061]
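
Inequality (13) is easy to check numerically in dimension one; the sketch below (my own discretization, with an indicator function as f) compares the L1 error of smoothing by kσ against the bound (1 + √d) ω(f, σ).

```python
import numpy as np

def smoothing_error_L1(f, sigma, lo=-5.0, hi=5.0, m=4001):
    """Grid approximation of ||k_sigma * f - f||_{L1} in d = 1."""
    xs = np.linspace(lo, hi, m)
    dx = xs[1] - xs[0]
    kernel = np.exp(-xs ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum() * dx                       # discrete normalized Gaussian kernel
    smoothed = np.convolve(f(xs), kernel, mode="same") * dx
    return np.sum(np.abs(smoothed - f(xs))) * dx

indicator = lambda x: ((x >= 0.0) & (x <= 1.0)).astype(float)
sigma = 0.1
err = smoothing_error_L1(indicator, sigma)
bound = (1.0 + np.sqrt(1.0)) * 2.0 * sigma            # (1 + sqrt(d)) * omega(f, sigma), omega ~ 2*sigma here
print(err, bound)                                     # err stays below the bound
```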

67 Conclusion We have shown that consistency of learning algorithms that minimize a regularized empirical risk can be obtained even when the so-called regularization term does not asymptotically vanish, and derived the consistency of one-class SVM as a density level set estimator. [sent-129, score-1.398]

68 Our method of proof is based on an unusual decomposition of the excess risk due to the presence of the regularization term, which plays an important role in the determination of the asymptotic limit of the function that minimizes the empirical risk. [sent-130, score-0.82]

69 Although the upper bounds on the convergence rates we obtain are not optimal, they provide a first step toward the analysis of learning algorithms in this context. [sent-131, score-0.292]

70 Acknowledgments The authors are grateful to Stéphane Boucheron, Pascal Massart and Ingo Steinwart for fruitful discussions. [sent-132, score-0.034]

71 This work was supported by the ACI "Nouvelles interfaces des Mathématiques" of the French Ministry for Research, and by the IST Program of the European Community, under the Pascal Network of Excellence, IST-2002-506778. [sent-133, score-0.057]

72 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-151, score-0.514]

73 Fast rates for support vector machines using Gaussian kernels. [sent-158, score-0.254]

74 On the estimation of a probability density function by the maximum penalized likelihood method. [sent-190, score-0.215]

75 Estimation of a convex density contour in two dimensions. [sent-209, score-0.257]

76 Consistency and convergence rates of one-class svm and related algorithms. [sent-219, score-0.449]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('svm', 0.301), ('rd', 0.288), ('excess', 0.284), ('consistency', 0.243), ('rkhs', 0.221), ('risk', 0.177), ('theorem', 0.172), ('density', 0.163), ('minimizer', 0.154), ('continuity', 0.123), ('modulus', 0.123), ('regularization', 0.117), ('lebesgue', 0.115), ('hinge', 0.114), ('level', 0.111), ('norm', 0.109), ('regularized', 0.107), ('tends', 0.106), ('vanish', 0.103), ('quantile', 0.101), ('loss', 0.097), ('kernel', 0.096), ('convex', 0.094), ('measurable', 0.088), ('rademacher', 0.081), ('asymptotically', 0.08), ('bandwidth', 0.078), ('choosen', 0.078), ('hp', 0.078), ('leb', 0.078), ('vert', 0.077), ('situation', 0.077), ('nity', 0.076), ('rates', 0.075), ('convergence', 0.073), ('decreases', 0.071), ('france', 0.066), ('support', 0.066), ('converges', 0.065), ('limit', 0.065), ('lim', 0.064), ('asymptotic', 0.062), ('reproducing', 0.062), ('fi', 0.062), ('lipschitz', 0.062), ('steinwart', 0.062), ('term', 0.061), ('gaussian', 0.06), ('bartlett', 0.059), ('holds', 0.058), ('proof', 0.058), ('risks', 0.057), ('des', 0.057), ('sketch', 0.057), ('empirical', 0.057), ('hilbert', 0.056), ('upper', 0.055), ('towards', 0.054), ('underlies', 0.054), ('machines', 0.053), ('spirit', 0.053), ('estimation', 0.052), ('paris', 0.051), ('main', 0.05), ('bounds', 0.05), ('bayes', 0.05), ('stems', 0.049), ('held', 0.046), ('sparseness', 0.046), ('classi', 0.044), ('bounded', 0.043), ('statement', 0.043), ('exponent', 0.043), ('ep', 0.043), ('denoted', 0.042), ('truncated', 0.041), ('lemma', 0.041), ('springer', 0.04), ('algorithms', 0.039), ('marginal', 0.039), ('stated', 0.038), ('radial', 0.038), ('rbf', 0.037), ('max', 0.037), ('nition', 0.036), ('let', 0.036), ('estimators', 0.035), ('fourth', 0.035), ('draw', 0.035), ('sup', 0.035), ('controlling', 0.035), ('vanishing', 0.034), ('consistant', 0.034), ('fruitful', 0.034), ('aci', 0.034), ('boucheron', 0.034), ('euclidian', 0.034), ('french', 0.034), ('fth', 0.034), ('invited', 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1

2 0.22401068 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio

Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1

3 0.19145255 58 nips-2005-Divergences, surrogate loss functions and experimental design

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1

4 0.18332314 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

Author: Anatoli Juditsky, Alexander Nazin, Alexandre Tsybakov, Nicolas Vayatis

Abstract: We consider the problem of constructing an aggregated estimator from a finite class of base functions which approximately minimizes a convex risk functional under the ℓ1 constraint. For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error. 1

5 0.14867865 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari

Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1

6 0.13420558 95 nips-2005-Improved risk tail bounds for on-line algorithms

7 0.12902531 196 nips-2005-Two view learning: SVM-2K, Theory and Practice

8 0.12526633 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

9 0.12414719 112 nips-2005-Learning Minimum Volume Sets

10 0.12363494 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

11 0.12217209 50 nips-2005-Convex Neural Networks

12 0.11935195 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

13 0.11698191 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

14 0.11529931 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

15 0.11439464 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

16 0.10443542 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

17 0.1025109 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

18 0.09997049 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis

19 0.089202613 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

20 0.088765413 30 nips-2005-Assessing Approximations for Gaussian Process Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.274), (1, 0.173), (2, -0.078), (3, -0.206), (4, 0.22), (5, 0.152), (6, -0.03), (7, -0.051), (8, -0.043), (9, -0.149), (10, 0.02), (11, 0.001), (12, -0.027), (13, 0.027), (14, 0.039), (15, 0.083), (16, 0.124), (17, -0.083), (18, 0.124), (19, -0.009), (20, 0.019), (21, -0.105), (22, 0.028), (23, -0.105), (24, -0.066), (25, 0.098), (26, 0.027), (27, -0.004), (28, -0.009), (29, 0.083), (30, -0.119), (31, 0.029), (32, -0.006), (33, -0.053), (34, -0.077), (35, -0.019), (36, -0.03), (37, 0.092), (38, 0.043), (39, 0.004), (40, -0.047), (41, -0.106), (42, 0.009), (43, 0.04), (44, -0.06), (45, -0.034), (46, 0.101), (47, 0.008), (48, -0.041), (49, -0.056)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97057289 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1

2 0.75799936 58 nips-2005-Divergences, surrogate loss functions and experimental design

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1

3 0.70987457 112 nips-2005-Learning Minimum Volume Sets

Author: Clayton Scott, Robert Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P -measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P , and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P . Other than these samples, no other information is available regarding P , but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules. 1

4 0.69127625 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio

Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1

5 0.67025226 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

Author: Aurelie C. Lozano, Sanjeev R. Kulkarni, Robert E. Schapire

Abstract: We study the statistical convergence and consistency of regularized Boosting methods, where the samples are not independent and identically distributed (i.i.d.) but come from empirical processes of stationary β-mixing sequences. Utilizing a technique that constructs a sequence of independent blocks close in distribution to the original samples, we prove the consistency of the composite classifiers resulting from a regularization achieved by restricting the 1-norm of the base classifiers’ weights. When compared to the i.i.d. case, the nature of sampling manifests in the consistency result only through generalization of the original condition on the growth of the regularization parameter.

6 0.62247503 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

7 0.61797601 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

8 0.59727442 196 nips-2005-Two view learning: SVM-2K, Theory and Practice

9 0.58060545 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

10 0.53135592 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

11 0.51780075 95 nips-2005-Improved risk tail bounds for on-line algorithms

12 0.48395768 182 nips-2005-Statistical Convergence of Kernel CCA

13 0.4764328 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

14 0.46968228 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

15 0.46798518 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

16 0.4481678 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

17 0.44792157 160 nips-2005-Query by Committee Made Real

18 0.43582153 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

19 0.42155874 50 nips-2005-Convex Neural Networks

20 0.41820225 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.095), (5, 0.221), (10, 0.044), (11, 0.015), (18, 0.011), (27, 0.024), (31, 0.086), (34, 0.154), (39, 0.02), (41, 0.015), (55, 0.041), (69, 0.037), (73, 0.024), (88, 0.085), (91, 0.067)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87321293 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence

Author: Odelia Schwartz, Peter Dayan, Terrence J. Sejnowski

Abstract: The misjudgement of tilt in images lies at the heart of entertaining visual illusions and rigorous perceptual psychophysics. A wealth of findings has attracted many mechanistic models, but few clear computational principles. We adopt a Bayesian approach to perceptual tilt estimation, showing how a smoothness prior offers a powerful way of addressing much confusing data. In particular, we faithfully model recent results showing that confidence in estimation can be systematically affected by the same aspects of images that affect bias. Confidence is central to Bayesian modeling approaches, and is applicable in many other perceptual domains. Perceptual anomalies and illusions, such as the misjudgements of motion and tilt evident in so many psychophysical experiments, have intrigued researchers for decades.1–3 A Bayesian view4–8 has been particularly influential in models of motion processing, treating such anomalies as the normative product of prior information (often statistically codifying Gestalt laws) with likelihood information from the actual scenes presented. Here, we expand the range of statistically normative accounts to tilt estimation, for which there are classes of results (on estimation confidence) that are so far not available for motion. The tilt illusion arises when the perceived tilt of a center target is misjudged (ie bias) in the presence of flankers. Another phenomenon, called Crowding, refers to a loss in the confidence (ie sensitivity) of perceived target tilt in the presence of flankers. Attempts have been made to formalize these phenomena quantitatively. Crowding has been modeled as compulsory feature pooling (ie averaging of orientations), ignoring spatial positions.9, 10 The tilt illusion has been explained by lateral interactions11, 12 in populations of orientationtuned units; and by calibration.13 However, most models of this form cannot explain a number of crucial aspects of the data. First, the geometry of the positional arrangement of the stimuli affects attraction versus repulsion in bias, as emphasized by Kapadia et al14 (figure 1A), and others.15, 16 Second, Solomon et al. recently measured bias and sensitivity simultaneously.11 The rich and surprising range of sensitivities, far from flat as a function of flanker angles (figure 1B), are outside the reach of standard models. Moreover, current explanations do not offer a computational account of tilt perception as the outcome of a normative inference process. Here, we demonstrate that a Bayesian framework for orientation estimation, with a prior favoring smoothness, can naturally explain a range of seemingly puzzling tilt data. We explicitly consider both the geometry of the stimuli, and the issue of confidence in the esti- 6 5 4 3 2 1 0 -1 -2 (B) Attraction Repulsion Sensititvity (1/deg) Bias (deg) (A) 0.6 0.5 0.4 0.3 0.2 0.1 -80 -60 -40 -20 0 20 40 60 80 Flanker tilt (deg) Figure 1: Tilt biases and sensitivities in visual perception. (A) Kapadia et al demonstrated the importance of geometry on tilt bias, with bar stimuli in the fovea (and similar results in the periphery). When 5 degrees clockwise flankers are arranged colinearly, the center target appears attracted in the direction of the flankers; when flankers are lateral, the target appears repulsed. Data are an average of 5 subjects.14 (B) Solomon et al measured both biases and sensitivities for gratings in the visual periphery.11 On the top are example stimuli, with flankers tilted 22.5 degrees clockwise. 
This constitutes the classic tilt illusion, with a repulsive bias percept. In addition, sensitivities vary as a function of flanker angles, in a systematic way (even in cases when there are no biases at all). Sensitivities are given in units of the inverse of standard deviation of the tilt estimate. More detailed data for both experiments are shown in the results section. mation. Bayesian analyses have most frequently been applied to bias. Much less attention has been paid to the equally important phenomenon of sensitivity. This aspect of our model should be applicable to other perceptual domains. In section 1 we formulate the Bayesian model. The prior is determined by the principle of creating a smooth contour between the target and flankers. We describe how to extract the bias and sensitivity. In section 2 we show experimental data of Kapadia et al and Solomon et al, alongside the model simulations, and demonstrate that the model can account for both geometry, and bias and sensitivity measurements in the data. Our results suggest a more unified, rational, approach to understanding tilt perception. 1 Bayesian model Under our Bayesian model, inference is controlled by the posterior distribution over the tilt of the target element. This comes from the combination of a prior favoring smooth configurations of the flankers and target, and the likelihood associated with the actual scene. A complete distribution would consider all possible angles and relative spatial positions of the bars, and marginalize the posterior over all but the tilt of the central element. For simplicity, we make two benign approximations: conditionalizing over (ie clamping) the angles of the flankers, and exploring only a small neighborhood of their positions. We now describe the steps of inference. Smoothness prior: Under these approximations, we consider a given actual configuration (see fig 2A) of flankers f1 = (φ1 , x1 ), f2 = (φ2 , x2 ) and center target c = (φc , xc ), arranged from top to bottom. We have to generate a prior over φc and δ1 = x1 − xc and δ2 = x2 − xc based on the principle of smoothness. As a less benign approximation, we do this in two stages: articulating a principle that determines a single optimal configuration; and generating a prior as a mixture of a Gaussian about this optimum and a uniform distribution, with the mixing proportion of the latter being determined by the smoothness of the optimum. Smoothness has been extensively studied in the computer vision literature.17–20 One widely (B) (C) f1 f1 β1 R Probability max smooth Max smooth target (deg) (A) 40 20 0 -20 c δ1 c -40 Φc f2 f2 1 0.8 0.6 0.4 0.2 0 -80 -60 -40 -20 0 20 40 Flanker tilt (deg) 60 80 -80 -60 -40 20 0 20 40 Flanker tilt (deg) 60 80 Figure 2: Geometry and smoothness for flankers, f1 and f2 , and center target, c. (A) Example actual configuration of flankers and target, aligned along the y axis from top to bottom. (B) The elastica procedure can rotate the target angle (to Φc ) and shift the relative flanker and target positions on the x axis (to δ1 and δ2 ) in its search for the maximally smooth solution. Small spatial shifts (up to 1/15 the size of R) of positions are allowed, but positional shift is overemphasized in the figure for visibility. (C) Top: center tilt that results in maximal smoothness, as a function of flanker tilt. Boxed cartoons show examples for given flanker tilts, of the optimally smooth configuration. 
Note attraction of target towards flankers for small flanker angles; here flankers and target are positioned in a nearly colinear arrangement. Note also repulsion of target away from flankers for intermediate flanker angles. Bottom: P [c, f1 , f2 ] for center tilt that yields maximal smoothness. The y axis is normalized between 0 and 1. used principle, elastica, known even to Euler, has been applied to contour completion21 and other computer vision applications.17 The basic idea is to find the curve with minimum energy (ie, square of curvature). Sharon et al19 showed that the elastica function can be well approximated by a number of simpler forms. We adopt a version that Leung and Malik18 adopted from Sharon et al.19 We assume that the probability for completing a smooth curve, can be factorized into two terms: P [c, f1 , f2 ] = G(c, f1 )G(c, f2 ) (1) with the term G(c, f1 ) (and similarly, G(c, f2 )) written as: R Dβ 2 2 Dβ = β1 + βc − β1 βc (2) − ) where σR σβ and β1 (and similarly, βc ) is the angle between the orientation at f1 , and the line joining f1 and c. The distance between the centers of f1 and c is given by R. The two constants, σβ and σR , control the relative contribution to smoothness of the angle versus the spatial distance. Here, we set σβ = 1, and σR = 1.5. Figure 2B illustrates an example geometry, in which φc , δ1 , and δ2 , have been shifted from the actual scene (of figure 2A). G(c, f1 ) = exp(− We now estimate the smoothest solution for given configurations. Figure 2C shows for given flanker tilts, the center tilt that yields maximal smoothness, and the corresponding probability of smoothness. For near vertical flankers, the spatial lability leads to very weak attraction and high probability of smoothness. As the flanker angle deviates farther from vertical, there is a large repulsion, but also lower probability of smoothness. These observations are key to our model: the maximally smooth center tilt will influence attractive and repulsive interactions of tilt estimation; the probability of smoothness will influence the relative weighting of the prior versus the likelihood. From the smoothness principle, we construct a two dimensional prior (figure 3A). One dimension represents tilt, the other dimension, the overall positional shift between target (B) Likelihood (D) Marginalized Posterior (C) Posterior 20 0.03 10 -10 -20 0 Probability 0 10 Angle Angle Angle 10 0 -10 -20 0.01 -10 -20 0.02 0 -0. 2 0 Position 0.2 (E) Psychometric function 20 -0. 2 0 0.2 -0. 2 0 0.2 Position Position -10 -5 0 Angle 5 10 Probability clockwise (A) Prior 20 1 0.8 0.6 0.4 0.2 0 -20 -10 0 10 20 Target angle (deg) Counter-clockwise Clockwise Figure 3: Bayes model for example flankers and target. (A) Prior 2D distribution for flankers set at 22.5 degrees (note repulsive preference for -5.5 degrees). (B) Likelihood 2D distribution for a target tilt of 3 degrees; (C) Posterior 2D distribution. All 2D distributions are drawn on the same grayscale range, and the presence of a larger baseline in the prior causes it to appear more dimmed. (D) Marginalized posterior, resulting in 1D distribution over tilt. Dashed line represents the mean, with slight preference for negative angle. (E) For this target tilt, we calculate probability clockwise, and obtain one point on psychometric curve. and flankers (called ’position’). 
The prior is a 2D Gaussian distribution, sat upon a constant baseline.22 The Gaussian is centered at the estimated smoothest target angle and relative position, and the baseline is determined by the probability of smoothness. The baseline, and its dependence on the flanker orientation, is a key difference from Weiss et al’s Gaussian prior for smooth, slow motion. It can be seen as a mechanism to allow segmentation (see Posterior description below). The standard deviation of the Gaussian is a free parameter. Likelihood: The likelihood over tilt and position (figure 3B) is determined by a 2D Gaussian distribution with an added baseline.22 The Gaussian is centered at the actual target tilt; and at a position taken as zero, since this is the actual position, to which the prior is compared. The standard deviation and baseline constant are free parameters. Posterior and marginalization: The posterior comes from multiplying likelihood and prior (figure 3C) and then marginalizing over position to obtain a 1D distribution over tilt. Figure 3D shows an example in which this distribution is bimodal. Other likelihoods, with closer agreement between target and smooth prior, give unimodal distributions. Note that the bimodality is a direct consequence of having an added baseline to the prior and likelihood (if these were Gaussian without a baseline, the posterior would always be Gaussian). The viewer is effectively assessing whether the target is associated with the same object as the flankers, and this is reflected in the baseline, and consequently, in the bimodality, and confidence estimate. We define α as the mean angle of the 1D posterior distribution (eg, value of dashed line on the x axis), and β as the height of the probability distribution at that mean angle (eg, height of dashed line). The term β is an indication of confidence in the angle estimate, where for larger values we are more certain of the estimate. Decision of probability clockwise: The probability of a clockwise tilt is estimated from the marginalized posterior: 1 P = 1 + exp (3) −α.∗k − log(β+η) where α and β are defined as above, k is a free parameter and η a small constant. Free parameters are set to a single constant value for all flanker and center configurations. Weiss et al use a similar compressive nonlinearity, but without the term β. We also tried a decision function that integrates the posterior, but the resulting curves were far from the sigmoidal nature of the data. Bias and sensitivity: For one target tilt, we generate a single probability and therefore a single point on the psychometric function relating tilt to the probability of choosing clockwise. We generate the full psychometric curve from all target tilts and fit to it a cumulative 60 40 20 -5 0 5 Target tilt (deg) 10 80 60 40 20 0 -10 (C) Data -5 0 5 Target tilt (deg) 10 80 60 40 20 0 -10 (D) Model 100 100 100 80 0 -10 Model Frequency responding clockwise (B) Data Frequency responding clockwise Frequency responding clockwise Frequency responding clockwise (A) 100 -5 0 5 Target tilt (deg) 10 80 60 40 20 0 -10 -5 0 5 10 Target tilt (deg) Figure 4: Kapadia et al data,14 versus Bayesian model. Solid lines are fits to a cumulative Gaussian distribution. (A) Flankers are tilted 5 degrees clockwise (black curve) or anti-clockwise (gray) of vertical, and positioned spatially in a colinear arrangement. The center bar appears tilted in the direction of the flankers (attraction), as can be seen by the attractive shift of the psychometric curve. 
The boxed stimuli cartoon illustrates a vertical target amidst the flankers. (B) Model for colinear bars also produces attraction. (C) Data and (D) model for lateral flankers results in repulsion. All data are collected in the fovea for bars. Gaussian distribution N (µ, σ) (figure 3E). The mean µ of the fit corresponds to the bias, 1 and σ to the sensitivity, or confidence in the bias. The fit to a cumulative Gaussian and extraction of these parameters exactly mimic psychophysical procedures.11 2 Results: data versus model We first consider the geometry of the center and flanker configurations, modeling the full psychometric curve for colinear and parallel flanks (recall that figure 1A showed summary biases). Figure 4A;B demonstrates attraction in the data and model; that is, the psychometric curve is shifted towards the flanker, because of the nature of smooth completions for colinear flankers. Figure 4C;D shows repulsion in the data and model. In this case, the flankers are arranged laterally instead of colinearly. The smoothest solution in the model arises by shifting the target estimate away from the flankers. This shift is rather minor, because the configuration has a low probability of smoothness (similar to figure 2C), and thus the prior exerts only a weak effect. The above results show examples of changes in the psychometric curve, but do not address both bias and, particularly, sensitivity, across a whole range of flanker configurations. Figure 5 depicts biases and sensitivity from Solomon et al, versus the Bayes model. The data are shown for a representative subject, but the qualitative behavior is consistent across all subjects tested. In figure 5A, bias is shown, for the condition that both flankers are tilted at the same angle. The data exhibit small attraction at near vertical flanker angles (this arrangement is close to colinear); large repulsion at intermediate flanker angles of 22.5 and 45 degrees from vertical; and minimal repulsion at large angles from vertical. This behavior is also exhibited in the Bayes model (Figure 5B). For intermediate flanker angles, the smoothest solution in the model is repulsive, and the effect of the prior is strong enough to induce a significant repulsion. For large angles, the prior exerts almost no effect. Interestingly, sensitivity is far from flat in both data and model. In the data (Figure 5C), there is most loss in sensitivity at intermediate flanker angles of 22.5 and 45 degrees (ie, the subject is less certain); and sensitivity is higher for near vertical or near horizontal flankers. The model shows the same qualitative behavior (Figure 5D). In the model, there are two factors driving sensitivity: one is the probability of completing a smooth curvature for a given flanker configuration, as in Figure 2B; this determines the strength of the prior. 
The other factor is certainty in a particular center estimation; this is determined by β, derived from the posterior distribution, and incorporated into the decision stage of the model Data 5 0 -60 -40 -80 -60 -40 -20 0 20 40 Flanker tilt (deg) -20 0 20 40 Flanker tilt (deg) 60 60 80 -60 -40 0.6 0.5 0.4 0.3 0.2 0.1 -20 0 20 40 Flanker tilt (deg) 60 80 60 80 -80 -60 -40 -20 0 20 40 Flanker tilt (deg) -80 -60 -40 -20 0 20 40 Flanker tilt (deg) 60 80 -20 0 20 40 Flanker tilt (deg) 60 80 (F) Bias (deg) 10 5 0 -5 0.6 0.5 0.4 0.3 0.2 0.1 -80 (D) 10 -10 -10 80 Sensitivity (1/deg) -80 5 0 -5 -80 -80 -60 -60 -40 -40 -20 0 20 40 Flanker tilt (deg) -20 0 20 40 Flanker tilt (deg) 60 -10 80 (H) 60 80 Sensitivity (1/deg) Sensititvity (1/deg) Bias (deg) 0.6 0.5 0.4 0.3 0.2 0.1 (G) Sensititvity (1/deg) 0 -5 (C) (E) 5 -5 -10 Model (B) 10 Bias (deg) Bias (deg) (A) 10 0.6 0.5 0.4 0.3 0.2 0.1 -80 -60 -40 Figure 5: Solomon et al data11 (subject FF), versus Bayesian model. (A) Data and (B) model biases with same-tilted flankers; (C) Data and (D) model sensitivities with same-tilted flankers; (E;G) data and (F;H) model as above, but for opposite-tilted flankers (note that opposite-tilted data was collected for less flanker angles). Each point in the figure is derived by fitting a cummulative Gaussian distribution N (µ, σ) to corresponding psychometric curve, and setting bias 1 equal to µ and sensitivity to σ . In all experiments, flanker and target gratings are presented in the visual periphery. Both data and model stimuli are averages of two configurations, on the left hand side (9 O’clock position) and right hand side (3 O’clock position). The configurations are similar to Figure 1 (B), but slightly shifted according to an iso-eccentric circle, so that all stimuli are similarly visible in the periphery. (equation 3). For flankers that are far from vertical, the prior has minimal effect because one cannot find a smooth solution (eg, the likelihood dominates), and thus sensitivity is higher. The low sensitivity at intermediate angles arises because the prior has considerable effect; and there is conflict between the prior (tilt, position), and likelihood (tilt, position). This leads to uncertainty in the target angle estimation . For flankers near vertical, the prior exerts a strong effect; but there is less conflict between the likelihood and prior estimates (tilt, position) for a vertical target. This leads to more confidence in the posterior estimate, and therefore, higher sensitivity. The only aspect that our model does not reproduce is the (more subtle) sensitivity difference between 0 and +/- 5 degree flankers. Figure 5E-H depict data and model for opposite tilted flankers. The bias is now close to zero in the data (Figure 5E) and model (Figure 5F), as would be expected (since the maximally smooth angle is now always roughly vertical). Perhaps more surprisingly, the sensitivities continue to to be non-flat in the data (Figure 5G) and model (Figure 5H). This behavior arises in the model due to the strength of prior, and positional uncertainty. As before, there is most loss in sensitivity at intermediate angles. Note that to fit Kapadia et al, simulations used a constant parameter of k = 9 in equation 3, whereas for the Solomon et al. simulations, k = 2.5. This indicates that, in our model, there was higher confidence in the foveal experiments than in the peripheral ones. 
3 Discussion We applied a Bayesian framework to the widely studied tilt illusion, and demonstrated the model on examples from two different data sets involving foveal and peripheral estimation. Our results support the appealing hypothesis that perceptual misjudgements are not a consequence of poor system design, but rather can be described as optimal inference.4–8 Our model accounts correctly for both attraction and repulsion, determined by the smoothness prior and the geometry of the scene. We emphasized the issue of estimation confidence. The dataset showing how confidence is affected by the same issues that affect bias,11 was exactly appropriate for a Bayesian formulation; other models in the literature typically do not incorporate confidence in a thoroughly probabilistic manner. In fact, our model fits the confidence (and bias) data more proficiently than an account based on lateral interactions among a population of orientationtuned cells.11 Other Bayesian work, by Stocker et al,6 utilized the full slope of the psychometric curve in fitting a prior and likelihood to motion data, but did not examine the issue of confidence. Estimation confidence plays a central role in Bayesian formulations as a whole. Understanding how priors affect confidence should have direct bearing on many other Bayesian calculations such as multimodal integration.23 Our model is obviously over-simplified in a number of ways. First, we described it in terms of tilts and spatial positions; a more complete version should work in the pixel/filtering domain.18, 19 We have also only considered two flanking elements; the model is extendible to a full-field surround, whereby smoothness operates along a range of geometric directions, and some directions are more (smoothly) dominant than others. Second, the prior is constructed by summarizing the maximal smoothness information; a more probabilistically correct version should capture the full probability of smoothness in its prior. Third, our model does not incorporate a formal noise representation; however, sensitivities could be influenced both by stimulus-driven noise and confidence. Fourth, our model does not address attraction in the so-called indirect tilt illusion, thought to be mediated by a different mechanism. Finally, we have yet to account for neurophysiological data within this framework, and incorporate constraints at the neural implementation level. However, versions of our computations are oft suggested for intra-areal and feedback cortical circuits; and smoothness principles form a key part of the association field connection scheme in Li’s24 dynamical model of contour integration in V1. Our model is connected to a wealth of literature in computer vision and perception. Notably, occlusion and contour completion might be seen as the extreme example in which there is no likelihood information at all for the center target; a host of papers have shown that under these circumstances, smoothness principles such as elastica and variants explain many aspects of perception. The model is also associated with many studies on contour integration motivated by Gestalt principles;25, 26 and exploration of natural scene statistics and Gestalt,27, 28 including the relation to contour grouping within a Bayesian framework.29, 30 Indeed, our model could be modified to include a prior from natural scenes. There are various directions for the experimental test and refinement of our model. 
Most pressing is to determine bias and sensitivity for different center and flanker contrasts. As in the case of motion, our model predicts that when there is more uncertainty in the center element, prior information becomes more dominant. Another interesting test would be to design a task in which the center element is actually part of a different figure and unrelated to the flankers; our framework predicts that there would be minimal bias, because of segmentation. Our model should also be applied to other tilt-based illusions, such as the Fraser spiral and the Zöllner illusion. Finally, our model can be applied to other perceptual domains [31]; and, given the apparent similarities between the tilt illusion and the tilt after-effect, we plan to extend the model to adaptation by considering smoothness in time as well as space.

Acknowledgements

This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Serge Belongie, Leanne Chukoskie, Philip Meier and Joshua Solomon for helpful discussions.

References

[1] J J Gibson. Adaptation, after-effect, and contrast in the perception of tilted lines. Journal of Experimental Psychology, 20:553–569, 1937.
[2] C Blakemore, R H S Carpenter, and M A Georgeson. Lateral inhibition between orientation detectors in the human visual system. Nature, 228:37–39, 1970.
[3] J A Stuart and H M Burian. A study of separation difficulty: Its relationship to visual acuity in normal and amblyopic eyes. American Journal of Ophthalmology, 53:471–477, 1962.
[4] A Yuille and H H Bulthoff. Perception as Bayesian inference. In Knill and Whitman, editors, Bayesian decision theory and psychophysics, pages 123–161. Cambridge University Press, 1996.
[5] Y Weiss, E P Simoncelli, and E H Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5:598–604, 2002.
[6] A Stocker and E P Simoncelli. Constraining a Bayesian model of human visual speed perception. Adv in Neural Info Processing Systems, 17, 2004.
[7] D Kersten, P Mamassian, and A Yuille. Object perception as Bayesian inference. Annual Review of Psychology, 55:271–304, 2004.
[8] K Kording and D Wolpert. Bayesian integration in sensorimotor learning. Nature, 427:244–247, 2004.
[9] L Parkes, J Lund, A Angelucci, J Solomon, and M Morgan. Compulsory averaging of crowded orientation signals in human vision. Nature Neuroscience, 4:739–744, 2001.
[10] D G Pelli, M Palomares, and N J Majaj. Crowding is unlike ordinary masking: Distinguishing feature integration from detection. Journal of Vision, 4:1136–1169, 2002.
[11] J Solomon, F M Felisberti, and M Morgan. Crowding and the tilt illusion: Toward a unified account. Journal of Vision, 4:500–508, 2004.
[12] J A Bednar and R Miikkulainen. Tilt aftereffects in a self-organizing model of the primary visual cortex. Neural Computation, 12:1721–1740, 2000.
[13] C W Clifford, P Wenderoth, and B Spehar. A functional angle on some after-effects in cortical vision. Proc Biol Sci, 1454:1705–1710, 2000.
[14] M K Kapadia, G Westheimer, and C D Gilbert. Spatial distribution of contextual interactions in primary visual cortex and in visual perception. J Neurophysiology, 4:2048–262, 2000.
[15] C C Chen and C W Tyler. Lateral modulation of contrast discrimination: Flanker orientation effects. Journal of Vision, 2:520–530, 2002.
[16] I Mareschal, M P Sceniak, and R M Shapley. Contextual influences on orientation discrimination: binding local and global cues. Vision Research, 41:1915–1930, 2001.
[17] D Mumford. Elastica and computer vision. In Chandrajit Bajaj, editor, Algebraic geometry and its applications. Springer Verlag, 1994.
[18] T K Leung and J Malik. Contour continuity in region based image segmentation. In Proc. ECCV, pages 544–559, 1998.
[19] E Sharon, A Brandt, and R Basri. Completion energies and scale. IEEE Pat. Anal. Mach. Intell., 22(10), 1997.
[20] S W Zucker, C David, A Dobbins, and L Iverson. The organization of curve detection: coarse tangent fields. Computer Graphics and Image Processing, 9(3):213–234, 1988.
[21] S Ullman. Filling in the gaps: the shape of subjective contours and a model for their generation. Biological Cybernetics, 25:1–6, 1976.
[22] G E Hinton and A D Brown. Spiking Boltzmann machines. Adv in Neural Info Processing Systems, 12, 1998.
[23] R A Jacobs. What determines visual cue reliability? Trends in Cognitive Sciences, 6:345–350, 2002.
[24] Z Li. A saliency map in primary visual cortex. Trends in Cognitive Science, 6:9–16, 2002.
[25] D J Field, A Hayes, and R F Hess. Contour integration by the human visual system: evidence for a local “association field”. Vision Research, 33:173–193, 1993.
[26] J Beck, A Rosenfeld, and R Ivry. Line segregation. Spatial Vision, 4:75–101, 1989.
[27] M Sigman, G A Cecchi, C D Gilbert, and M O Magnasco. On a common circle: Natural scenes and Gestalt rules. PNAS, 98(4):1935–1940, 2001.
[28] S Mahumad, L R Williams, K K Thornber, and K Xu. Segmentation of multiple salient closed contours from real images. IEEE Pat. Anal. Mach. Intell., 25(4):433–444, 1997.
[29] W S Geisler, J S Perry, B J Super, and D P Gallogly. Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 6:711–724, 2001.
[30] J H Elder and R M Goldberg. Ecological statistics of Gestalt laws for the perceptual organization of contours. Journal of Vision, 4:324–353, 2002.
[31] S R Lehky and T J Sejnowski. Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. Journal of Neuroscience, 10:2281–2299, 1990.
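
As a toy numerical illustration of the Bayesian computation sketched in the discussion above, the snippet below combines a Gaussian likelihood for the center tilt with a Gaussian "smoothness" prior centered on the flanker orientation, and reads off both the bias (posterior mean) and the confidence (posterior spread). The functional forms, widths and tilt values are illustrative assumptions rather than the authors' actual model; in particular, this simple prior produces only attraction, whereas the full geometry-dependent prior also accounts for repulsion.

```python
# Toy sketch: posterior over the center tilt given a noisy measurement and a
# context-dependent prior induced by the flanker orientation.
import numpy as np

theta = np.linspace(-45.0, 45.0, 901)  # candidate center tilts, in degrees

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

center_tilt, flanker_tilt = 0.0, 15.0
likelihood = gaussian(theta, center_tilt, 6.0)   # noisy evidence about the center
prior = gaussian(theta, flanker_tilt, 20.0)      # smoothness with the flankers

posterior = likelihood * prior
posterior /= posterior.sum()

mean = (theta * posterior).sum()                        # perceived tilt (bias)
std = np.sqrt(((theta - mean) ** 2 * posterior).sum())  # spread (inverse confidence)

print(f"perceived tilt: {mean:.1f} deg (attracted toward the flanker at {flanker_tilt:.0f} deg)")
print(f"posterior spread: {std:.1f} deg (narrower spread = higher confidence)")
```

Making the likelihood broader (more center uncertainty) shifts the posterior mean further toward the flanker and widens the spread, which is the qualitative prediction mentioned for low-contrast centers.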

same-paper 2 0.86886317 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
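
A small illustration of the one-class SVM used as a density level set estimator follows. This is a sketch assuming scikit-learn's OneClassSVM with a Gaussian RBF kernel, not the authors' code; its nu parameter plays the quantile-fixing role of the held-fixed regularization discussed in the abstract, and gamma encodes the kernel bandwidth.

```python
# Sketch: estimate a density level set by thresholding the one-class SVM output.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # toy sample from a 2-D Gaussian

# gamma = 1 / (2 * sigma^2) sets the Gaussian kernel bandwidth; nu fixes the
# fraction of points allowed outside the estimated level set.
sigma = 0.5
model = OneClassSVM(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), nu=0.1)
model.fit(X)

# Thresholding the learned function at 0 yields a region that should contain
# roughly a (1 - nu) fraction of the data.
inside = model.predict(X) == 1
print(f"fraction of points inside the estimated level set: {inside.mean():.2f}")
```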

3 0.84011137 161 nips-2005-Radial Basis Function Network for Multi-task Learning

Author: Xuejun Liao, Lawrence Carin

Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1
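
A minimal sketch of the shared-basis idea behind a multi-task RBF network is given below. The setup is an assumed toy version with fixed, shared centers and task-specific least-squares output weights; the paper's actual algorithms additionally learn the network structure and actively select training data.

```python
# Sketch: RBF features shared across tasks, with per-task linear output weights.
import numpy as np

def rbf_features(X, centers, sigma):
    # Gaussian responses of each shared hidden unit to each input.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
centers = rng.uniform(-1, 1, size=(10, 1))  # shared hidden units
sigma = 0.3

# Two correlated toy tasks: the same underlying function plus different offsets.
tasks = []
for offset in (0.0, 0.5):
    X = rng.uniform(-1, 1, size=(40, 1))
    y = np.sin(3 * X[:, 0]) + offset + 0.1 * rng.normal(size=40)
    tasks.append((X, y))

# Task-specific output weights fitted on top of the shared representation.
weights = [np.linalg.lstsq(rbf_features(X, centers, sigma), y, rcond=None)[0]
           for X, y in tasks]

X_test = np.linspace(-1, 1, 5)[:, None]
for t, w in enumerate(weights):
    print(f"task {t} predictions:", rbf_features(X_test, centers, sigma) @ w)
```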

4 0.71719551 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

Author: John D. Lafferty, Pradeep K. Ravikumar

Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1
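
For readers unfamiliar with preconditioners, the generic sketch below shows the linear-algebra property the approach builds on: a good, cheaply invertible preconditioner makes a matrix far better conditioned. This uses a simple Jacobi (diagonal) preconditioner as an illustration only; it is not the paper's graphical-model bounds.

```python
# Sketch: Jacobi preconditioning reduces the condition number of a badly scaled
# symmetric positive-definite matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 50
diag = 2.0 + 10.0 ** rng.uniform(0, 3, size=n)  # widely varying diagonal scales
A = np.diag(diag)
for i in range(n - 1):                           # weak off-diagonal coupling
    A[i, i + 1] = A[i + 1, i] = -1.0

d = np.sqrt(np.diag(A))
A_precond = A / np.outer(d, d)                   # D^{-1/2} A D^{-1/2}

print("condition number of A:           ", np.linalg.cond(A))
print("condition number after Jacobi:   ", np.linalg.cond(A_precond))
```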

5 0.70860511 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T-stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-shelf methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1
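
The core reduction can be illustrated with a hypothetical single-stage example: per-action rewards observed at sampled states are turned into a weighted classification data set (label = better action, weight = reward gap), and an off-the-shelf classifier then serves as the policy. This sketch assumes scikit-learn's DecisionTreeClassifier and omits the T-stage Gauss-Seidel sweeps described in the abstract.

```python
# Sketch: single-stage policy search via reduction to weighted classification.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))           # sampled states
r0 = -X[:, 0] + 0.1 * rng.normal(size=200)      # toy reward of action 0
r1 = X[:, 0] + 0.1 * rng.normal(size=200)       # toy reward of action 1

labels = (r1 > r0).astype(int)   # which action looked better at each state
weights = np.abs(r1 - r0)        # cost of picking the wrong action there

policy = DecisionTreeClassifier(max_depth=3)
policy.fit(X, labels, sample_weight=weights)

# The trained classifier is used directly as the (deterministic) policy.
print("action chosen at state (0.5, 0):", policy.predict([[0.5, 0.0]])[0])
```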

6 0.69635963 50 nips-2005-Convex Neural Networks

7 0.69608349 184 nips-2005-Structured Prediction via the Extragradient Method

8 0.69042617 58 nips-2005-Divergences, surrogate loss functions and experimental design

9 0.68578804 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

10 0.68089682 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

11 0.68027616 105 nips-2005-Large-Scale Multiclass Transduction

12 0.67477918 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

13 0.67264092 144 nips-2005-Off-policy Learning with Options and Recognizers

14 0.67095315 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

15 0.67020476 177 nips-2005-Size Regularized Cut for Data Clustering

16 0.66916251 178 nips-2005-Soft Clustering on Graphs

17 0.66815394 46 nips-2005-Consensus Propagation

18 0.66750407 112 nips-2005-Learning Minimum Volume Sets

19 0.66737634 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

20 0.66633326 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget