
126 nips-2003-Measure Based Regularization

Source: pdf

Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein

Abstract: We address in this paper the question of how the knowledge of the marginal distribution P(x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We address in this paper the question of how the knowledge of the marginal distribution P(x) can be incorporated in a learning algorithm. [sent-4, score-0.035]

2 We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. [sent-5, score-0.218]

3 A typical assumption that is (at least implicitly) used in many learning algorithms is the following: two points that are close in input space should have the same label. [sent-12, score-0.136]

4 One possible way to enforce this assumption is to look for a decision function which is consistent with the training data and which does not change too much between neighboring points. [sent-13, score-0.177]

5 This can be done in a regularization setting, using the Lipschitz norm as a regularizer. [sent-14, score-0.259]

6 For diﬀerentiable functions, the Lipschitz norm of a function is the supremum of the norm of the gradient. [sent-15, score-0.226]

7 It is thus natural to consider algorithms of the form $\min_f \sup_x \|\nabla f(x)\|$ under constraints $y_i f(x_i) \ge 1$ (1). [sent-16, score-0.076]

8 Performing such a minimization on the set of linear functions leads to the maximum margin solution (since the gradient of $x \mapsto \langle w, x \rangle$ is $w$), whereas the 1-nearest neighbor decision function is one of the solutions of the above optimization problem when the set of functions is unconstrained [13]. [sent-17, score-0.199]

9 Although very useful because widely applicable, the above assumption is sometimes too weak. [sent-18, score-0.084]

10 Indeed, most ’real-world’ learning problems have more structure than what this assumption captures. [sent-19, score-0.084]

11 For example, most data is located in regions where the label is constant (clusters) and regions where the label is not well-deﬁned are typically of low density. [sent-20, score-0.2]

12 This can be formulated via the so-called cluster assumption: two points that are connected by a line that goes through high density regions should have the same label. Another related way of stating this assumption is to say that the decision boundary should lie in regions of low density. [sent-21, score-0.63]

13 Our goal is to propose possible implementations of this assumption. [sent-22, score-0.035]

14 We will thus try to propose a principled approach to this problem. [sent-24, score-0.035]

15 A similar attempt was made in [10] but in a probabilistic context, where the decision function was modeled by a conditional probability distribution, while here we consider arbitrary real-valued functions and use the standard regularization approach. [sent-25, score-0.233]

16 We will use three methods for obtaining regularizers that depend on the distribution P (x) of the data. [sent-26, score-0.083]

17 In section 2 we suggest to modify the regularizer in a general way by weighting it with the data density. [sent-27, score-0.579]

18 Then in section 3 we adopt a geometric approach where we suggest to modify the distances in input space (in a local manner) to take into account the density (i. [sent-28, score-0.309]

19 we stretch or blow up the space depending on the density). [sent-30, score-0.03]

20 The third approach presented in section 4 builds on spectral methods. [sent-31, score-0.061]

21 The idea is to look for the analogue of graph-based spectral methods when the amount of available data is inﬁnite. [sent-32, score-0.061]

22 Finally, in section 5 we give a practical method for implementing one of the proposed regularizers and show its application on a toy problem. [sent-34, score-0.208]

23 2 Density based regularization The first approach we propose is to start with a gradient-based regularizer like $\|\nabla f\|$ which penalizes large variations of the function. [sent-35, score-0.695]

24 Now, to implement the cluster assumption one has to penalize more the variations of the function in high density regions and less in low density regions. [sent-36, score-0.549]

25 A natural way of doing this is to replace $\|\nabla f\|$ by $\|p \nabla f\|$ where p is the density of the marginal distribution P. [sent-37, score-0.187]
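To make the effect of the density weighting concrete, here is a minimal numerical sketch (not the authors' implementation). It approximates the squared L2 norm of p ∇f, that is ∫ (p(x) f'(x))² dx, on a 1D grid. The two-cluster density, the tanh test functions, and all function names are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def two_cluster_density(x):
    """Hypothetical marginal density: two clusters centred at -2 and +2."""
    return 0.5 * gaussian_pdf(x, -2.0, 0.5) + 0.5 * gaussian_pdf(x, 2.0, 0.5)

def weighted_gradient_penalty(f, p, lo=-6.0, hi=6.0, n=2400):
    """Midpoint-rule approximation of the integral of (p(x) * f'(x))^2 over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        df = (f(x + 0.5 * h) - f(x - 0.5 * h)) / h  # central difference
        total += (p(x) * df) ** 2 * h
    return total

# A function that changes sign in the low density gap between the clusters ...
penalty_low = weighted_gradient_penalty(lambda x: math.tanh(5.0 * x), two_cluster_density)
# ... versus one that changes sign in the middle of a cluster.
penalty_high = weighted_gradient_penalty(lambda x: math.tanh(5.0 * (x - 2.0)), two_cluster_density)
```

Under these assumptions the function that varies inside a cluster receives a far larger penalty than the one that varies in the gap, which is the behaviour the cluster assumption asks for.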

26 An interesting case is when the norm in (2) is chosen as the L2 norm. [sent-39, score-0.113]

27 Then, Ω(f) can be the norm of a Reproducing Kernel Hilbert Space (RKHS), which means that there exist a Hilbert space $\mathcal{H}$ and a kernel function $k : \mathcal{X}^2 \to \mathbb{R}$ such that $\langle f, f \rangle_{\mathcal{H}} = \Omega(f)$ and $\langle f, k(x, \cdot) \rangle_{\mathcal{H}} = f(x)$ (3). [sent-40, score-0.172]

28 The reason for using an RKHS norm is the so-called representer theorem [5]: the function minimizing the corresponding regularized loss can be expressed as a linear combination of the kernel function evaluated at the labeled points. [sent-41, score-0.319]

29 However, it is not straightforward to ﬁnd the kernel associated with an RKHS norm. [sent-42, score-0.059]

30 For instance, in the case $L(f) = (f^2 + \|\nabla f\|^2)^{1/2}$ and without taking the density into account (χ = 1), it has been shown in [3] that the corresponding kernel is the Laplacian one, $k(x, y) = \exp(-\|x - y\|_{L_1})$, with associated inner product $\langle f, g \rangle_{\mathcal{H}} = \langle f, g \rangle_{L_2} + \langle \nabla f, \nabla g \rangle_{L_2}$. [sent-44, score-0.287]

31 Taking the density into account, this inner product becomes $\langle f, g \rangle_{\mathcal{H}} = \langle f, \chi^2(p)\, g \rangle_{L_2} + \langle \nabla f, \chi^2(p)\, \nabla g \rangle_{L_2}$. [sent-45, score-0.187]

32 Since ﬁnding the kernel function associated to a regularizer is, in general, a diﬃcult problem, we propose to perform the minimization of the regularized loss on a ﬁxed set of basis functions, i. [sent-52, score-0.743]

33 f is expressed as a linear combination of functions $\varphi_i$ (4). [sent-54, score-0.045]

34 We will present in section 5 a practical implementation of this approach. [sent-56, score-0.038]

35 3 Density based change of geometry We now try to adopt a geometric point of view. [sent-57, score-0.13]

36 First we translate the cluster assumption into a geometric statement, then we explore how to enforce it by changing the geometry of our underlying space. [sent-58, score-0.369]

37 We will see that there exists such a change of geometry which leads to the same type of regularizer that was proposed in section 2. [sent-60, score-0.593]

38 Recall that the cluster assumption states that points are likely to be in the same class if they can be connected by a path through high density regions. [sent-61, score-0.449]

39 Naturally this means that we have to weight paths according to the density they are going through. [sent-62, score-0.184]

40 This leads to introducing a new distance measure on the input space (typically Rd ) deﬁned as the length of the shortest weighted path connecting two points. [sent-63, score-0.208]

41 With this new distance, we simply have to enforce that close points have the same label (we thus recover the standard assumption). [sent-64, score-0.146]

42 We consider the Euclidean space $\mathbb{R}^d$ as a flat Riemannian manifold with metric tensor δ, denoted by $(\mathbb{R}^d, \delta)$. [sent-66, score-0.207]

43 The only information we have is the local density p(x), which is a scalar at every point and as such can only lead to an isotropic transformation in the tangent space Tx M. [sent-68, score-0.185]

44 Therefore we consider the following conformal transformation of the metric δ: $\delta_{ij} \to g_{ij} = \frac{1}{\chi(p(x))} \delta_{ij}$ (6), where χ is a strictly increasing function. [sent-69, score-0.195]

45 We denote by (Rd , g) the distorted euclidean space. [sent-70, score-0.084]

46 Note that this kind of transformation also changes the volume element: $\sqrt{g}\, dx_1 \dots dx_d = \chi(p)^{-d/2}\, dx_1 \dots dx_d$ (7). [sent-71, score-0.076]

47 In the following we will choose χ(x) = x, which is the simplest choice that gives the desired properties. [sent-84, score-0.244]

48 The distance structure of the transformed space implements now the cluster assumption, since we see from (5) that all paths get weighted by the inverse density. [sent-85, score-0.247]

49 Therefore we can use any metric based classiﬁcation method and it will automatically take into account the density of the data. [sent-86, score-0.25]

50 For example the nearest neighbor classiﬁer in the new distance is equivalent to the Lipschitz regularization (1) weighted with the density proposed in the last section. [sent-87, score-0.404]

51 However, implementing such a method requires computing the geodesic distance in $(\mathbb{R}^d, g)$, which is nontrivial for arbitrary densities p. [sent-88, score-0.25]

52 We suggest the following approximation which is similar in spirit to the approach in [11]. [sent-89, score-0.031]

53 The geodesic distance can then be approximated by the shortest path along the obtained graph. [sent-91, score-0.266]
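The shortest-path approximation can be sketched as follows. This is an illustrative reading rather than the paper's exact construction, since the sentence describing how the graph is built is not part of this summary. The sketch assumes a neighborhood graph that links points closer than `radius`, and it assumes each edge has length equal to its Euclidean length divided by the square root of the density at the edge midpoint, which is what the metric g_ij = δ_ij / p gives for χ(p) = p. The function name and the toy density are hypothetical.

```python
import heapq
import math

def density_geodesic(points, density, source, radius):
    """Dijkstra shortest paths on a radius graph with density-rescaled edge lengths."""
    n = len(points)
    dist = [math.inf] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale heap entry
        for j in range(n):
            if j == i:
                continue
            euclid = math.dist(points[i], points[j])
            if euclid > radius:
                continue  # not neighbours in the assumed graph
            mid = [(a + b) / 2.0 for a, b in zip(points[i], points[j])]
            # Length under g = delta / p: Euclidean length / sqrt(p at midpoint).
            weight = euclid / math.sqrt(density(mid))
            if d + weight < dist[j]:
                dist[j] = d + weight
                heapq.heappush(heap, (dist[j], j))
    return dist

# Toy data: points on a line, with a low density gap for 4 < x < 6.
points = [(0.5 * k, 0.0) for k in range(21)]
gap_density = lambda x: 0.01 if 4.0 < x[0] < 6.0 else 1.0
dist = density_geodesic(points, gap_density, source=0, radius=0.6)
```

In this toy setting the point at x = 6 ends up much farther from the source than its Euclidean distance suggests, because the path has to cross the low density gap.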

54 We now show the relationship to the regularization based approach of the previous section. [sent-92, score-0.146]

55 We denote by $\|\cdot\|_{L_2(\mathbb{R}^d, g, \Sigma)}$ the $L_2$ norm in $(\mathbb{R}^d, g)$ with respect to the measure Σ and by µ the standard Lebesgue measure on $\mathbb{R}^d$. [sent-93, score-0.191]

56 Let us consider the regularizer $\|\nabla f\|^2_{L_2(\mathbb{R}^d, \delta, \mu)}$ which is the standard $L_2$ norm of the gradient. [sent-94, score-0.627]

57 Now modifying this regularizer according to section 2 (by changing the underlying measure) gives $S(f) = \|\nabla f\|^2_{L_2(\mathbb{R}^d, \delta, P)}$. [sent-95, score-0.568]

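58 On the distorted space $(\mathbb{R}^d, g)$ we keep the Lebesgue measure µ, which can be done by integrating on the manifold with respect to the density $\sigma = 1/\sqrt{g} = p^{d/2}$, which then cancels with the volume element: $\sigma \sqrt{g}\, dx_1 \dots dx_d = dx_1 \dots dx_d$. On $(\mathbb{R}^d, g)$ the squared norm of the gradient is $\|\nabla f\|^2 = p(x)\, \delta^{ij} \frac{\partial f}{\partial x_i} \frac{\partial f}{\partial x_j}$. [sent-96, score-0.448]

59 $S(f) = \|\nabla f\|^2_{L_2(\mathbb{R}^d, \delta, P)} = \int_{\mathbb{R}^d} p(x)\, \delta^{ij} \frac{\partial f}{\partial x_i} \frac{\partial f}{\partial x_j}\, dx_1 \dots dx_d = \|\nabla f\|^2_{L_2(\mathbb{R}^d, g, \mu)}$ (9) [sent-104, score-0.048]

60 This shows that modifying the measure and keeping the geometry, or modifying the geometry and keeping the Lebesgue measure, leads to the same regularizer S(f). [sent-107, score-1.023]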

61 Indeed, for regularization operators corresponding to higher order derivatives the above correspondence is not valid any more. [sent-109, score-0.185]

62 4 Link with Spectral Techniques Recently, there has been a lot of interest in spectral techniques for non linear dimension reduction, clustering or semi-supervised learning. [sent-110, score-0.109]

63 The general idea of these approaches is to construct an adjacency graph on the (unlabeled) points whose weights are given by a matrix W . [sent-111, score-0.095]

64 Then the ﬁrst eigenvectors of a modiﬁed version of W give a more suitable representation of the points (taking into account their manifold and/or cluster structure). [sent-112, score-0.259]

65 instances sampled according to P(x), it is possible to rewrite (10) after normalization as the following random variable: $U_f = \frac{1}{2m(m-1)} \sum_{i,j} (f(x_i) - f(x_j))^2 K(\|x_i - x_j\| / t)$. [sent-120, score-0.179]
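The normalized graph regularizer U_f is simple to evaluate directly from a sample. The snippet below is a small sketch of that computation; the function name `u_statistic` and the three-point toy sample are assumptions made for illustration, and the kernel is the indicator K(x) = 1 for x ≤ 1.

```python
import math

def u_statistic(f_values, points, kernel, t):
    """U_f = 1 / (2 m (m - 1)) * sum_{i,j} (f(x_i) - f(x_j))^2 K(||x_i - x_j|| / t)."""
    m = len(points)
    total = 0.0
    for i in range(m):
        for j in range(m):
            # The diagonal terms vanish because f(x_i) - f(x_i) = 0.
            r = math.dist(points[i], points[j])
            total += (f_values[i] - f_values[j]) ** 2 * kernel(r / t)
    return total / (2.0 * m * (m - 1))

indicator_kernel = lambda x: 1.0 if x <= 1.0 else 0.0  # unweighted neighborhood graph

points = [(0.0,), (1.0,), (3.0,)]
f_values = [0.0, 1.0, 1.0]
u_f = u_statistic(f_values, points, indicator_kernel, t=1.5)
```

Only the pair (x_1, x_2) is both within range t and labeled differently, so the double sum equals 2 and U_f equals 2 / (2 · 3 · 2) = 1/6 in this example.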

66 Under the assumption that f and K are bounded, the result of [4] (see Inequality (5. [sent-121, score-0.084]

67 This shows that for each ﬁxed function, the normalized regularizer Uf converges towards its expectation when the sample size increases. [sent-123, score-0.514]

68 (11) This is the term that should be used as a regularizer if one knows the whole distribution since it is the limit of (10). [sent-125, score-0.562]

69 The following proposition relates the regularizer (11) to the one deﬁned in (2). [sent-126, score-0.545]

70 1 If p is a density which is Lipschitz continuous and K is a continuous function on $\mathbb{R}^+$ such that $x^{2+d} K(x) \in L_2$, then for any function $f \in C^2(\mathbb{R}^d)$ with bounded Hessian, $\lim_{t \to 0} \frac{d}{C t^{2+d}} \int\!\!\int (f(x) - f(y))^2 K(\|x - y\| / t)\, p(x)\, p(y)\, dx\, dy = \int \|\nabla f(x)\|^2 p^2(x)\, dx$ (12), where $C = \int_{\mathbb{R}^d} \|x\|^2 K(\|x\|)\, dx$ (13). [sent-128, score-0.216]

71 To conclude the proof, we rewrite this last integral as $\nabla f(x)^\top \left( \int h h^\top K(\|h\|)\, dh \right) \nabla f(x) = \|\nabla f(x)\|^2 \frac{C}{d}$. [sent-131, score-0.177]

72 The last equality comes from the fact that, by symmetry considerations, $\int h h^\top K(\|h\|)\, dh$ is equal to a constant (let's call it $C_2$) times the identity matrix and this constant can be computed by $C_2 d = \operatorname{trace} \int h h^\top K(\|h\|)\, dh = \operatorname{trace} \int h^\top h\, K(\|h\|)\, dh = C$. [sent-132, score-0.274]
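The limit stated in the proposition can be checked numerically in a simple case. The sketch below is only a sanity check under assumed choices, not part of the paper: dimension d = 1, K(x) = exp(−x²/2), p the standard normal density, f = sin, and plain Riemann sums after the substitution y = x + t h. All function names and grid sizes are hypothetical.

```python
import math

def kernel(r):
    """K(x) = exp(-x^2 / 2), the Gaussian choice mentioned in the text."""
    return math.exp(-0.5 * r * r)

def density(x):
    """Assumed marginal density: standard normal in one dimension."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def grid(lo, hi, step):
    n = int(round((hi - lo) / step))
    return [lo + k * step for k in range(n + 1)]

def graph_side(f, t, dx=0.02, dh=0.05):
    """d / (C t^(2+d)) times the double integral in (12), for d = 1."""
    hs = grid(-8.0, 8.0, dh)
    ks = [kernel(abs(h)) for h in hs]
    c = sum(h * h * k for h, k in zip(hs, ks)) * dh  # C from (13)
    total = 0.0
    for x in grid(-6.0, 6.0, dx):
        fx, px = f(x), density(x)
        inner = 0.0
        for h, k in zip(hs, ks):
            y = x + t * h  # substitution y = x + t h, so dy = t dh
            inner += (fx - f(y)) ** 2 * k * density(y)
        total += px * inner
    return total * dx * dh * t / (c * t ** 3)

def density_side(df, dx=0.02):
    """Integral of |f'(x)|^2 p(x)^2, the right-hand side of (12)."""
    return sum(df(x) ** 2 * density(x) ** 2 for x in grid(-6.0, 6.0, dx)) * dx

lhs = graph_side(math.sin, t=0.02)
rhs = density_side(math.cos)
```

For a small bandwidth such as t = 0.02 the two quantities should agree closely, with the remaining gap shrinking as t decreases.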

73 Note that different K lead to different affinity matrices: if we choose $K(x) = \exp(-x^2/2)$, we get a Gaussian RBF affinity matrix as used in [7], whereas $K(x) = \mathbf{1}_{x \le 1}$ leads to an unweighted neighboring graph (at size t) [1]. [sent-133, score-0.043]

74 In [2], the authors investigated the limiting behavior of the regularizer D − W obtained from the graph and claimed that this is the empirical counterpart of the Laplace operator deﬁned on the manifold. [sent-137, score-0.597]

75 We have shown that, in the general case, the continuous equivalent of the graph Laplacian is $\nabla^* D_p^2 \nabla$. [sent-139, score-0.075]

76 5 Practical Implementation and Experiments As mentioned in section 2, it is diﬃcult in general to ﬁnd the kernel associated with a given regularizer and instead, we decided to minimize the regularized loss on a ﬁxed basis of functions (ϕi )1≤i≤l , as expressed by equation (4). [sent-140, score-0.765]

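77 The regularizer we considered is of the form (2) and is $\Omega(f) = \|\nabla f \sqrt{p}\|^2_{L_2} = \int \nabla f(x) \cdot \nabla f(x)\, p(x)\, dx$. [sent-141, score-0.514]

78 Thus, the coefficients α and b in expansion (4) are found by minimizing the following convex regularized functional: $\underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)}_{R_{emp}(f)} + \lambda \underbrace{\sum_{i,j=1}^{l} \alpha_i \alpha_j \int \nabla\varphi_i(x) \cdot \nabla\varphi_j(x)\, p(x)\, dx}_{\|L(f) \sqrt{p}\|^2_{L_2}}$. [sent-142, score-0.147]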

79 The dual formulation of this optimization problem turns out to be the standard SVM one with a modified kernel function (see also [9]): $\max_\beta \sum_{i=1}^{n} \beta_i - \frac{1}{2} \sum_{i,j=1}^{n} \beta_i \beta_j y_i y_j L_{ij}$, [sent-144, score-0.102]

80 Figure 1: Two moons toy problem: there are 2 labeled points (the cross and the triangle) and 200 unlabeled points. [sent-150, score-0.3]

81 The function was expanded on all unlabeled points (m=200 in (4)) and the widths of the gaussians have been chosen as σ = 0. [sent-152, score-0.144]

82 under constraints $0 \le \beta_i \le C$ and $\sum_i \beta_i y_i = 0$, with $L = K H^{-1} K$. [sent-155, score-0.043]

83 Once the vector β has been found, the coefficients α of the expansion are given by $\alpha = H^{-1} K \operatorname{diag}(Y) \beta$. [sent-156, score-0.042]

84 From now on, we consider a special case where this integral can be computed analytically: • The basis functions are Gaussian RBFs, $\varphi_i(x) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)$, where the points $x_1$, . [sent-158, score-0.17]

85 We decided to take the unlabeled points (or a subset of them) for this expansion. [sent-162, score-0.19]

86 • The marginal density p is estimated using a Parzen window with a Gaussian kernel, $p(x) = \frac{1}{m} \sum_{i=1}^{m} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma_p^2}\right)$. [sent-163, score-0.221]
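The text evaluates the integral ∫ ∇ϕ_i(x) · ∇ϕ_j(x) p(x) dx in closed form for this special case. As a rough alternative for experimentation, the sketch below swaps in a different technique: it replaces the integral by an average over points assumed to be drawn from P, so no explicit density estimate is needed. This is not the paper's analytic formula. The name `gradient_gram_matrix`, the reading of the result as the matrix H used in the dual, and the toy points are all assumptions.

```python
import math

def rbf(x, center, sigma):
    """Gaussian RBF basis function exp(-||x - center||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq / (2.0 * sigma ** 2))

def rbf_gradient(x, center, sigma):
    """Gradient of the RBF with respect to x: -(x - center) / sigma^2 * rbf(x)."""
    value = rbf(x, center, sigma)
    return [-(a - b) / sigma ** 2 * value for a, b in zip(x, center)]

def gradient_gram_matrix(centers, samples, sigma):
    """Sample-average estimate of H_ij = E_P[grad phi_i(x) . grad phi_j(x)]."""
    l = len(centers)
    H = [[0.0] * l for _ in range(l)]
    for x in samples:
        grads = [rbf_gradient(x, c, sigma) for c in centers]
        for i in range(l):
            for j in range(l):
                H[i][j] += sum(gi * gj for gi, gj in zip(grads[i], grads[j]))
    m = len(samples)
    return [[value / m for value in row] for row in H]

# Toy unlabeled sample, also used as the centres of the basis functions.
samples = [(0.0, 0.0), (0.4, 0.1), (0.8, -0.1), (2.0, 1.0), (2.3, 1.2)]
H = gradient_gram_matrix(centers=samples, samples=samples, sigma=0.5)
```

The resulting matrix is symmetric and positive semi-definite by construction, so it can stand in for the analytic matrix when prototyping, at the cost of sampling noise.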

87 After careful dataset selection [6], we considered the two moons toy problem (see ﬁgure 1). [sent-165, score-0.11]

88 On this 2D example, the regularizer we suggested implements perfectly the cluster assumption: the function is smooth on high density regions and the decision boundary lies in a low density region. [sent-166, score-1.059]

89 The reason might be that in dimension more than 2, the gradient does not yield a suitable regularizer: there exist non-continuous functions whose regularizer is 0. [sent-168, score-0.639]

90 To avoid this, from the Sobolev embedding lemma, we consider derivatives of order at least d/2. [sent-169, score-0.039]

91 More speciﬁcally, we are currently investigating the regularizer associated with a Gaussian kernel of width σr [8, page 100], ∞ 2p σr p! [sent-170, score-0.573]

92 Starting from the assumption that the distribution P (x) of the data is known, we have proposed several ideas to implement this principle and shown their relationships. [sent-175, score-0.084]

93 In addition, we have shown the relationship to the limiting behavior of an algorithm based on the graph Laplacian. [sent-176, score-0.043]

94 From a theoretical point of view, other types of regularizers, involving, for example, higher order derivatives should be studied. [sent-178, score-0.039]

95 Also from a practical point of view, we should derive eﬃcient algorithms from the proposed ideas, especially by obtaining ﬁnite sample approximations of the limit case where P (x) is known. [sent-179, score-0.038]

96 Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. [sent-195, score-0.194]

97 In Advances in Neural Information Processing Systems, volume 15, 2002. [sent-208, score-0.043]

98 In Advances in Neural Information Processing Systems, volume 14, 2001. [sent-216, score-0.043]

99 On a kernel-based method for pattern recognition, regression, approximation and operator inversion. [sent-225, score-0.04]

100 In Advances in Neural Information Processing Systems, volume 15. [sent-231, score-0.043]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('regularizer', 0.514), ('rd', 0.331), ('dxd', 0.244), ('uf', 0.212), ('dh', 0.166), ('density', 0.152), ('regularization', 0.146), ('hp', 0.138), ('lipschitz', 0.129), ('norm', 0.113), ('dp', 0.11), ('gij', 0.105), ('hh', 0.105), ('hij', 0.105), ('cluster', 0.104), ('geodesic', 0.097), ('unlabeled', 0.092), ('assumption', 0.084), ('td', 0.083), ('regularizers', 0.083), ('geometry', 0.079), ('di', 0.076), ('xj', 0.075), ('dx', 0.075), ('distance', 0.073), ('lebesgue', 0.073), ('xi', 0.071), ('rkhs', 0.069), ('regularized', 0.062), ('wij', 0.062), ('manifold', 0.062), ('spectral', 0.061), ('kernel', 0.059), ('path', 0.057), ('metric', 0.057), ('regions', 0.057), ('toy', 0.055), ('moons', 0.055), ('modifying', 0.054), ('nity', 0.052), ('xk', 0.052), ('laplacian', 0.052), ('points', 0.052), ('geometric', 0.051), ('riemannian', 0.051), ('enforce', 0.051), ('tensor', 0.048), ('knows', 0.048), ('kij', 0.048), ('non', 0.048), ('ij', 0.048), ('vincent', 0.046), ('decided', 0.046), ('labeled', 0.046), ('functions', 0.045), ('distorted', 0.044), ('label', 0.043), ('yi', 0.043), ('graph', 0.043), ('volume', 0.043), ('belkin', 0.042), ('decision', 0.042), ('expansion', 0.042), ('account', 0.041), ('coe', 0.041), ('olivier', 0.041), ('operator', 0.04), ('euclidean', 0.04), ('shortest', 0.039), ('goes', 0.039), ('loss', 0.039), ('measure', 0.039), ('integral', 0.039), ('derivatives', 0.039), ('implements', 0.038), ('practical', 0.038), ('hilbert', 0.036), ('marginal', 0.035), ('inner', 0.035), ('propose', 0.035), ('exp', 0.034), ('modify', 0.034), ('inf', 0.034), ('minimization', 0.034), ('neighbor', 0.033), ('rewrite', 0.033), ('sup', 0.033), ('transformation', 0.033), ('paths', 0.032), ('implementing', 0.032), ('continuous', 0.032), ('trace', 0.032), ('neighborhood', 0.032), ('suggest', 0.031), ('proposition', 0.031), ('cancels', 0.03), ('spect', 0.03), ('parzen', 0.03), ('stretch', 0.03), ('kh', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 126 nips-2003-Measure Based Regularization

Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein

2 0.18310721 113 nips-2003-Learning with Local and Global Consistency

Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf

Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data.

3 0.13901684 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

Author: Liva Ralaivola, Florence D'alché-buc

Abstract: We consider the question of predicting nonlinear time series. Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions.

4 0.13568929 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

Author: Denis V. Chigirev, William Bialek

Abstract: We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the inﬁnite cluster limit of rate distortion theory. By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. 1 A simple example Some data sets may not be as complicated as they appear. Consider the set of points on a plane in Figure 1. As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. Since the data are sparse the density will be almost singular. We may use a smoothing kernel, but then the data set will be described by a complicated combination of troughs and peaks with no obvious pattern and hence no ability to generalize. We intuitively, however, see a strong one dimensional structure (a curve) underlying the data. In this paper we attempt to capture this intuition formally, through the use of the inﬁnite cluster limit of rate distortion theory. Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded.” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. We would like to avoid such solutions, since they do not help us discover structure in the data. Looking for a simpler description one may choose to penalize the curvature term [1]. The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. An alternative approach is to allow curves of all shapes and sizes, but to send the reduced coordinates through an information bottleneck. 
With a fixed number of bits, position along a highly convoluted curve becomes uncertain. This will penalize curves that follow the data too closely (see Figure 1). There are several advantages to this approach. First, it removes the artificiality introduced by Hastie [2] of adding to the cost function only orthogonal errors. If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. Second, it does not require the dimensionality of the solution manifold as an input. Figure 1: Rate distortion curve for a data set of 25 points (red). We used 1000 points to represent the curve which were initialized by scattering them uniformly on the plane. Note that the produced curve is well defined, one dimensional and smooth. By adding extra dimensions, one quickly loses the precision with which manifold points are specified (due to the fixed information bottleneck). Hence, the optimal dimension emerges naturally. This also means that the method works well in many dimensions with no adjustments. Third, the method handles sparse data well. This is important since in high dimensional spaces all data sets are sparse, i.e. they look like points in Figure 1, and the density estimation becomes impossible. Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. 
Alternatively, we say that rather than ﬁlling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. Traditional tools of principal component analysis (PCA) and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. The same questions that we asked in the previous section apply here. What do we mean by requiring that “the manifold models the data well”? In the next section, we formalize this notion by deﬁning the manifold description of data as a doublet (the shape of the manifold and the projection map). Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). It is completely determined by the doublet. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. Kernel methods [8] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel. Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. 
One of the disadvantages of these methods is that the results are not easy to interpret. Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. Spectral methods (LLE [3], ISOMAP [4], Laplacian eigenmaps [10]) give reduced coordinates of an a priori dimensionality by introducing a quadratic cost function in reduced coordinates (hence eigenvectors are solutions) that mimics the relationships between points in the original data space (geodesic distance for ISOMAP, linear reconstruction for LLE). Density modeling methods (GTM [6], GMM [7]) are generative models that try to reproduce the data with fewer variables. They require a prior and a parametric generative model to be introduced a priori and then ﬁnd optimal parameters via maximum likelihood. The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. They tried to solve the problem by building an explicit neural network encoder-decoder system which restricted the information implicitly by limiting the number of nodes in the bottleneck layer. Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. This allows us to deﬁne the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of ﬁxed capacity. 3 Dimensionality reduction as compression Suppose that we have a data set X in a high dimensional state space RD described by a density function ρ(x). We would like to ﬁnd a “simpliﬁed” description of this data set. One may do so by visualizing a lower dimensional manifold M that “almost” describes the data. 
If we have a manifold M and a stochastic map $P_M : x \to P_M(\mu|x)$ to points µ on the manifold, we will say that they provide a manifold description of the data set X. Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold then we should expect some uncertainty in the estimation of the value of its latent variables. Also note that we do not need to specify the inverse (generative) map $M \to \mathbb{R}^D$; it can be obtained by Bayes’ rule. The manifold description $(M, P_M)$ is a less than faithful representation of the data. To formalize this notion we will introduce the distortion measure $D(M, P_M, \rho)$: $D(M, P_M, \rho) = \int_{x \in \mathbb{R}^D} \int_{\mu \in M} \rho(x)\, P_M(\mu|x)\, \|x - \mu\|^2\, d^D x\, D\mu$ (1). Here we have assumed the Euclidean distance function for simplicity. The stochastic map, $P_M(\mu|x)$, together with the density, ρ(x), define a joint probability function P(M, X) that allows us to calculate the mutual information between the data and its manifold representation: $I(X, M) = \int_{x \in X} \int_{\mu \in M} P(x, \mu) \log \frac{P(x, \mu)}{\rho(x)\, P_M(\mu)}\, d^D x\, D\mu$ (2). This quantity tells us how many bits (on average) are required to encode x into µ. If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data. Ideally, we would like to obtain a manifold description $\{M, P_M(M|X)\}$ of the data set X that provides both a low distortion $D(M, P_M, \rho)$ and a good compression (i.e. small I(X, M)). The more bits we are willing to provide for the description of the data, the more detailed a manifold that can be constructed. So there is a trade off between how faithful a manifold representation can be and how much information is required for its description. To formalize this notion we introduce the concept of an optimal manifold. DEFINITION. 
Given a data set X and a channel capacity I, a manifold description $(M, P_M(M|X))$ that minimizes the distortion $D(M, P_M, X)$, and requires only information I for representing an element of X, will be called an optimal manifold M(I, X). Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at value D. The shape and the dimensionality of optimal manifold depends on our information resolution (or the description length that we are willing to allow). This dependence captures our intuition that for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve. To find the optimal manifold $(M(I), P_{M(I)})$ for a given data set X, we must solve a constrained optimization problem. Let us introduce a Lagrange multiplier λ that represents the trade off between information and distortion. Then optimal manifold M(I) minimizes the functional: $F(M, P_M) = D + \lambda I$ (3). Let us parametrize the manifold M by t (presumably $t \in \mathbb{R}^d$ for some d ≤ D). The function $\gamma(t) : t \to M$ maps the points from the parameter space onto the manifold and therefore describes the manifold. Our equations become: $D = \int d^D x \int d^d t\, \rho(x)\, P(t|x)\, \|x - \gamma(t)\|^2$ (4), $I = \int d^D x \int d^d t\, \rho(x)\, P(t|x) \log \frac{P(t|x)}{P(t)}$ (5), $F(\gamma(t), P(t|x)) = D + \lambda I$ (6). Note that both information and distortion measures are properties of the manifold description doublet $\{M, P_M(M|X)\}$ and are invariant under reparametrization. We require the variations of the functional to vanish for optimal manifolds, $\delta F / \delta \gamma(t) = 0$ and $\delta F / \delta P(t|x) = 0$, to obtain the following set of self consistent equations: $P(t) = \int d^D x\, \rho(x)\, P(t|x)$ (7), $\gamma(t) = \frac{1}{P(t)} \int d^D x\, x\, \rho(x)\, P(t|x)$ (8), $P(t|x) = \frac{P(t)}{\Pi(x)}\, e^{-\frac{1}{\lambda} \|x - \gamma(t)\|^2}$ (9), $\Pi(x) = \int d^d t\, P(t)\, e^{-\frac{1}{\lambda} \|x - \gamma(t)\|^2}$ (10). In practice we do not have the full density ρ(x), but only a discrete number of samples. 
So we have to approximate $\rho(x) = \frac{1}{N} \sum_i \delta(x - x_i)$, where N is the number of samples, i is the sample label, and $x_i$ is the multidimensional vector describing the ith sample. Similarly, instead of using a continuous variable t we use a discrete set $t \in \{t_1, t_2, ..., t_K\}$ of K points to model the manifold. Note that in (7 − 10) the variable t appears only as an argument for other functions, so we can replace the integral over t by a sum over k = 1..K. Then P(t|x) becomes $P_k(x_i)$, γ(t) is now $\gamma_k$, and P(t) is $P_k$. The solution to the resulting set of equations in discrete variables (11 − 14) can be found by an iterative Blahut-Arimoto procedure [11] with an additional EM-like step. Here (n) denotes the iteration step, and α is a coordinate index in $\mathbb{R}^D$. The iteration scheme becomes:

$$P_k^{(n)} = \frac{1}{N} \sum_{i=1}^{N} P_k^{(n)}(x_i), \tag{11}$$

$$\gamma_{k,\alpha}^{(n)} = \frac{1}{P_k^{(n)}} \frac{1}{N} \sum_{i=1}^{N} x_{i,\alpha}\, P_k^{(n)}(x_i), \quad \alpha = 1, \ldots, D, \tag{12}$$

$$\Pi^{(n)}(x_i) = \sum_{k=1}^{K} P_k^{(n)}\, e^{-\frac{1}{\lambda} \|x_i - \gamma_k^{(n)}\|^2}, \tag{13}$$

$$P_k^{(n+1)}(x_i) = \frac{P_k^{(n)}}{\Pi^{(n)}(x_i)}\, e^{-\frac{1}{\lambda} \|x_i - \gamma_k^{(n)}\|^2}. \tag{14}$$

One can initialize $\gamma_k^0$ and $P_k^0(x_i)$ by choosing K points at random from the data set and letting $\gamma_k^0 = x_{i(k)}$ and $P_k^0 = 1/K$, then use equations (13) and (14) to initialize the association map $P_k^0(x_i)$. The iteration procedure (11 − 14) is terminated once

$$\max_k |\gamma_k^{n} - \gamma_k^{n-1}| < \epsilon, \tag{15}$$

where ε determines the precision with which the manifold points are located. The above algorithm requires the information distortion cost λ = −δD/δI as a parameter. If we want to find the manifold description (M, P(M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it's monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I.

4 Evaluating the solution

The result of our algorithm is a collection of K manifold points, $\gamma_k \in M \subset \mathbb{R}^D$, and a stochastic projection map, $P_k(x_i)$, which maps the points from the data space onto the manifold.
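The iteration scheme (11) to (15) that produces these manifold points and the projection map can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `soft_assign`, `fit_manifold` and `distortion_and_information`, the `max_iter` cap, the floor on $P_k$, and the use of the Euclidean norm in the stopping rule (15) are all choices made here for the example.

```python
import numpy as np

def soft_assign(X, gamma, P_k, lam):
    """Eqs. (13)-(14): P_k(x_i) = P_k * exp(-|x_i - gamma_k|^2 / lam) / Pi(x_i)."""
    sq = ((X[:, None, :] - gamma[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    logw = np.log(P_k)[None, :] - sq / lam
    logw -= logw.max(axis=1, keepdims=True)   # stabilise the exponentials
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)   # division by Pi(x_i)

def fit_manifold(X, K, lam, eps=0.1, max_iter=500, seed=0):
    """Iterate eqs. (11)-(14) until the stopping rule (15) is met."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Initialisation: K random data points, uniform P_k, then (13)-(14).
    gamma = X[rng.choice(N, size=K, replace=False)].astype(float)
    P_k = np.full(K, 1.0 / K)
    P_kx = soft_assign(X, gamma, P_k, lam)
    for _ in range(max_iter):                 # max_iter is a safety cap (assumption)
        # Eq. (11); the tiny floor avoids log(0) and 0/0 (assumption, not in the text).
        P_k = np.maximum(P_kx.mean(axis=0), 1e-12)
        new_gamma = (P_kx.T @ X) / (N * P_k[:, None])             # eq. (12)
        shift = np.linalg.norm(new_gamma - gamma, axis=1).max()   # eq. (15)
        gamma = new_gamma
        P_kx = soft_assign(X, gamma, P_k, lam)                    # eqs. (13)-(14)
        if shift < eps:
            break
    return gamma, P_k, P_kx

def distortion_and_information(X, gamma, P_kx):
    """Discrete versions of eqs. (4) and (5); information is reported in bits."""
    N = X.shape[0]
    sq = ((X[:, None, :] - gamma[None, :, :]) ** 2).sum(axis=2)
    D = (P_kx * sq).sum() / N
    P_k = np.maximum(P_kx.mean(axis=0), 1e-300)
    ratio = np.where(P_kx > 0, P_kx / P_k, 1.0)   # 0 * log(0) is treated as 0
    I = (P_kx * np.log2(ratio)).sum() / N
    return D, I
```

Calling `fit_manifold(X, K=100, lam=8.0, eps=0.1)` on an `(N, D)` array mirrors the parameter values quoted for the semi-circle example; scanning `lam` and recording `I` from `distortion_and_information` is one way to trace the curve I(λ).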
Presumably, the manifold M has a well defined intrinsic dimensionality d. If we imagine a little ball of radius r centered at some point on the manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as $r^d$. On the other hand, this will not necessarily be true for the original data set, since it is more spread out and resembles locally the whole embedding space $\mathbb{R}^D$. The Grassberger-Procaccia algorithm [12] captures this intuition by calculating the correlation dimension. First, calculate the correlation integral:

$$C(r) = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j>i} H(r - |x_i - x_j|), \tag{16}$$

where H(x) is a step function with H(x) = 1 for x > 0 and H(x) = 0 for x < 0. This measures the probability that any two points fall within the ball of radius r. Then define the correlation dimension at length scale r as the slope on the log log plot:

$$d_{corr}(r) = \frac{d \log C(r)}{d \log r}. \tag{17}$$

Figure 2: The semicircle. (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1 and the final positions of 100 manifold points. (b) Log log plot of C(r) vs r (axes: ln r, ln C(r)) for both the manifold points (squares) and the original data set (circles).

For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. Note that the slope at large length scales always tends to decrease due to finite span of the data and curvature effects and therefore does not provide a reliable estimator of intrinsic dimensionality.

5 Examples

5.1 Semi-Circle

We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a).
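The correlation integral (16) and its log-log slope (17) can be sketched as follows, together with a generator for a semi-circle sample like the one just described. The helper names are hypothetical, and the sampling recipe (angles uniform on [0, π], isotropic Gaussian noise in both coordinates) is an assumption, since the text does not spell out how the points were scattered. The slope is estimated by finite differences on a user-chosen grid of radii.

```python
import numpy as np

def sample_semicircle(n=3150, radius=20.0, sigma=1.0, seed=0):
    """Noisy semi-circle sample (sampling recipe is an assumption, see lead-in)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, np.pi, n)
    clean = radius * np.column_stack([np.cos(theta), np.sin(theta)])
    return clean + rng.normal(0.0, sigma, size=(n, 2))

def correlation_integral(points, radii):
    """Eq. (16): fraction of distinct pairs (i < j) with |x_i - x_j| < r."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    sq_norm = (points ** 2).sum(axis=1)
    # Pairwise squared distances via the Gram matrix; clip tiny negatives.
    sq = np.maximum(sq_norm[:, None] + sq_norm[None, :] - 2.0 * points @ points.T, 0.0)
    pair_dist = np.sqrt(sq[np.triu_indices(n, k=1)])
    counts = np.array([np.count_nonzero(pair_dist < r) for r in radii])
    return 2.0 * counts / (n * (n - 1))

def correlation_dimension(points, radii):
    """Eq. (17): finite-difference slope of log C(r) against log r.

    The radii must be large enough that C(r) > 0, otherwise log C diverges.
    """
    radii = np.asarray(radii, dtype=float)
    C = correlation_integral(points, radii)
    return np.gradient(np.log(C), np.log(radii))
```

For the full N = 3150 sample the pairwise matrix holds about ten million entries, so subsampling (for example `sample_semicircle(n=1000)`) keeps the sketch light; applying `correlation_dimension` to both the data and the fitted manifold points gives the two curves compared in Figure 2b.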
Then we ran the algorithm with K = 100 and λ = 8, and terminated the iterative algorithm once the precision ε = 0.1 had been reached. The resulting manifold is depicted in red. To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. One should also note that the manifold points have $d_{corr}(r) = 1$ well into the territory where the original data set becomes two dimensional. This is what we should expect: at a given information level (in this case, I = 2.8 bits), the information about the second (local) degree of freedom is lost, and the resulting structure is one dimensional.

A note about the parameters. Letting K → ∞ does not alter the solution. The information I and distortion D remain the same, and the additional points $\gamma_k$ also fall on the semi-circle and are simple interpolations between the original manifold points. This allows us to claim that what we have found is a manifold, and not an agglomeration of clustering centers. Second, varying λ changes the information resolution I(λ): for small λ (high information rate) the local structure becomes important. At high information rate the solution undergoes a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure.

Figure 3: S-shaped sheet in 3D. (a) N = 2000 random points on a surface of an S-shaped sheet in 3D. (b) Normal noise added. XY-plane projection of the data. (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization.
Alternatively, if we take λ → ∞, the cost of information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional).

5.2 S-surface

Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. The resulting manifold is difficult to visualize in three dimensions, so we provided its projection onto an XY plane for an illustrative purpose (Figure 3b). After running our algorithm we have recovered the original structure of the manifold (Figure 3c).

6 Discussion

The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid highly folded, Peano curve like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. Rather than constraining geometrical features of the manifold (e.g., the curvature) we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. This approach enforces “smoothness” of the manifold only implicitly, but nonetheless seems to work. Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. For data points not in the original data set, equations (9-10) and (13-14) provide the mapping onto the manifold. Eqn. (7) gives the probability distribution over the latent variable, known in the density modeling literature as “the prior.” The running time of the algorithm is linear in N. This compares favorably with other methods and makes it particularly attractive for very large data sets.
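The out-of-sample mapping mentioned above, applying equations (13) and (14) to a point that was not in the training set, can be sketched as follows. The function name and the returned posterior-mean position are illustrative assumptions; the text itself only specifies the association probabilities.

```python
import numpy as np

def project_onto_manifold(x_new, gamma, P_k, lam):
    """Association probabilities P_k(x_new) from eqs. (13)-(14) for a new point.

    gamma : (K, D) manifold points, P_k : (K,) prior weights, lam : trade-off parameter.
    Also returns the posterior-mean position, a convenience point estimate
    that is an assumption of this sketch rather than part of the paper.
    """
    x_new = np.asarray(x_new, dtype=float)
    sq = ((gamma - x_new) ** 2).sum(axis=1)   # squared distances to each gamma_k
    logw = np.log(P_k) - sq / lam
    logw -= logw.max()                        # numerical stabilisation
    w = np.exp(logw)
    p = w / w.sum()                           # division by Pi(x_new), eq. (13)
    return p, p @ gamma
```

Each call costs O(K D), which is consistent with the linear-in-N running time noted above when the mapping is applied to every data point.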
The number of manifold points K usually is chosen as large as possible, given the computational constraints, to have a dense sampling of the manifold. However, a value of K << N is often sufﬁcient, since D(λ, K) → D(λ) and I(λ, K) → I(λ) approach their limits rather quickly (the convergence improves for large λ and deteriorates for small λ). In the example of a semi-circle, the value of K = 30 was sufﬁcient at the compression level of I = 2.8 bits. In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space). The choice of λ depends on the desired information resolution, since I depends on λ. Ideally, one should plot the function I(λ) and then choose the region of interest. I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. In this case, as a rule of thumb, we choose λ smaller, but on the order of the largest linear dimension (i.e. λ/2 ∼ Lmax ). The dependence of the optimal manifold M(I) on information resolution reﬂects the multi-scale nature of the data and should not be taken as a shortcoming. References [1] Bregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7. MIT Press. [2] Hastie, T. & Stuetzle, W. (1989) Principal curves. Journal of the American Statistical Association, 84(406), 502-516. [3] Roweis, S. & Saul, L. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326. [4] Tenenbaum, J., de Silva, V., & Langford, J. (2000) A global geometric framework for nonlinear dimensionality reduction. Science, 290 , 2319–2323. [5] Hotelling, H. 
(1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441,498-520. [6] Bishop, C., Svensen, M. & Williams, C. (1998) GTM: The generative topographic mapping. Neural Computation,10, 215–234. [7] Brand, M. (2003) Charting a manifold. Advances in Neural Information Processing Systems 15. MIT Press. [8] Scholkopf, B., Smola, A. & Muller K-R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319. [9] Kramer, M. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243. [10] Belkin M. & Niyogi P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396. [11] Blahut, R. (1972) Computation of channel capacity and rate distortion function. IEEE Trans. Inform. Theory, IT-18, 460-473. [12] Grassberger, P., & Procaccia, I. (1983) Characterization of strange attractors. Physical Review Letters, 50, 346-349.

5 0.13302279 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

Author: Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet

Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data.

6 0.12920998 107 nips-2003-Learning Spectral Clustering

7 0.11814183 128 nips-2003-Minimax Embeddings

8 0.10823005 122 nips-2003-Margin Maximizing Loss Functions

9 0.10649183 120 nips-2003-Locality Preserving Projections

10 0.097362712 81 nips-2003-Geometric Analysis of Constrained Curves

11 0.09649013 108 nips-2003-Learning a Distance Metric from Relative Comparisons

12 0.095779218 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels

13 0.094576478 117 nips-2003-Linear Response for Approximate Inference

14 0.091360606 46 nips-2003-Clustering with the Connectivity Kernel

15 0.0885875 51 nips-2003-Design of Experiments via Information Theory

16 0.08761435 112 nips-2003-Learning to Find Pre-Images

17 0.084805936 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

18 0.083617046 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

19 0.081206031 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

20 0.080784529 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.27), (1, -0.162), (2, -0.073), (3, 0.005), (4, 0.028), (5, 0.12), (6, -0.12), (7, -0.012), (8, 0.034), (9, -0.121), (10, 0.099), (11, 0.076), (12, 0.1), (13, 0.007), (14, -0.023), (15, -0.03), (16, 0.107), (17, 0.054), (18, -0.038), (19, -0.028), (20, 0.01), (21, 0.008), (22, 0.045), (23, 0.042), (24, -0.092), (25, -0.039), (26, -0.02), (27, 0.076), (28, 0.075), (29, 0.127), (30, -0.084), (31, -0.003), (32, 0.052), (33, 0.072), (34, 0.026), (35, 0.022), (36, -0.012), (37, -0.105), (38, -0.123), (39, -0.038), (40, -0.12), (41, -0.018), (42, -0.06), (43, -0.015), (44, -0.1), (45, 0.0), (46, -0.085), (47, -0.014), (48, -0.007), (49, -0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95225447 126 nips-2003-Measure Based Regularization

Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein

2 0.67872298 120 nips-2003-Locality Preserving Projections

Author: Xiaofei He, Partha Niyogi

Abstract: Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA) – a classical linear technique that projects the data along the directions of maximal variance. When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by ﬁnding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and more crucially is deﬁned everywhere in ambient space rather than just on the training data points. This is borne out by illustrative examples on some high dimensional data sets.

3 0.6711185 113 nips-2003-Learning with Local and Global Consistency

Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf

4 0.66357821 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

Author: Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet

5 0.59400326 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

Author: Denis V. Chigirev, William Bialek

6 0.58903325 128 nips-2003-Minimax Embeddings

7 0.57583827 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels

8 0.53790915 112 nips-2003-Learning to Find Pre-Images

9 0.5067926 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

10 0.49576101 71 nips-2003-Fast Embedding of Sparse Similarity Graphs

11 0.49274731 108 nips-2003-Learning a Distance Metric from Relative Comparisons

12 0.47148508 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

13 0.46935201 46 nips-2003-Clustering with the Connectivity Kernel

14 0.45140919 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

15 0.44273674 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

16 0.4393968 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression

17 0.43673986 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science

18 0.42605203 107 nips-2003-Learning Spectral Clustering

19 0.42589387 51 nips-2003-Design of Experiments via Information Theory

20 0.42125687 172 nips-2003-Semi-Supervised Learning with Trees

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.043), (6, 0.199), (11, 0.036), (30, 0.014), (35, 0.066), (48, 0.017), (53, 0.132), (69, 0.033), (71, 0.101), (76, 0.049), (78, 0.022), (85, 0.085), (91, 0.093), (99, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96015173 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons

Author: Jeff Bondy, Ian Bruce, Suzanna Becker, Simon Haykin

Abstract: A major issue in evaluating speech enhancement and hearing compensation algorithms is to come up with a suitable metric that predicts intelligibility as judged by a human listener. Previous methods such as the widely used Speech Transmission Index (STI) fail to account for masking effects that arise from the highly nonlinear cochlear transfer function. We therefore propose a Neural Articulation Index (NAI) that estimates speech intelligibility from the instantaneous neural spike rate over time, produced when a signal is processed by an auditory neural model. By using a well developed model of the auditory periphery and detection theory we show that human perceptual discrimination closely matches the modeled distortion in the instantaneous spike rates of the auditory nerve. In highly rippled frequency transfer conditions the NAI’s prediction error is 8% versus the STI’s prediction error of 10.8%.

1 Introduction

A wide range of intelligibility measures in current use rest on the assumption that intelligibility of a speech signal is based upon the sum of contributions of intelligibility within individual frequency bands, as first proposed by French and Steinberg [1]. This basic method applies a function of the Signal-to-Noise Ratio (SNR) in a set of bands, then averages across these bands to come up with a prediction of intelligibility. French and Steinberg’s original Articulation Index (AI) is based on 20 equally contributing bands, and produces an intelligibility score between zero and one:

$$AI = \frac{1}{20} \sum_{i=1}^{20} TI_i, \tag{1}$$

where $TI_i$ (Transmission Index i) is the normalized intelligibility in the i-th band. The TI per band is a function of the signal to noise ratio, or:

$$TI_i = \frac{SNR_i + 12}{30} \tag{2}$$

for SNRs between -12 dB and 18 dB. A SNR of greater than 18 dB means that the band has perfect intelligibility and TI equals 1, while an SNR under -12 dB means that a band is not contributing at all, and the TI of that band equals 0.
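Equations (1) and (2), including the clipping at -12 dB and 18 dB, amount to a few lines of arithmetic. The sketch below uses hypothetical function names and averages over however many bands are supplied, which reduces to equation (1) when 20 band SNRs are given.

```python
import numpy as np

def transmission_index(snr_db):
    """Eq. (2) with the stated limits: TI = (SNR + 12) / 30, clipped to [0, 1]."""
    snr_db = np.asarray(snr_db, dtype=float)
    return np.clip((snr_db + 12.0) / 30.0, 0.0, 1.0)

def articulation_index(snr_db):
    """Eq. (1): mean TI over equally contributing bands (20 in the original AI)."""
    return float(transmission_index(snr_db).mean())
```

For instance, a band at 3 dB SNR sits halfway through the 30 dB range and therefore has a TI of 0.5.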
The overall intelligibility is then a function of the AI, but this function changes depending on the semantic context of the signal.

Kryter validated many of the underlying AI principles [2]. Kryter also presented the mechanics for calculating the AI for different numbers of bands (5, 6, 15 or the original 20) as well as important correction factors [3]. Some of the most important correction factors account for the effects of modulated noise, peak clipping, and reverberation. Even with the application of various correction factors, the AI does not predict intelligibility in the presence of some time-domain distortions. Consequently, the Modulation Transfer Function (MTF) has been utilized to measure the loss of intelligibility due to echoes and reverberation [4]. Steeneken and Houtgast later extended this approach to include nonlinear distortions, giving a new name to the predictor: the Speech Transmission Index (STI) [5]. These metrics proved more valid for a larger range of environments and interferences. The STI test signal is a long-term average speech spectrum, Gaussian random signal, amplitude modulated by a 0.63 Hz to 12.5 Hz tone. Acoustic components within different frequency bands are switched on and off over the testing sequence to come up with an intelligibility score between zero and one. Interband intermodulation sources can be discerned, as long as the product does not fall into the testing band. Therefore, the STI allows for standard AI-frequency band weighted SNR effects, MTF-time domain effects, and some limited measurements of nonlinearities. The STI shows a high correlation with empirical tests, and has been codified as an ANSI standard [6]. For general acoustics it is very good.
However, the STI does not accurately model intraband masker non-linearities, phase distortions or the underlying auditory mechanisms (outside of independent frequency bands). We therefore sought to extend the AI/STI concepts to predict intelligibility, on the assumption that the closest physical variable we have to the perceptual variable of intelligibility is the auditory nerve response. Using a spiking model of the auditory periphery [7] we form the Neural Articulation Index (NAI) by describing distortions in the spike trains of different frequency bands. The spiking over time of an auditory nerve fiber for an undistorted speech signal (control case) is compared to the neural spiking over time for the same signal after undergoing some distortion (test case). The difference in the estimated instantaneous discharge rate for the two cases is used to calculate a neural equivalent to the TI, the Neural Distortion (ND), for each frequency band. Then the NAI is calculated with a weighted average of NDs at different Best Frequencies (BFs). In general detection theory terms, the control neuronal response sets some locus in a high dimensional space, then the distorted neuronal response will project near that locus if it is perceptually equivalent, or very far away if it is not. Thus, the distance between the control neuronal response and the distorted neuronal response is a function of intelligibility. Due to the limitations of the STI mentioned above it is predicted that a measure of the neural coding error will be a better predictor than SNR for human intelligibility word-scores. Our method also has the potential to shed light on the underlying neurobiological mechanisms.

2 Method

2.1 Model

The auditory periphery model used throughout (and hereafter referred to as the Auditory Model) is from [7]. The system is shown in Figure 1.

Figure 1: Block diagram of the computational model of the auditory periphery from the middle ear to the Auditory Nerve. Reprinted from Fig.
1 of [7] with permission from the Acoustical Society of America © (2003).

The auditory periphery model comprises several sections, each providing a phenomenological description of a different part of the cat auditory periphery function. The first section models middle ear filtering. The second section, labeled the “control path,” captures the Outer Hair Cells (OHC) modulatory function, and includes a wideband, nonlinear, time varying, band-pass filter followed by an OHC nonlinearity (NL) and low-pass (LP) filter. This section controls the time-varying, nonlinear behavior of the narrowband signal-path basilar membrane (BM) filter. The control-path filter has a wider bandwidth than the signal-path filter to account for wideband nonlinear phenomena such as two-tone rate suppression. The third section of the model, labeled the “signal path”, describes the filter properties and traveling wave delay of the BM (time-varying, narrowband filter); the nonlinear transduction and low-pass filtering of the Inner Hair Cell (IHC NL and LP); spontaneous and driven activity and adaptation in synaptic transmission (synapse model); and spike generation and refractoriness in the auditory nerve (AN). In this model, CIHC and COHC are scaling constants that control IHC and OHC status, respectively. The parameters of the synapse section of the model are set to produce adaptation and discharge-rate versus level behavior appropriate for a high-spontaneous-rate/low-threshold auditory nerve fiber. In order to avoid having to generate many spike trains to obtain a reliable estimate of the instantaneous discharge rate over time, we instead use the synaptic release rate as an approximation of the discharge rate, ignoring the effects of neural refractoriness.

2.2 Neural articulation index

These results emulate most of the simulations described in Chapter 2 of Steeneken’s thesis [8], as it describes the full development of an STI metric from inception to end.
For those interested, the following simulations try to map most of the second chapter, but instead of basing the distortion metric on a SNR calculation, we use the neural distortion. There are two sets of experiments. The first, in section 3.1, deals with applying a frequency weighting structure to combine the band distortion values, while section 3.2 also introduces redundancy factors. The bands, chosen to match [8], are octave bands centered at [125, 250, 500, 1000, 2000, 4000, 8000] Hz. Only seven bands are used here. The Neural AI (NAI) for this is:

$$NAI = \alpha_1 \cdot NTI_1 + \alpha_2 \cdot NTI_2 + \ldots + \alpha_7 \cdot NTI_7, \tag{3}$$

where $\alpha_i$ is the i-th band's contribution and $NTI_i$ is the Neural Transmission Index in the i-th band. Here all the αs sum to one, so each α factor can be thought of as the percentage contribution of a band to intelligibility. Since NTI is between [0,1], it can also be thought of as the percentage of acoustic features that are intelligible in a particular band. The ND per band is the projection of the distorted (Test) instantaneous spike rate against the clean (Control) instantaneous spike rate:

$$ND = 1 - \frac{Test \cdot Control^T}{Control \cdot Control^T}, \tag{4}$$

where Control and Test are vectors of the instantaneous spike rate over time, sampled at 22050 Hz. This type of error metric can only deal with steady state channel distortions, such as the ones used in [8]. ND was then linearly fit to resemble the TI of equations (1) and (2), after normalizing each of the seven bands to have zero mean and unit standard deviation. The NTI in the i-th band was calculated as

$$NTI_i = m\, \frac{ND_i - \mu_i}{\sigma_i} + b. \tag{5}$$

$NTI_i$ is then thresholded to be no less than 0 and no greater than 1, following the TI thresholding. In equation (5) the factors, m = 2.5, b = -1, were the best linear fit to produce $NTI_i$'s of 1 in bands with SNR greater than 15 dB, $NTI_i$'s of 0.75 in bands with 7.5 dB SNR, and $NTI_i$'s of 0.5 in bands with 0 dB SNR.
This closely followed the procedure outlined in section 2.3.3 of [8]. As the TI is a best linear fit of SNR to intelligibility, the NTI is a best linear fit of neural distortion to intelligibility. The input stimuli were taken from a Dutch corpus [9], and consisted of 10 Consonant-Vowel-Consonant (CVC) words, each spoken by four males and four females and sampled at 44100 Hz. The Steeneken study had many more, but the exact corpus could not be found. A total of 80 words is enough to produce meaningful frequency weighting factors. There were 26 frequency channel distortion conditions used for male speakers, 17 for female speakers, and three SNRs (+15 dB, +7.5 dB and 0 dB). The channel conditions were split into four groups, given in Tables 1 through 4 for males. Because female speech has negligible signal in the 125 Hz band, the female conditions are a subset, marked with an asterisk in Tables 1 through 4.

Table 1: Rippled Envelope (columns: octave-band centre frequency, Hz)

ID #   125  250  500  1K   2K   4K   8K
1*     1    1    1    1    0    0    0
2*     0    0    0    0    1    1    1
3*     1    1    0    0    0    1    1
4*     0    0    1    1    1    0    0
5*     1    1    0    0    1    1    0
6*     0    0    1    1    0    0    1
7*     1    0    1    0    1    0    1
8*     0    1    0    1    0    1    0

Table 2: Adjacent Triplets (columns: octave-band centre frequency, Hz)

ID #   125  250  500  1K   2K   4K   8K
9      1    1    1    0    0    0    0
10     0    1    1    1    0    0    0
11*    0    0    0    1    1    1    0

Table 3: Isolated Triplets (columns: octave-band centre frequency, Hz)

ID #   125  250  500  1K   2K   4K   8K
12     1    0    1    0    1    0    0
13     1    0    1    0    0    1    0
14     1    0    0    1    0    1    0
15*    0    1    0    1    0    0    1
16*    0    1    0    0    1    0    1
17     0    0    1    0    1    0    1

Table 4: Contiguous Bands (columns: octave-band centre frequency, Hz)

ID #   125  250  500  1K   2K   4K   8K
18*    0    1    1    1    1    0    0
19*    0    0    1    1    1    1    0
20*    0    0    0    1    1    1    1
21     1    1    1    1    1    0    0
22*    0    1    1    1    1    1    0
23*    0    0    1    1    1    1    1
24     1    1    1    1    1    1    0
25     0    1    1    1    1    1    1
26*    1    1    1    1    1    1    1

In the above tables a one represents a passband and a zero a stop band. A 1353 tap FIR filter was designed for each envelope condition. The female envelopes are a subset of these because they have no appreciable speech energy in the 125 Hz octave band.
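Equations (3) to (5) of this section can be sketched in NumPy as follows. The function names are hypothetical; m = 2.5 and b = -1 are the fitted values quoted in the text, while the per-band mean `mu` and standard deviation `sigma` are inputs that would have to be estimated from the ND values across conditions. Renormalizing the weights inside `neural_articulation_index` is a convenience assumption of this sketch.

```python
import numpy as np

def neural_distortion(test, control):
    """Eq. (4): ND = 1 - (Test . Control) / (Control . Control)."""
    test = np.asarray(test, dtype=float)
    control = np.asarray(control, dtype=float)
    return 1.0 - (test @ control) / (control @ control)

def neural_transmission_index(nd, mu, sigma, m=2.5, b=-1.0):
    """Eq. (5) followed by thresholding to [0, 1]; mu and sigma are per-band statistics."""
    nd = np.asarray(nd, dtype=float)
    return np.clip(m * (nd - mu) / sigma + b, 0.0, 1.0)

def neural_articulation_index(nti, alpha):
    """Eq. (3): weighted sum of band NTIs; weights are renormalized to sum to one."""
    nti = np.asarray(nti, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return float((alpha / alpha.sum()) @ nti)
```

With identical test and control rate vectors `neural_distortion` returns 0, and a silent test vector gives 1, which brackets the range seen for attenuating channel conditions such as those in Tables 1 through 4.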
Using the 40 male utterances and 40 female utterances under distortion and calculating the NAI following equation (3) produces only a value between [0,1]. To produce a word-score intelligibility prediction between zero and 100 percent the NAI value was fit to a third order polynomial that produced the lowest standard deviation of error from empirical data. While Fletcher and Galt [10] state that the relation between AI and intelligibility is exponential, [8] fits with a third order polynomial, and we have chosen to compare to [8]. The empirical word-score intelligibility was from [8].

3 Results

3.1 Determining frequency weighting structure

For the first tests, the optimal frequency weights (the values of $\alpha_i$ from equation 3) were designed through minimizing the difference between the predicted intelligibility and the empirical intelligibility. At each iteration one of the values was dithered up or down, and then the sum of the $\alpha_i$ was normalized to one. This is very similar to [5] whose final standard deviation of prediction error for males was 12.8%, and 8.8% for females. The NAI’s final standard deviation of prediction error for males was 8.9%, and 7.1% for females.

Figure 2: Relation between NAI and empirical word-score intelligibility for male (left) and female (right) speech with bandpass limiting and noise. The vertical spread from the best fitting polynomial for males has a s.d. = 8.9% versus the STI [5] s.d. = 12.8%, for females the fit has a s.d. = 7.1% versus the STI [5] s.d. = 8.8%.

The frequency weighting factors are similar for the NAI and the STI. The STI weighting factors from [8], which produced the optimal prediction of empirical data (male s.d. = 6.8%, female s.d. = 6.0%) and the NAI are plotted in Figure 3.

Figure 3: Frequency weighting factors for the optimal predictor of male and female intelligibility calculated with the NAI and published by Steeneken [8].
As one can see, the low frequency information is tremendously suppressed in the NAI, while the high frequencies are emphasized. This may be an effect of the stimuli corpus. The corpus has a high percentage of stops and fricatives in the initial and final consonant positions. Since these have a comparatively large amount of high frequency signal they may explain this discrepancy at the cost of the low frequency weights. [8] does state that these frequency weights are dependent upon the conditions used for evaluation.

3.2 Determining frequency weighting with redundancy factors

In experiment two, rather than using equation (3), which assumes each frequency band contributes independently, we introduce redundancy factors. There is correlation between the different frequency bands of speech [11], which tends to make the STI over-predict intelligibility. The redundancy factors attempt to remove correlated signals between bands. Equation (3) then becomes:

$$NAI_r = \alpha_1 \cdot NTI_1 - \beta_1\, NTI_1 \cdot NTI_2 + \alpha_2 \cdot NTI_2 - \beta_2\, NTI_2 \cdot NTI_3 + \ldots + \alpha_7 \cdot NTI_7, \tag{6}$$

where the r subscript denotes a redundant NAI and β is the correlation factor. Only adjacent bands are used here to reduce complexity. We replicated Section 3.1 except using equation (6). The same testing and adaptation strategy from Section 3.1 was used to find the optimal αs and βs.

Figure 4: Relation between NAIr and empirical word-score intelligibility for male speech (right) and female speech (left) with bandpass limiting and noise with Redundancy Factors. The vertical spread from the best fitting polynomial for males has a s.d. = 6.9% versus the STIr [8] s.d. = 4.7%, for females the best fitting polynomial has a s.d. = 5.4% versus the STIr [8] s.d. = 4.0%.

The frequency weighting and redundancy factors given as optimal in Steeneken, versus calculated through optimizing the NAIr, are given in Figure 5.
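The redundancy-corrected index of equation (6) can be sketched as below. This follows the equation as it is printed here, with each redundancy term being a β-weighted plain product of adjacent NTIs and one β per adjacent pair; whether the original typeset equation contained any further operator on that product cannot be recovered from this text, so the form is an assumption. The function name and the array layout are hypothetical.

```python
import numpy as np

def neural_articulation_index_r(nti, alpha, beta):
    """Eq. (6) as printed: sum_i alpha_i * NTI_i - sum_i beta_i * NTI_i * NTI_{i+1}.

    nti, alpha : length-B arrays (B = 7 octave bands in the text)
    beta       : length-(B - 1) array, one factor per pair of adjacent bands
    """
    nti = np.asarray(nti, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if beta.shape[0] != nti.shape[0] - 1:
        raise ValueError("beta needs one entry per adjacent band pair")
    band_terms = alpha @ nti                      # independent band contributions
    redundancy = beta @ (nti[:-1] * nti[1:])      # adjacent-band corrections
    return float(band_terms - redundancy)
```

Setting every β to zero recovers the plain weighted sum of equation (3), which makes the two experiments directly comparable in code.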
Figure 5: Frequency and redundancy factors for the optimal predictor of male and female intelligibility calculated with the NAIr and published in [8].
The frequency weights for the NAIr and STIr are more similar than in Section 3.1. The redundancy factors are very different, though. The NAI redundancy factors show no real frequency dependence, unlike the convex STI redundancy factors. This may be due to differences in optimization that were not clear in [8].
Table 5: Standard deviation of prediction error
Condition        NAI      STI [5]   STI [8]
Male, Eq. 3      8.9%     12.8%     6.8%
Female, Eq. 3    7.1%     8.8%      6.0%
Male, Eq. 6      6.9%     -         4.7%
Female, Eq. 6    5.4%     -         4.0%
The mean difference in error between the STIr, as given in [8], and the NAIr is 1.7%. This difference may be from the limited CVC word choice. It is well within the range of normal speaker variation, about 2%, so we believe that the NAI and NAIr are comparable to the STI and STIr in predicting speech intelligibility.
4 Conclusions
These results are very encouraging. The NAI provides a modest improvement over STI in predicting intelligibility. We do not propose this as a replacement for the STI for general acoustics, since the NAI is much more computationally complex than the STI. The NAI's end applications are in predicting hearing impairment intelligibility and using statistical decision theory to describe the auditory system's feature extractors, tasks which the STI cannot do but which are available to the NAI. While the AI and STI can take into account threshold shifts in a hearing impaired individual, neither can account for sensorineural, suprathreshold degradations [12]. The accuracy of this model, based on cat anatomy and physiology, in predicting human speech intelligibility provides strong validation of attempts to design hearing aid amplification schemes based on physiological data and models [13].
By quantifying the hearing impairment in an intelligibility metric by way of a damaged auditory model, one can provide a more accurate assessment of the distortion, probe how the distortion is changing the neuronal response, and provide feedback for preprocessing via a hearing aid before the impairment. The NAI may also give insight into how the ear codes stimuli for the very robust human auditory system.
References
[1] French, N.R. & Steinberg, J.C. (1947) Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. 19:90-119.
[2] Kryter, K.D. (1962) Validation of the articulation index. J. Acoust. Soc. Am. 34:1698-1702.
[3] Kryter, K.D. (1962b) Methods for the calculation and use of the articulation index. J. Acoust. Soc. Am. 34:1689-1697.
[4] Houtgast, T. & Steeneken, H.J.M. (1973) The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28:66-73.
[5] Steeneken, H.J.M. & Houtgast, T. (1980) A physical method for measuring speech-transmission quality. J. Acoust. Soc. Am. 67(1):318-326.
[6] ANSI (1997) ANSI S3.5-1997 Methods for calculation of the speech intelligibility index. American National Standards Institute, New York.
[7] Bruce, I.C., Sachs, M.B. & Young, E.D. (2003) An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses. J. Acoust. Soc. Am. 113(1):369-388.
[8] Steeneken, H.J.M. (1992) On measuring and predicting speech intelligibility. Ph.D. Dissertation, University of Amsterdam.
[9] van Son, R.J.J.H., Binnenpoorte, D., van den Heuvel, H. & Pols, L.C.W. (2001) The IFA corpus: a phonemically segmented Dutch "open source" speech database. Eurospeech 2001 Poster http://145.18.230.99/corpus/index.html
[10] Fletcher, H. & Galt, R.H. (1950) The perception of speech and its relation to telephony. J. Acoust. Soc. Am. 22:89-151.
[11] Houtgast, T. & Verhave, J. (1991) A physical approach to speech quality assessment: correlation patterns in the speech spectrogram. Proc. Eurospeech 1991, Genova:285-288.
[12] van Schijndel, N.H., Houtgast, T. & Festen, J.M. (2001) Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 110(1):529-542.
[13] Sachs, M.B., Bruce, I.C., Miller, R.L. & Young, E.D. (2002) Biological basis of hearing-aid design. Ann. Biomed. Eng. 30:157-168.

same-paper 2 0.85644108 126 nips-2003-Measure Based Regularization

Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein

3 0.73845267 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

Author: Tong Zhang

Abstract: The purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions to binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification.
1 Motivation
Consider a binary classification problem where we want to predict label $y \in \{\pm 1\}$ based on observation $x$. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations $(X_1, Y_1), \ldots, (X_n, Y_n)$, a large margin classification algorithm produces a decision function $\hat{f}_n$ by empirically minimizing a loss function that is often a convex upper bound of the binary classification error function. Given $\hat{f}_n$, the binary decision rule is to predict $y = 1$ if $\hat{f}_n(x) \ge 0$, and to predict $y = -1$ otherwise (the decision rule at $\hat{f}_n(x) = 0$ is not important). In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function $\phi$ in a pre-chosen function class $C_n$:
$$\hat{f}_n = \arg\min_{f \in C_n} \frac{1}{n} \sum_{i=1}^{n} \phi(f(X_i) Y_i). \qquad (1)$$
Originally such a scheme was regarded as a compromise to avoid computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem.
The current view in the statistical literature interprets such methods as algorithms to obtain conditional probability estimates. For example, see [3, 6, 9, 11] for some related studies. This point of view allows people to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate. For example, see [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit. Although statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well understood due to many recent works such as those mentioned above, there are far fewer studies on risk minimization based multi-category problems that generalize the binary large margin method (1). The complexity of possible generalizations may be one reason. Another reason may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score) [footnote 1]. However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation generalizing (1) can be used to yield consistent classifiers in the large sample limit. An important step in this direction has recently been taken in [5], where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that there were a number of earlier proposals that were not consistent). The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit.
We shall see that there is a rich structure in risk minimization based multi-category classification formulations. Multi-category large margin methods have started to draw more attention recently. For example, in [2], learning bounds for some multi-category convex risk minimization methods were obtained, although the authors did not study possible choices of Bayes consistent formulations.
2 Multi-category classification
We consider the following K-class classification problem: we would like to predict the label $y \in \{1, \ldots, K\}$ of an input vector $x$. In this paper, we only consider the simplest scenario with 0-1 classification loss: we have a loss of 0 for correct prediction, and a loss of 1 for incorrect prediction. In binary classification, the class label can be determined using the sign of a decision function. This can be generalized to the K-class classification problem as follows: we consider K decision functions $f_c(x)$ where $c = 1, \ldots, K$ and we predict the label $y$ of $x$ as:
$$T(f(x)) = \arg\max_{c \in \{1, \ldots, K\}} f_c(x), \qquad (2)$$
where we denote by $f(x)$ the vector function $f(x) = [f_1(x), \ldots, f_K(x)]$. Note that if two or more components of $f$ achieve the same maximum value, then we may choose any of them as $T(f)$. In this framework, $f_c(x)$ is often regarded as a scoring function for category $c$ that is correlated with how likely $x$ belongs to category $c$ (compared with the remaining $K - 1$ categories). The classification error is given by:
$$\ell(f) = 1 - E_X P(Y = T(f(X)) \mid X).$$
Note that only the relative strength of $f_c$ compared with the alternatives is important. In particular, the decision rule given in (2) does not change when we add the same numerical quantity to each component of $f(x)$. This allows us to impose one constraint on the vector $f(x)$, which decreases the degree of freedom of the K-component vector $f(x)$ from $K$ to $K - 1$.
Footnote 1: This approach is often called one-versus-all or ranking in machine learning. Another main approach is to encode a multi-category classification problem into binary classification sub-problems. The consistency of such encoding schemes can be difficult to analyze, and we shall not discuss them.
For example, in the binary classification case, we can enforce $f_1(x) + f_2(x) = 0$, and hence $f(x)$ can be represented as $[f_1(x), -f_1(x)]$. The decision rule in (2), which compares $f_1(x) \ge f_2(x)$, is equivalent to $f_1(x) \ge 0$. This leads to the binary classification rule mentioned in the introduction. In the multi-category case, one may also interpret the possible constraint on the vector function $f$, which reduces its degree of freedom from $K$ to $K - 1$, based on the following reasoning. In many cases, we seek $f_c(x)$ as a function of $p(Y = c \mid x)$. Since we have a constraint $\sum_{c=1}^{K} p(Y = c \mid x) = 1$ (implying that the degree of freedom for $p(Y = c \mid x)$ is $K - 1$), the degree of freedom for $f$ is also $K - 1$ (instead of $K$). However, we shall point out that in the algorithms we formulate below, we may either enforce such a constraint that reduces the degree of freedom of $f$, or impose no constraint, which keeps the degree of freedom of $f$ at $K$. The advantage of the latter is that it allows the computation of each $f_c$ to be decoupled. It is thus much simpler both conceptually and numerically. Moreover, it directly handles multiple-label problems where we may assign each $x$ to multiple labels $y \in \{1, \ldots, K\}$. In this scenario, we do not have a constraint. In this paper, we consider an empirical risk minimization method to solve a multi-category problem, which is of the following general form:
$$\hat{f}_n = \arg\min_{f \in C_n} \frac{1}{n} \sum_{i=1}^{n} \Psi_{Y_i}(f(X_i)). \qquad (3)$$
As we shall see later, this method is a natural generalization of the binary classification method (1). Note that one may consider an even more general form with $\Psi_Y(f(X))$ replaced by $\Psi_Y(f(X), X)$, which we do not study in this paper.
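The decision rule (2) and its invariance to adding a constant to every score can be illustrated in a few lines. This is only a sketch: the function name `predict` is hypothetical, and ties are broken by taking the first maximizer, which the text allows since any maximizer may be chosen.

```python
import numpy as np

def predict(scores):
    """Decision rule (2): return the label in 1..K with the largest score f_c(x)."""
    return int(np.argmax(scores)) + 1   # np.argmax returns the first maximizer on ties

scores = np.array([0.2, 1.5, -0.3])
label = predict(scores)                 # label of the largest component
shifted_label = predict(scores + 5.0)   # same label: only relative scores matter

# Binary case with the constraint f_1 + f_2 = 0: label 1 is chosen iff f_1 >= 0.
binary_label = predict(np.array([0.7, -0.7]))
```

The shift invariance is what allows one constraint, such as the components summing to zero, to be imposed without changing any prediction.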
From the standard learning theory, one can expect that with appropriately chosen $C_n$, the solution $\hat{f}_n$ of (3) approximately minimizes the true risk $R(\hat{f})$ with respect to the unknown underlying distribution within the function class $C_n$,
$$R(f) = E_{X,Y} \Psi_Y(f(X)) = E_X L(P(\cdot \mid X), f(X)), \qquad (4)$$
where $P(\cdot \mid X) = [P(Y = 1 \mid X), \ldots, P(Y = K \mid X)]$ is the conditional probability, and
$$L(q, f) = \sum_{c=1}^{K} q_c \Psi_c(f). \qquad (5)$$
In order to understand the large sample behavior of the algorithm based on solving (3), we first need to understand the behavior of a function $f$ that approximately minimizes $R(f)$. We introduce the following definition (also referred to as classification calibrated in [1]):
Definition 2.1 Consider $\Psi_c(f)$ in (4). We say that the formulation is admissible (classification calibrated) on a closed set $\Omega \subseteq [-\infty, \infty]^K$ if the following conditions hold: $\forall c$, $\Psi_c(\cdot) : \Omega \to (-\infty, \infty]$ is bounded below and continuous; $\cap_c \{f : \Psi_c(f) < \infty\}$ is non-empty and dense in $\Omega$; $\forall q$, if $L(q, f^*) = \inf_f L(q, f)$, then $f_c^* = \sup_k f_k^*$ implies $q_c = \sup_k q_k$.
Since we allow $\Psi_c(f) = \infty$, we use the convention that $q_c \Psi_c(f) = 0$ when $q_c = 0$ and $\Psi_c(f) = \infty$. The following result relates the approximate minimization of the $\Psi$ risk to the approximate minimization of classification error:
Theorem 2.1 Let $B$ be the set of all Borel measurable functions. For a closed set $\Omega \subset [-\infty, \infty]^K$, let $B_\Omega = \{f \in B : \forall x, f(x) \in \Omega\}$. If $\Psi_c(\cdot)$ is admissible on $\Omega$, then for a Borel measurable distribution, $R(f) \to \inf_{g \in B_\Omega} R(g)$ implies $\ell(f) \to \inf_{g \in B} \ell(g)$.
Proof Sketch. First we show that the admissibility implies that $\forall \epsilon > 0$, $\exists \delta > 0$ such that $\forall q$ and $x$:
$$\inf_{q_c \le \sup_k q_k - \epsilon} \{L(q, f) : f_c = \sup_k f_k\} \ge \inf_{g \in \Omega} L(q, g) + \delta. \qquad (6)$$
If (6) does not hold, then $\exists \epsilon > 0$, and a sequence of $(c^m, f^m, q^m)$ with $f^m \in \Omega$ such that $f^m_{c^m} = \sup_k f^m_k$, $q^m_{c^m} \le \sup_k q^m_k - \epsilon$, and $L(q^m, f^m) - \inf_{g \in \Omega} L(q^m, g) \to 0$.
Taking a limit point of $(c^m, f^m, q^m)$, and using the continuity of $\Psi_c(\cdot)$, we obtain a contradiction (technical details handling the infinity case are skipped). Therefore (6) must be valid. Now we consider a vector function $f(x) \in B_\Omega$. Let $q(x) = P(\cdot \mid x)$. Given $X$, if $P(Y = T(q(X)) \mid X) \ge P(Y = T(f(X)) \mid X) + \epsilon$, then equation (6) implies that $L(q(X), f(X)) \ge \inf_{g \in \Omega} L(q(X), g) + \delta$. Therefore
$$\ell(f) - \inf_{g \in B} \ell(g) = E_X \left[ P(Y = T(q(X)) \mid X) - P(Y = T(f(X)) \mid X) \right]$$
$$\le \epsilon + E_X I\left( P(Y = T(q(X)) \mid X) - P(Y = T(f(X)) \mid X) > \epsilon \right)$$
$$\le \epsilon + E_X \frac{L_X(q(X), f(X)) - \inf_{g \in B_\Omega} L_X(q(X), g)}{\delta}$$
$$= \epsilon + \frac{R(f) - \inf_{g \in B_\Omega} R(g)}{\delta}.$$
In the above derivation we use $I$ to denote the indicator function. Since $\epsilon$ and $\delta$ are arbitrary, we obtain the theorem by letting $\epsilon \to 0$. □
Clearly, based on the above theorem, an admissible risk minimization formulation is suitable for multi-category classification problems. The classifier obtained from minimizing (3) can approach the Bayes error rate if we can show that with an appropriately chosen function class $C_n$, approximate minimization of (3) implies approximate minimization of (4). Learning bounds of this form have been very well studied in statistics and machine learning. For example, for large margin binary classification, such bounds can be found in [4, 7, 8, 10, 11, 1], where they were used to prove the consistency of various large margin methods. In order to achieve consistency, it is also necessary to take a sequence of function classes $C_n$ ($C_1 \subset C_2 \subset \cdots$) such that $\cup_n C_n$ is dense in the set of Borel measurable functions. The set $C_n$ has the effect of regularization, which ensures that $R(\hat{f}_n) \approx \inf_{f \in C_n} R(f)$. It follows that as $n \to \infty$, $R(\hat{f}_n) \xrightarrow{P} \inf_{f \in B} R(f)$. Theorem 2.1 then implies that $\ell(\hat{f}_n) \xrightarrow{P} \inf_{f \in B} \ell(f)$. The purpose of this paper is not to study similar learning bounds that relate approximate minimization of (3) to the approximate minimization of (4). See [2] for a recent investigation.
We shall focus on the choices of $\Psi$ that lead to admissible formulations. We pay special attention to the case that each $\Psi_c(f)$ is a convex function of $f$, so that the resulting formulation becomes computationally more tractable. Instead of working with the general form of $\Psi_c$ in (4), we focus on two specific choices listed in the next two sections.
3 Unconstrained formulations
We consider the unconstrained formulation with the following choice of $\Psi$:
$$\Psi_c(f) = \phi(f_c) + s\left( \sum_{k=1}^{K} t(f_k) \right), \qquad (7)$$
where $\phi$, $s$ and $t$ are appropriately chosen functions that are continuously differentiable. The first term, which has a relatively simple form, depends on the label $c$. The second term is independent of the label, and can be regarded as a normalization term. Note that this function is symmetric with respect to components of $f$. This choice treats all potential classes equally. It is also possible to treat different classes differently (e.g. replacing $\phi(f_c)$ by $\phi_c(f_c)$), which can be useful if we associate different classification loss to different kinds of errors.
3.1 Optimality equation and probability model
Using (7), the conditional true risk (5) can be written as:
$$L(q, f) = \sum_{c=1}^{K} q_c \phi(f_c) + s\left( \sum_{c=1}^{K} t(f_c) \right).$$
In the following, we study the property of the optimal vector $f^*$ that minimizes $L(q, f)$ for a fixed $q$. Given $q$, the optimal solution $f^*$ of $L(q, f)$ satisfies the following first order condition:
$$q_c \phi'(f_c^*) + \mu_{f^*} t'(f_c^*) = 0 \quad (c = 1, \ldots, K), \qquad (8)$$
where the quantity $\mu_{f^*} = s'\left( \sum_{k=1}^{K} t(f_k^*) \right)$ is independent of $c$. Clearly this equation relates $q_c$ to $f_c^*$ for each component $c$. The relationship of $q$ and $f^*$ defined by (8) can be regarded as the (infinite sample-size) probability model associated with the learning method (3) with $\Psi$ given by (7). The following result presents a simple criterion to check admissibility. We skip the proof for simplicity. Most of our examples satisfy the condition.
Proposition 3.1 Consider (7). Assume $\Psi_c(f)$ is continuous on $[-\infty, \infty]^K$ and bounded below.
If $s'(u) \ge 0$ and $\forall p > 0$, $p \phi'(f) + t'(f) = 0$ has a unique solution $f_p$ that is an increasing function of $p$, then the formulation is admissible. If $s(u) = u$, the condition $\forall p > 0$ in Proposition 3.1 can be replaced by $\forall p \in (0, 1)$.
3.2 Decoupled formulations
We let $s(u) = u$ in (7). The optimality condition (8) becomes
$$q_c \phi'(f_c^*) + t'(f_c^*) = 0 \quad (c = 1, \ldots, K). \qquad (9)$$
This means that we have $K$ decoupled equalities, one for each $f_c$. This is the simplest and, in the author's opinion, the most interesting formulation. Since the estimation problem in (3) is also decoupled into $K$ separate equations, one for each component of $\hat{f}_n$, this class of methods is computationally relatively simple and easy to parallelize. Although this method seems to be preferable for multi-category problems, it is not the most efficient way for the two-class problem (if we want to treat the two classes in a symmetric manner), since we have to solve two separate equations. We only need to deal with one equation in (1) due to the fact that an effective constraint $f_1 + f_2 = 0$ can be used to reduce the number of equations. This variable elimination has little impact if there are many categories. In the following, we list some examples of multi-category risk minimization formulations. They all satisfy the admissibility condition in Proposition 3.1. We focus on the relationship of the optimal function $f^*(q)$ and the conditional probability $q$. For simplicity, we focus on the choice $\phi(u) = -u$.
3.2.1 $\phi(u) = -u$ and $t(u) = e^u$
We obtain the following probability model: $q_c = e^{f_c^*}$. This formulation is closely related to the maximum-likelihood estimate with conditional model $q_c = e^{f_c} / \sum_{k=1}^{K} e^{f_k}$ (logistic regression). In particular, if we choose a function class such that the normalization condition $\sum_{k=1}^{K} e^{f_k} = 1$ holds, then the two formulations are identical. However, they become different when we do not impose such a normalization condition.
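The probability model of Section 3.2.1 can be checked numerically by solving the decoupled optimality condition (9) for each component. The sketch below assumes $\phi(u) = -u$, so that (9) reads $t'(f) = q$ with $t'$ increasing; the bisection solver, its bracket, and the name `solve_optimality` are illustrative choices, not part of the paper.

```python
import math

def solve_optimality(q, t_prime, lo=-50.0, hi=50.0, iters=200):
    """Bisection for t'(f) = q, i.e. condition (9) with phi(u) = -u (phi' = -1).

    Assumes t_prime is increasing and that the root lies in [lo, hi].
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if t_prime(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Section 3.2.1: t(u) = exp(u), so t'(u) = exp(u) and the solution is f_c = ln(q_c).
q = [0.5, 0.3, 0.2]
f_star = [solve_optimality(qc, math.exp) for qc in q]
recovered = [math.exp(fc) for fc in f_star]   # probability model q_c = exp(f_c)
```

Because the three entries of `q` sum to one, the recovered values also sum to one, which is the normalization condition under which the formulation coincides with logistic regression.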
Another very important and closely related formulation is the choice of $\phi(u) = -\ln u$ and $t(u) = u$. This is an extension of the maximum-likelihood estimate with probability model $q_c = f_c$. The resulting method is identical to maximum-likelihood if we choose our function class such that $\sum_k f_k = 1$. However, the formulation also allows us to use function classes that do not satisfy the normalization constraint $\sum_k f_k = 1$. Therefore this method is more flexible.
3.2.2 $\phi(u) = -u$ and $t(u) = \ln(1 + e^u)$
This version uses binary logistic regression loss, and we have the following probability model: $q_c = (1 + e^{-f_c^*})^{-1}$. Again this is an unnormalized model.
3.2.3 $\phi(u) = -u$ and $t(u) = \frac{1}{p} |u|^p$ ($p > 1$)
We obtain the following probability model: $q_c = \mathrm{sign}(f_c^*) |f_c^*|^{p-1}$. This means that at the solution, $f_c^* \ge 0$. One may modify it such that we allow $f_c^* \le 0$ to model the conditional probability $q_c = 0$.
3.2.4 $\phi(u) = -u$ and $t(u) = \frac{1}{p} \max(u, 0)^p$ ($p > 1$)
In this probability model, we have the following relationship: $q_c = \max(f_c^*, 0)^{p-1}$. The equation implies that we allow $f_c^* \le 0$ to model the conditional probability $q_c = 0$. Therefore, with a fixed function class, this model is more powerful than the previous one. However, at the optimal solution, $f_c^* \le 1$. This requirement can be further alleviated with the following modification.
3.2.5 $\phi(u) = -u$ and $t(u) = \frac{1}{p} \min(\max(u, 0)^p, p(u - 1) + 1)$ ($p > 1$)
In this probability model, we have the following relationship at the exact solution: $q_c = \min(\max(f_c^*, 0), 1)^{p-1}$. Clearly this model is more powerful than the previous model since the function value $f_c^* \ge 1$ can be used to model $q_c = 1$.
3.3 Coupled formulations
In the coupled formulation with $s(u) \ne u$, the probability model can be normalized in a certain way. We list a few examples.
3.3.1 $\phi(u) = -u$, $t(u) = e^u$, and $s(u) = \ln(u)$
This is the standard logistic regression model. The probability model is:
$$q_c(x) = \exp(f_c^*(x)) \left( \sum_{k=1}^{K} \exp(f_k^*(x)) \right)^{-1}.$$
The right hand side is always normalized (sums to 1). Note that the model is not continuous at infinities, and thus not admissible in our definition. However, we may consider the region $\Omega = \{f : \sup_k f_k = 0\}$, and it is easy to check that this model is admissible in $\Omega$. Let $f_c^\Omega = f_c - \sup_k f_k \in \Omega$; then $f^\Omega$ has the same decision rule as $f$ and $R(f) = R(f^\Omega)$. Therefore Theorem 2.1 implies that $R(f) \to \inf_{g \in B} R(g)$ implies $\ell(f) \to \inf_{g \in B} \ell(g)$.
3.3.2 $\phi(u) = -u$, $t(u) = |u|^{p'}$, and $s(u) = \frac{1}{p} |u|^{p/p'}$ ($p, p' > 1$)
The probability model is:
$$q_c(x) = \left( \sum_{k=1}^{K} |f_k^*(x)|^{p'} \right)^{(p - p')/p'} \mathrm{sign}(f_c^*(x)) |f_c^*(x)|^{p'-1}.$$
We may replace $t(u)$ by $t(u) = \max(0, u)^{p'}$, and the probability model becomes:
$$q_c(x) = \left( \sum_{k=1}^{K} \max(f_k^*(x), 0)^{p'} \right)^{(p - p')/p'} \max(f_c^*(x), 0)^{p'-1}.$$
These formulations do not seem to have advantages over the decoupled counterparts. Note that if we let $p \to 1$, then the sum of the $\frac{p'}{p'-1}$-th power of the right hand side $\to 1$. In a way, this means that the model is normalized in the limit of $p \to 1$.
4 Constrained formulations
As pointed out, one may impose constraints on possible choices of $f$. We may impose such a condition when we specify the function class $C_n$. However, for clarity, we shall directly impose a condition into our formulation. If we impose a constraint into (7), then its effect is rather similar to that of the second term in (7). In this section, we consider a direct extension of the binary large-margin method (1) to the multi-category case. The choice given below is motivated by [5], where an extension of SVM was proposed. We use a risk formulation that is different from (7), and for simplicity, we will consider a linear equality constraint only:
$$\Psi_c(f) = \sum_{k=1, k \ne c}^{K} \phi(-f_k), \quad \text{s.t. } f \in \Omega, \qquad (10)$$
where we define $\Omega$ as:
$$\Omega = \left\{ f : \sum_{k=1}^{K} f_k = 0 \right\} \cup \left\{ f : \sup_k f_k = \infty \right\}.$$
We may interpret the added constraint as a restriction on the function class $C_n$ in (3) such that every $f \in C_n$ satisfies the constraint. Note that with $K = 2$, this leads to the usual binary large margin method.
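A small numerical check of the constrained formulation (10) is sketched below for $K = 3$ with the hinge loss $\phi(u) = \max(0, 1 - u)$ studied in [5]. The code evaluates the conditional risk (5) under (10) on a grid of score vectors that sum to zero. The grid range, the example $q$, and the function names are assumptions made for illustration only.

```python
import numpy as np

def conditional_risk(q, f):
    """L(q, f) = sum_c q_c * Psi_c(f), with Psi_c(f) = sum_{k != c} phi(-f_k)."""
    hinge = np.maximum(0.0, 1.0 + f)   # phi(-f_k) = max(0, 1 + f_k) for every k
    return sum(q[c] * (hinge.sum() - hinge[c]) for c in range(len(q)))

def grid_minimizer(q, grid):
    """Brute-force search over f with f_1 + f_2 + f_3 = 0 (K = 3 only)."""
    best_f, best_val = None, np.inf
    for f1 in grid:
        for f2 in grid:
            f = np.array([f1, f2, -f1 - f2])   # enforce the sum-to-zero constraint
            val = conditional_risk(q, f)
            if val < best_val:
                best_f, best_val = f, val
    return best_f, best_val

q = np.array([0.5, 0.3, 0.2])
f_star, risk = grid_minimizer(q, np.linspace(-3.0, 3.0, 25))
```

For this `q`, the grid minimizer places its largest score on the most probable class, which is the behavior that admissibility requires of a minimizer of the conditional risk.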
Using (10), the conditional true risk (5) can be written as:
$$L(q, f) = \sum_{c=1}^{K} (1 - q_c) \phi(-f_c), \quad \text{s.t. } f \in \Omega. \qquad (11)$$
The following result provides a simple way to check the admissibility of (10).
Proposition 4.1 If $\phi$ is a convex function which is bounded below and $\phi'(0) < 0$, then (10) is admissible on $\Omega$.
Proof Sketch. The continuity condition is straightforward to verify. We may also assume that $\phi(\cdot) \ge 0$ without loss of generality. Now let $f$ achieve the minimum of $L(q, \cdot)$. If $f_c = \infty$, then it is clear that $q_c = 1$ and thus $q_k = 0$ for $k \ne c$. This implies that for $k \ne c$, $\phi(-f_k) = \inf_f \phi(-f)$, and thus $f_k < 0$. If $f_c = \sup_k f_k < \infty$, then the constraint implies $f_c \ge 0$. It is easy to see that $\forall k$, $q_c \ge q_k$, since otherwise we must have $\phi(-f_k) > \phi(-f_c)$, and thus $\phi'(-f_k) > 0$ and $\phi'(-f_c) < 0$, implying that with sufficiently small $\delta > 0$, $\phi(-(f_k + \delta)) < \phi(-f_k)$ and $\phi(-(f_c - \delta)) < \phi(-f_c)$. A contradiction. □
Using the above criterion, we can convert any admissible convex $\phi$ for the binary formulation (1) into an admissible multi-category classification formulation (10). In [5] the special case of SVM (with loss function $\phi(u) = \max(0, 1 - u)$) was studied. The authors demonstrated the admissibility by direct calculation, although no results similar to Theorem 2.1 were established. Such a result is needed to prove consistency. The treatment presented here generalizes their study. Note that for the constrained formulation, it is more difficult to relate $f_c$ at the optimal solution to a probability model, since such a model will have a much more complicated form compared with the unconstrained counterpart.
5 Conclusion
In this paper we proposed a family of risk minimization methods for multi-category classification problems, which are natural extensions of binary large margin classification methods. We established admissibility conditions that ensure the consistency of the obtained classifiers in the large sample limit.
Two specific forms of risk minimization were proposed and examples were given to study the induced probability models. As an implication of this work, we see that it is possible to obtain consistent (conditional) density estimation using various non-maximum likelihood estimation methods. One advantage of some of the newly proposed methods is that they allow us to model zero density directly. Note that for the maximum-likelihood method, near zero density may cause serious robustness problems, at least in theory.
References
[1] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Statistics Department, University of California, Berkeley, 2003.
[2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classification based on convex losses. In COLT, 2003.
[3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337-407, 2000. With discussion.
[4] W. Jiang. Process consistency for adaboost. The Annals of Statistics, 32, 2004. With discussion.
[5] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of American Statistical Association, 2002. Accepted.
[6] Yi Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, pages 259-275, 2002.
[7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. With discussion.
[8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713-741, 2003.
[9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297-336, 1999.
[10] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768-791, 2002.
[11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32, 2004. With discussion.

4 0.73042637 113 nips-2003-Learning with Local and Global Consistency

Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf

5 0.72992909 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

Author: Pedro J. Moreno, Purdy P. Ho, Nuno Vasconcelos

Abstract: Over the last years signiﬁcant efforts have been made to develop kernels that can be applied to sequence data such as DNA, text, speech, video and images. The Fisher Kernel and similar variants have been suggested as good ways to combine an underlying generative model in the feature space and discriminant classiﬁers such as SVM’s. In this paper we suggest an alternative procedure to the Fisher kernel for systematically ﬁnding kernel functions that naturally handle variable length sequence data in multimedia domains. In particular for domains such as speech and images we explore the use of kernel functions that take full advantage of well known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identiﬁcation/veriﬁcation and image classiﬁcation tasks and show that these new kernels have the best performance in speaker veriﬁcation and mostly outperform the Fisher kernel based SVM’s and the generative classiﬁers in speaker identiﬁcation and image classiﬁcation. 1

6 0.72902507 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

7 0.72793049 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

8 0.72556299 120 nips-2003-Locality Preserving Projections

9 0.72403699 143 nips-2003-On the Dynamics of Boosting

10 0.72184187 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

11 0.72057688 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

12 0.72050351 189 nips-2003-Tree-structured Approximations by Expectation Propagation

13 0.7204417 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

14 0.71970916 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

15 0.71867937 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

16 0.71851802 78 nips-2003-Gaussian Processes in Reinforcement Learning

17 0.71766281 107 nips-2003-Learning Spectral Clustering

18 0.71746314 122 nips-2003-Margin Maximizing Loss Functions

19 0.71659762 102 nips-2003-Large Scale Online Learning

20 0.71659261 30 nips-2003-Approximability of Probability Distributions