nips nips2001 nips2001-120 knowledge-graph by maker-knowledge-mining

120 nips-2001-Minimax Probability Machine


Source: pdf

Author: Gert Lanckriet, Laurent E. Ghaoui, Chiranjib Bhattacharyya, Michael I. Jordan

Abstract: When constructing a classifier, the probability of correct classification of future data points should be maximized. In the current paper this desideratum is translated in a very direct way into an optimization problem, which is solved using methods from convex optimization. We also show how to exploit Mercer kernels in this setting to obtain nonlinear decision boundaries. A worst-case bound on the probability of misclassification of future data is obtained explicitly. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract When constructing a classifier, the probability of correct classification of future data points should be maximized. [sent-12, score-0.074]

2 In the current paper this desideratum is translated in a very direct way into an optimization problem, which is solved using methods from convex optimization. [sent-13, score-0.084]

3 We also show how to exploit Mercer kernels in this setting to obtain nonlinear decision boundaries. [sent-14, score-0.125]

4 A worst-case bound on the probability of misclassification of future data is obtained explicitly. [sent-15, score-0.1]

5 1 Introduction Consider the problem of choosing a linear discriminant by minimizing the probabilities that data vectors fall on the wrong side of the boundary. [sent-16, score-0.08]

6 One way to attempt to achieve this is via a generative approach in which one makes distributional assumptions about the class-conditional densities and thereby estimates and controls the relevant probabilities. [sent-17, score-0.191]

7 The need to make distributional assumptions, however, casts doubt on the generality and validity of such an approach, and in discriminative solutions to classification problems it is common to attempt to dispense with class-conditional densities entirely. [sent-18, score-0.245]

8 Rather than avoiding any reference to class-conditional densities, it might be useful to attempt to control misclassification probabilities in a worst-case setting; that is, under all possible choices of class-conditional densities. [sent-19, score-0.111]

9 Such a minimax approach could be viewed as providing an alternative justification for discriminative approaches. [sent-20, score-0.19]

10 In this paper we show how such a minimax programme can be carried out in the setting of binary classification. [sent-21, score-0.219]

11 edur gert/ and Sethuraman [2]: where y is a random vector, where a and b are constants, and where the supremum is taken over all distributions having mean y and covariance matrix ~y. [sent-25, score-0.037]

12 This theorem provides us with the ability to bound the probability of misclassifying a point, without making Gaussian or other specific distributional assumptions. [sent-26, score-0.071]

13 We will show how to exploit this ability in the design of linear classifiers. [sent-27, score-0.024]

14 One of the appealing features of this formulation is that one obtains an explicit upper bound on the probability of misclassification of future data: 1/(1 + rP). [sent-28, score-0.153]

15 A second appealing feature of this approach is that, as in linear discriminant analysis [7], it is possible to generalize the basic methodology, utilizing Mercer kernels and thereby forming nonlinear decision boundaries. [sent-29, score-0.201]

16 The paper is organized as follows: in Section 2 we present the minimax formulation for linear classifiers, while in Section 3 we deal with kernelizing the method. [sent-31, score-0.193]

17 2 Maximum probabilistic decision hyperplane In this section we present our minimax formulation for linear decision boundaries. [sent-33, score-0.353]

18 We want to determine the hyperplane aT z = b (a, z E JRn and b E JR) that separates the two classes of points with maximal probability with respect to all distributions having these means and covariance matrices. [sent-36, score-0.16]

19 a ,a ,b inf Pr{ aT x 2: b} 2: a (2) or, max a a,a,b s. [sent-39, score-0.115]

20 1 - a 2: sup Pr{ aT x 1- a :s b} (3) 2: sup Pr{aT y 2: b} . [sent-41, score-0.082]

21 Recall the result of Bertsimas and Sethuraman [2]: 1 supPr{aTY2:b}=-d2' with 1+ d2 = inf (Y_Yf~y-1(y_y) (4) aTy? [sent-43, score-0.051]

22 b We can write this as d2 = infcTw>d w Tw, where w = ~y -1 /2 (y_y), c T = aT~y 1/2 and d = b - aTy. [sent-44, score-0.06]

23 To solve this,-first notice that we can assume that aTy :S b (i. [sent-45, score-0.07]

24 y is classified correctly by the decision hyperplane aT z = b): indeed, otherwise we would find d2 = 0 and thus a = 0 for that particular a and b, which can never be an optimal value. [sent-47, score-0.236]

25 (d - c T w), (5) which is to be maximized with respect to A 2: 0 and minimized with respect to w . [sent-51, score-0.044]

26 (6) Using (4), the second constraint in (3) becomes 1-0: 2: 1/(I+d2 ) or ~ 2: 0:/(1-0:). [sent-54, score-0.049]

27 Taking (6) into account, this boils down to: b-aTY2:,,(o:)/aT~ya V where ,,(0:)=) 0: 1-0: (7) We can handle the first constraint in (3) in a similar way (just write aT x ::::: b as _aT x 2: -b and apply the result (7) for the second constraint). [sent-55, score-0.168]

28 The optimization problem (3) then becomes: max 0: -b + aTx 2: ,,(o:)JaT~xa s. [sent-56, score-0.151]

29 Because "(0:) is a monotone increasing function of 0:, we can write this as: max" (9) s. [sent-59, score-0.103]

30 From both constraints in (9), we get aTy + "JaT~ya::::: b::::: aTx - "JaT~xa, (10) which allows us to eliminate b from (9): max" I<,a s. [sent-62, score-0.051]

31 (11) Because we want to maximize ", it is obvious that the inequalities in (10) will become equalities at the optimum. [sent-65, score-0.045]

32 The optimal value of b will thus be given by (12) where a* and "* are the optimal values of a and " respectively. [sent-66, score-0.066]

33 Rearranging the constraint in (11), we get: aT(x - y) 2:" (JaT~xa+ JaT~ya). [sent-67, score-0.049]

34 (13) The above is positively homogeneous in a: if a satisfies (13), sa with s E 114 also does. [sent-68, score-0.023]

35 The optimization problem (11) then becomes max" I<,a s. [sent-71, score-0.087]

36 ~ 2: JaT~xa + JaT~ya (14) a T (x-Y)=I , which allows us to eliminate ,,: m~n JaT~xa + JaT~ya s. [sent-73, score-0.03]

37 aT(x - y) = 1, (15) or, equivalently (16) This is a convex optimization problem, more precisely a second order cone program (SOCP) [8,5]. [sent-75, score-0.155]

38 Furthermore, notice that we can write a = ao +Fu, where U E Il~n-l, ao = (x - y)/llx - y112, and F E IRnx (n-l) is an orthogonal matrix whose columns span the subspace of vectors orthogonal to x - y. [sent-76, score-0.422]

39 Using this we can write (16) as an unconstrained SOCP: (17) We can solve this problem in various ways, for example using interior-point methods for SOCP [8], which yield a worst-case complexity of O(n 3 ). [sent-77, score-0.147]

40 Of course, the first and second moments of x, y must be estimated from data, using for example plug-in estimates X, y, :Ex, :E y for respectively x, y, ~x, ~y. [sent-78, score-0.024]

41 This brings the total complexity to O(ln 3 ), where l is the number of data points. [sent-79, score-0.021]

42 This is the same complexity as the quadratic programs one has to solve in support vector machines. [sent-80, score-0.062]

43 In our implementations, we took an iterative least-squares approach, which is based on the following form , equivalent to (17): (18) At iteration k , we first minimize with respect to 15 and E by setting 15k = II~x 1/2(ao + Ek = II~y 1/2(ao + Fu k - 1)112. [sent-81, score-0.047]

44 Then we minimize with respect to U by solving a least squares problem in u for 15 = 15k and E = Ek, which gives us Uk. [sent-82, score-0.047]

45 Because in both update steps the objective of this COP will not increase, the iteration will converge to the global minimum II~xl/2(ao + Fu*)112 + II~yl /2(ao + Fu*)lb with u* an optimal value of u. [sent-83, score-0.068]

46 Fu k - d112 and We then obtain a* as ao + Fu* and b* from (12) with "'* = l/h/ar~xa* + Jar~ya*). [sent-84, score-0.12]

47 Classification of a new data point Zn ew is done by evaluating sign( a;; Zn ew - b*): if this is +1, Zn ew is classified as from class x, otherwise Zn ew is classified as from class y. [sent-85, score-0.516]

48 It is interesting to see what happens if we make distributional assumptions; in particular, let us assume that x "" N(x, ~x) and y "" N(y, ~y). [sent-86, score-0.071]

49 This leads to the following optimization problem: max a o:, a ,b S. [sent-87, score-0.126]

50 We thus solve the same optimization l~a problem (a disappears from the optimization problem because ",(a) is monotone increasing) and find the same decision hyperplane aT z = b. [sent-92, score-0.412]

51 The difference lies in the value of a associated with "'*: a will be higher in this case, so the hyperplane will have a higher predicted probability of classifying future data correctly. [sent-93, score-0.101]

52 Kernelization 3 In this section we describe the "kernelization" of the minimax approach described in the previous section. [sent-94, score-0.143]

53 We seek to map the problem to a higher dimensional feature space ]Rf via a mapping cP : ]Rn 1-+ ]Rf, such that a linear discriminant in the feature space corresponds to a nonlinear discriminant in the original space. [sent-95, score-0.157]

54 To carry out this programme, we need to try to reformulate the minimax problem in terms of a kernel function K(Z1' Z2) = cp(Z1)T CP(Z2) satisfying Mercer's condition. [sent-96, score-0.188]

55 The decision hyperplane in ]Rf is then given by aT cp(Z) = b with a, cp(z) E ]Rf and b E ]R. [sent-98, score-0.132]

56 In ]Rf, we need to solve the following optimization problem: mln Jr-aT-~-cp-(-x)-a + JaT~cp(y)a aT (cp(X) - cp(y)) = 1, s. [sent-99, score-0.103]

57 (20) where, as in (12), the optimal value of b will be given by b* = a; cp(x) - "'*Jar~cp(x)a* + "'*Jar~cp(y)a*, = a; cp(y) (21) where a* and "'* are the optimal values of a and '" respectively. [sent-101, score-0.066]

58 However, we do not wish to solve the COP in this form, because we want to avoid using f or cp explicitly. [sent-102, score-0.683]

59 If a has a component in ]Rf which is orthogonal to the subspace spanned by CP(Xi), i = 1,2, . [sent-103, score-0.058]

60 , Ny, then that component won't affect the objective or the constraint in (20) . [sent-109, score-0.084]

61 2:i~1(CP(Yi) - cp(y))(cp(Yi) - cp(y))T for the means and the covariance matri- ces in the objective and the constraint of the optimization problem (20), we see that both the objective and the constraints can be written in terms of the kernel function K(Zl' Z2) = CP(Z1)T cp(Z2) . [sent-143, score-0.243]

62 ~~) = (*x) (24) Ky -lNy ky Ky where 1m is a column vector with ones of dimension m. [sent-164, score-0.195]

63 Kx and Ky contain respectively the first N x rows and the last Ny rows of the Gram matrix K (defined as Kij = cp(zdTcp(zj) = K(Zi,Zj)). [sent-165, score-0.074]

64 We can also write (23) as K= - Kx I m~n II ~"f12 Ky + II. [sent-166, score-0.06]

65 T - - "f (kx - ky) = 1, (25) which is a second order cone program (SOCP) [5] that has the same form as the SOCP in (16) and can thus be solved in a similar way. [sent-169, score-0.071]

66 Thus the dimension of the optimization problem increases, but the solution is more powerful because the kernelization corresponds to a more complex decision boundary in ~n . [sent-171, score-0.229]

67 Similarly, the optimal value b* of b in (21) will then become (26) "'* are the optimal values of "f and", respectively. [sent-172, score-0.066]

68 Once "f* is known, we get "'* = 1/ ( J~z "f;K~Kx"f* + J~y "f;K~Ky"f* ) and then where "f* and b* from (26). [sent-173, score-0.021]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cp', 0.619), ('jat', 0.412), ('ky', 0.195), ('kx', 0.177), ('xa', 0.152), ('socp', 0.147), ('ya', 0.143), ('minimax', 0.143), ('aty', 0.14), ('berkeley', 0.121), ('ao', 0.12), ('atx', 0.118), ('ny', 0.108), ('rf', 0.103), ('fu', 0.1), ('ew', 0.093), ('zn', 0.093), ('jar', 0.088), ('kernelization', 0.088), ('hyperplane', 0.078), ('misclassification', 0.077), ('yi', 0.076), ('zi', 0.071), ('distributional', 0.071), ('eecs', 0.07), ('max', 0.064), ('optimization', 0.062), ('write', 0.06), ('xi', 0.059), ('bertsimas', 0.059), ('boils', 0.059), ('sethuraman', 0.059), ('discriminant', 0.055), ('decision', 0.054), ('cop', 0.051), ('programme', 0.051), ('gert', 0.051), ('inf', 0.051), ('jrn', 0.051), ('classification', 0.051), ('classified', 0.05), ('constraint', 0.049), ('mercer', 0.048), ('cone', 0.047), ('california', 0.043), ('monotone', 0.043), ('densities', 0.042), ('sup', 0.041), ('yj', 0.041), ('solve', 0.041), ('pr', 0.037), ('covariance', 0.037), ('ek', 0.036), ('orthogonal', 0.035), ('objective', 0.035), ('attempt', 0.034), ('optimal', 0.033), ('eliminate', 0.03), ('appealing', 0.029), ('notice', 0.029), ('ii', 0.029), ('zj', 0.026), ('lb', 0.026), ('nz', 0.026), ('chiranjib', 0.026), ('kernelizing', 0.026), ('ca', 0.025), ('setting', 0.025), ('rows', 0.025), ('problem', 0.025), ('exploit', 0.024), ('discriminative', 0.024), ('formulation', 0.024), ('program', 0.024), ('respectively', 0.024), ('future', 0.023), ('casts', 0.023), ('positively', 0.023), ('laurent', 0.023), ('justification', 0.023), ('want', 0.023), ('assumptions', 0.023), ('subspace', 0.023), ('evaluating', 0.023), ('nonlinear', 0.022), ('respect', 0.022), ('convex', 0.022), ('tw', 0.022), ('pw', 0.022), ('kij', 0.022), ('disappears', 0.022), ('equalities', 0.022), ('jordan', 0.021), ('thereby', 0.021), ('otherwise', 0.021), ('complexity', 0.021), ('get', 0.021), ('utilizing', 0.02), ('rearranging', 0.02), ('reformulate', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 120 nips-2001-Minimax Probability Machine

Author: Gert Lanckriet, Laurent E. Ghaoui, Chiranjib Bhattacharyya, Michael I. Jordan

Abstract: When constructing a classifier, the probability of correct classification of future data points should be maximized. In the current paper this desideratum is translated in a very direct way into an optimization problem, which is solved using methods from convex optimization. We also show how to exploit Mercer kernels in this setting to obtain nonlinear decision boundaries. A worst-case bound on the probability of misclassification of future data is obtained explicitly. 1

2 0.079786278 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

Author: Carlotta Domeniconi, Dimitrios Gunopulos

Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on t he assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1

3 0.06681107 171 nips-2001-Spectral Relaxation for K-means Clustering

Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon

Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1

4 0.059425611 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation

Author: Christophe Andrieu, Nando D. Freitas, Arnaud Doucet

Abstract: In this paper, we extend the Rao-Blackwellised particle filtering method to more complex hybrid models consisting of Gaussian latent variables and discrete observations. This is accomplished by augmenting the models with artificial variables that enable us to apply Rao-Blackwellisation. Other improvements include the design of an optimal importance proposal distribution and being able to swap the sampling an selection steps to handle outliers. We focus on sequential binary classifiers that consist of linear combinations of basis functions , whose coefficients evolve according to a Gaussian smoothness prior. Our results show significant improvements. 1

5 0.057646312 37 nips-2001-Associative memory in realistic neuronal networks

Author: Peter E. Latham

Abstract: Almost two decades ago , Hopfield [1] showed that networks of highly reduced model neurons can exhibit multiple attracting fixed points, thus providing a substrate for associative memory. It is still not clear, however, whether realistic neuronal networks can support multiple attractors. The main difficulty is that neuronal networks in vivo exhibit a stable background state at low firing rate, typically a few Hz. Embedding attractor is easy; doing so without destabilizing the background is not. Previous work [2, 3] focused on the sparse coding limit, in which a vanishingly small number of neurons are involved in any memory. Here we investigate the case in which the number of neurons involved in a memory scales with the number of neurons in the network. In contrast to the sparse coding limit, we find that multiple attractors can co-exist robustly with a stable background state. Mean field theory is used to understand how the behavior of the network scales with its parameters, and simulations with analog neurons are presented. One of the most important features of the nervous system is its ability to perform associative memory. It is generally believed that associative memory is implemented using attractor networks - experimental studies point in that direction [4- 7], and there are virtually no competing theoretical models. Perhaps surprisingly, however, it is still an open theoretical question whether attractors can exist in realistic neuronal networks. The

6 0.053279866 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

7 0.05024866 170 nips-2001-Spectral Kernel Methods for Clustering

8 0.049679901 74 nips-2001-Face Recognition Using Kernel Methods

9 0.048398249 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images

10 0.046360288 40 nips-2001-Batch Value Function Approximation via Support Vectors

11 0.045144297 136 nips-2001-On the Concentration of Spectral Properties

12 0.043741602 143 nips-2001-PAC Generalization Bounds for Co-training

13 0.043506544 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

14 0.04339302 56 nips-2001-Convolution Kernels for Natural Language

15 0.043296263 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

16 0.043198273 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

17 0.042919304 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation

18 0.042249557 164 nips-2001-Sampling Techniques for Kernel Methods

19 0.040632244 62 nips-2001-Duality, Geometry, and Support Vector Regression

20 0.04041221 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.117), (1, 0.043), (2, 0.006), (3, -0.041), (4, 0.059), (5, -0.01), (6, 0.01), (7, -0.019), (8, -0.041), (9, 0.021), (10, 0.035), (11, 0.004), (12, 0.036), (13, 0.012), (14, 0.114), (15, -0.034), (16, -0.031), (17, -0.022), (18, -0.011), (19, -0.022), (20, -0.043), (21, -0.007), (22, -0.003), (23, 0.014), (24, -0.02), (25, -0.063), (26, -0.018), (27, 0.075), (28, 0.004), (29, 0.039), (30, -0.039), (31, 0.025), (32, 0.001), (33, -0.143), (34, 0.029), (35, 0.083), (36, 0.046), (37, -0.036), (38, 0.065), (39, 0.065), (40, 0.087), (41, -0.026), (42, -0.071), (43, -0.004), (44, -0.049), (45, -0.08), (46, 0.047), (47, -0.028), (48, -0.151), (49, -0.097)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93068254 120 nips-2001-Minimax Probability Machine

Author: Gert Lanckriet, Laurent E. Ghaoui, Chiranjib Bhattacharyya, Michael I. Jordan

Abstract: When constructing a classifier, the probability of correct classification of future data points should be maximized. In the current paper this desideratum is translated in a very direct way into an optimization problem, which is solved using methods from convex optimization. We also show how to exploit Mercer kernels in this setting to obtain nonlinear decision boundaries. A worst-case bound on the probability of misclassification of future data is obtained explicitly. 1

2 0.53805351 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images

Author: James M. Coughlan, Alan L. Yuille

Abstract: We describe the g-factor, which relates probability distributions on image features to distributions on the images themselves. The g-factor depends only on our choice of features and lattice quantization and is independent of the training image data. We illustrate the importance of the g-factor by analyzing how the parameters of Markov Random Field (i.e. Gibbs or log-linear) probability models of images are learned from data by maximum likelihood estimation. In particular, we study homogeneous MRF models which learn image distributions in terms of clique potentials corresponding to feature histogram statistics (d. Minimax Entropy Learning (MEL) by Zhu, Wu and Mumford 1997 [11]) . We first use our analysis of the g-factor to determine when the clique potentials decouple for different features . Second, we show that clique potentials can be computed analytically by approximating the g-factor. Third, we demonstrate a connection between this approximation and the Generalized Iterative Scaling algorithm (GIS), due to Darroch and Ratcliff 1972 [2], for calculating potentials. This connection enables us to use GIS to improve our multinomial approximation, using Bethe-Kikuchi[8] approximations to simplify the GIS procedure. We support our analysis by computer simulations. 1

3 0.46220696 180 nips-2001-The Concave-Convex Procedure (CCCP)

Author: Alan L. Yuille, Anand Rangarajan

Abstract: We introduce the Concave-Convex procedure (CCCP) which constructs discrete time iterative dynamical systems which are guaranteed to monotonically decrease global optimization/energy functions. It can be applied to (almost) any optimization problem and many existing algorithms can be interpreted in terms of CCCP. In particular, we prove relationships to some applications of Legendre transform techniques. We then illustrate CCCP by applications to Potts models, linear assignment, EM algorithms, and Generalized Iterative Scaling (GIS). CCCP can be used both as a new way to understand existing optimization algorithms and as a procedure for generating new algorithms. 1

4 0.43709734 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

Author: Patrick Haffner

Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depends upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1

5 0.40325594 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation

Author: Christophe Andrieu, Nando D. Freitas, Arnaud Doucet

Abstract: In this paper, we extend the Rao-Blackwellised particle filtering method to more complex hybrid models consisting of Gaussian latent variables and discrete observations. This is accomplished by augmenting the models with artificial variables that enable us to apply Rao-Blackwellisation. Other improvements include the design of an optimal importance proposal distribution and being able to swap the sampling an selection steps to handle outliers. We focus on sequential binary classifiers that consist of linear combinations of basis functions , whose coefficients evolve according to a Gaussian smoothness prior. Our results show significant improvements. 1

6 0.39561385 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

7 0.37497818 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

8 0.37329525 171 nips-2001-Spectral Relaxation for K-means Clustering

9 0.37095696 136 nips-2001-On the Concentration of Spectral Properties

10 0.36410579 62 nips-2001-Duality, Geometry, and Support Vector Regression

11 0.33990556 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

12 0.33023557 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation

13 0.3268345 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

14 0.32200092 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

15 0.31653327 190 nips-2001-Thin Junction Trees

16 0.30850181 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

17 0.30265722 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

18 0.30224764 36 nips-2001-Approximate Dynamic Programming via Linear Programming

19 0.3018727 155 nips-2001-Quantizing Density Estimators

20 0.28176138 172 nips-2001-Speech Recognition using SVMs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.326), (14, 0.028), (17, 0.046), (19, 0.031), (27, 0.134), (30, 0.046), (59, 0.048), (72, 0.083), (79, 0.033), (83, 0.011), (91, 0.118)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.84515744 120 nips-2001-Minimax Probability Machine

Author: Gert Lanckriet, Laurent E. Ghaoui, Chiranjib Bhattacharyya, Michael I. Jordan

Abstract: When constructing a classifier, the probability of correct classification of future data points should be maximized. In the current paper this desideratum is translated in a very direct way into an optimization problem, which is solved using methods from convex optimization. We also show how to exploit Mercer kernels in this setting to obtain nonlinear decision boundaries. A worst-case bound on the probability of misclassification of future data is obtained explicitly. 1

2 0.73362291 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs

Author: Martin J. Wainwright, Tommi Jaakkola, Alan S. Willsky

Abstract: We develop a tree-based reparameterization framework that provides a new conceptual view of a large class of iterative algorithms for computing approximate marginals in graphs with cycles. It includes belief propagation (BP), which can be reformulated as a very local form of reparameterization. More generally, we consider algorithms that perform exact computations over spanning trees of the full graph. On the practical side, we find that such tree reparameterization (TRP) algorithms have convergence properties superior to BP. The reparameterization perspective also provides a number of theoretical insights into approximate inference, including a new characterization of fixed points; and an invariance intrinsic to TRP /BP. These two properties enable us to analyze and bound the error between the TRP /BP approximations and the actual marginals. While our results arise naturally from the TRP perspective, most of them apply in an algorithm-independent manner to any local minimum of the Bethe free energy. Our results also have natural extensions to more structured approximations [e.g. , 1, 2]. 1

3 0.54657811 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

Author: Marzia Polito, Pietro Perona

Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1

4 0.54233563 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

5 0.53917408 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.       ¤ ¨ ¦ ¢ ©§¥¤£¡ ¦ ¤ ¨ £¡ ¨ ¤¢  ¢

6 0.53839082 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

7 0.53799719 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

8 0.53690213 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

9 0.53388053 36 nips-2001-Approximate Dynamic Programming via Linear Programming

10 0.53328872 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

11 0.53315949 121 nips-2001-Model-Free Least-Squares Policy Iteration

12 0.53268689 114 nips-2001-Learning from Infinite Data in Finite Time

13 0.53244251 143 nips-2001-PAC Generalization Bounds for Co-training

14 0.53173667 8 nips-2001-A General Greedy Approximation Algorithm with Applications

15 0.53065085 84 nips-2001-Global Coordination of Local Linear Models

16 0.53051889 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

17 0.53022498 155 nips-2001-Quantizing Density Estimators

18 0.52976167 58 nips-2001-Covariance Kernels from Bayesian Generative Models

19 0.52962184 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

20 0.52870876 61 nips-2001-Distribution of Mutual Information