nips nips2003 nips2003-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kenji Fukumizu, Francis R. Bach, Michael I. Jordan
Abstract: We propose a novel method of dimensionality reduction for supervised learning. Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the problem of dimensionality reduction as that of finding a low-dimensional “effective subspace” of X which retains the statistical relationship between X and Y . We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y . 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a novel method of dimensionality reduction for supervised learning. [sent-9, score-0.387]
2 Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the problem of dimensionality reduction as that of finding a low-dimensional “effective subspace” of X which retains the statistical relationship between X and Y . [sent-10, score-0.551]
3 We show that this problem can be formulated in terms of conditional independence. [sent-11, score-0.175]
4 To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. [sent-12, score-0.748]
5 Unlike many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y . [sent-13, score-0.316]
6 1 Introduction Many statistical learning problems involve some form of dimensionality reduction. [sent-14, score-0.194]
7 The goal may be one of feature selection, in which we aim to find linear or nonlinear combinations of the original set of variables, or one of variable selection, in which we wish to select a subset of variables from the original set. [sent-15, score-0.146]
8 Motivations for such dimensionality reduction include providing a simplified explanation and visualization for a human, suppressing noise so as to make a better prediction or decision, or reducing the computational burden. [sent-16, score-0.416]
9 We study dimensionality reduction for supervised learning, in which the data consists of (X, Y ) pairs, where X is an m-dimensional explanatory variable and Y is an ℓ-dimensional response. [sent-17, score-0.486]
10 We refer to these problems generically as “regression,” which indicates our focus on the conditional probability density pY |X (y|x). [sent-19, score-0.175]
11 We wish to solve a problem of feature selection in which the features are linear combinations of the components of X. [sent-21, score-0.105]
12 In particular, we assume that there is an r-dimensional subspace $S \subset \mathbb{R}^m$ such that the following equality holds for all x and y: $p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x)$, (1) where $\Pi_S$ is the orthogonal projection of $\mathbb{R}^m$ onto S. [sent-22, score-0.255]
13 The subspace S is called the effective subspace for regression. [sent-23, score-0.427]
14 We approach the problem within a semiparametric statistical framework—we make no assumptions regarding the conditional distribution pY |ΠS X (y|ΠS x) or the distribution pX (x) of X. [sent-25, score-0.269]
15 Having found an effective subspace, we may then proceed to build a parametric or nonparametric regression model on that subspace. [sent-26, score-0.172]
16 Thus our approach is an explicit dimensionality reduction method for supervised learning that does not require any particular form of regression model; it can be used as a preprocessor for any supervised learner. [sent-27, score-0.502]
17 Most conventional approaches to dimensionality reduction make specific assumptions regarding the conditional distribution pY |ΠS X (y|ΠS x), the marginal distribution pX (x), or both. [sent-28, score-0.651]
18 For example, classical two-layer neural networks can be seen as attempting to estimate an effective subspace in their first layer, using a specific model for the regressor. [sent-29, score-0.268]
19 Similar comments apply to projection pursuit regression [1] and ACE [2], which assume an additive model $E[Y|X] = g_1(\beta_1^T X) + \cdots + g_K(\beta_K^T X)$. [sent-30, score-0.132]
20 While canonical correlation analysis (CCA) and partial least squares (PLS, [3]) can be used for dimensionality reduction in regression, they make a linearity assumption and place strong restrictions on the allowed dimensionality. [sent-31, score-0.362]
21 The line of research that is closest to our work is sliced inverse regression (SIR, [4]) and related methods including principal Hessian directions (pHd, [5]). [sent-32, score-0.125]
22 SIR is a semiparametric method that can find effective subspaces, but only under strong assumptions of ellipticity for the marginal distribution pX (x). [sent-33, score-0.259]
23 If these assumptions do not hold, there is no guarantee of finding the effective subspace. [sent-35, score-0.149]
24 In this paper we present a novel semiparametric method for dimensionality reduction that we refer to as Kernel Dimensionality Reduction (KDR). [sent-36, score-0.389]
25 KDR is based on a particular class of operators on reproducing kernel Hilbert spaces (RKHS, [6]). [sent-37, score-0.312]
26 In distinction to algorithms such as the support vector machine and kernel PCA [7, 8], KDR cannot be viewed as a “kernelization” of an underlying linear algorithm. [sent-38, score-0.118]
27 Rather, we relate dimensionality reduction to conditional independence of variables, and use RKHSs to provide characterizations of conditional independence and thereby design objective functions for optimization. [sent-39, score-0.951]
28 This builds on the earlier work of [9], who used RKHSs to characterize marginal independence of variables. [sent-40, score-0.2]
29 Our characterization of conditional independence is a significant extension, requiring rather different mathematical tools—the covariance operators on RKHSs that we present in Section 2. [sent-41, score-0.47]
30 2 Kernel method of dimensionality reduction for regression Dimensionality reduction and conditional independence The problem discussed in this paper is to find the effective subspace S defined by Eq. (1). [sent-44, score-1.1]
31 We are given an i.i.d. sample $\{(X_i, Y_i)\}_{i=1}^n$, sampled from the conditional probability of Eq. (1) and the marginal distribution of X. [sent-48, score-0.175]
32 The crux of the problem is that we have no a priori knowledge of the regressor, and place no assumptions on the conditional probability pY |X or the marginal probability pX . [sent-50, score-0.271]
33 We do not address the problem of choosing the dimensionality r in this paper—in practical applications of KDR any of a variety of model selection methods such as cross-validation can be reasonably considered. [sent-51, score-0.242]
34 Rather our focus is on the problem of finding the effective subspace for a given choice of dimensionality. [sent-52, score-0.268]
35 The notion of effective subspace can be formulated in terms of conditional independence. [sent-53, score-0.443]
36 Let Q = (B, C) be an m × m orthogonal matrix such that the column vectors of B span the subspace S (thus B is m × r and C is m × (m − r)), and define $U = B^T X$ and $V = C^T X$. [sent-54, score-0.203]
37 Then Eq. (1) can be expressed as the conditional independence $Y \perp\!\!\!\perp V \mid U$. (2) Figure 1: Graphical representation of dimensionality reduction for regression; X is decomposed as X = (U, V), and Y is independent of V given U. [sent-58, score-0.335]
38 This shows that the effective subspace S is the one which makes Y and V conditionally independent given U (see Figure 1). [sent-59, score-0.268]
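As a concrete illustration of the decomposition X = (U, V) used above, the following sketch (our own code and variable names, with a random rather than estimated Q) shows how B, C, U, and V fit together:

```python
import numpy as np

# Illustration only: build an orthogonal matrix Q = (B, C) whose first r columns
# span a candidate subspace S, and form U = B^T X and V = C^T X for each sample.
rng = np.random.default_rng(0)
m, r, n = 5, 2, 100                                  # ambient dim, subspace dim, sample size

Q, _ = np.linalg.qr(rng.standard_normal((m, m)))     # random m x m orthogonal matrix
B, C = Q[:, :r], Q[:, r:]                            # B is m x r, C is m x (m - r)

X = rng.standard_normal((n, m))                      # rows are observations of X
U, V = X @ B, X @ C                                  # row-wise B^T x and C^T x

# Since Q is orthogonal, (U, V) is just X expressed in a rotated basis.
assert np.allclose((X**2).sum(axis=1), (U**2).sum(axis=1) + (V**2).sum(axis=1))
```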
39 Mutual information provides another viewpoint on the equivalence between conditional independence and the effective subspace. [sent-60, score-0.402]
40 Because Eq. (1) implies I(Y, X) = I(Y, U), the effective subspace S is characterized as the subspace which retains the entire mutual information between X and Y, or equivalently, such that I(Y |U, V |U ) = 0. [sent-63, score-0.56]
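The equivalence can be made explicit with the chain rule of mutual information (a standard identity, stated here for completeness; I(Y |U, V |U ) denotes the mutual information of Y and V conditioned on U):

```latex
I(Y,X) \;=\; I(Y,U) + I(Y|U,\,V|U),
\qquad\text{so}\qquad
I(Y,X) = I(Y,U) \;\Longleftrightarrow\; I(Y|U,\,V|U) = 0
\;\Longleftrightarrow\; Y \mathrel{\perp\!\!\!\perp} V \mid U .
```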
41 This is again the conditional independence of Y and V given U . [sent-64, score-0.293]
42 2 Covariance operators on kernel Hilbert spaces and conditional independence We use cross-covariance operators [10] on RKHSs to characterize the conditional independence of random variables. [sent-66, score-0.939]
43 Let (H, k) be a (real) reproducing kernel Hilbert space of functions on a set Ω with a positive definite kernel k : Ω × Ω → R and an inner product $\langle \cdot, \cdot \rangle_H$. [sent-67, score-0.266]
44 The most important aspect of an RKHS is the reproducing property: $\langle f, k(\cdot, x) \rangle_H = f(x)$ for all $x \in \Omega$ and $f \in H$. (4) [sent-68, score-0.076]
45 In this paper we focus on the Gaussian kernel $k(x_1, x_2) = \exp\bigl(-\|x_1 - x_2\|^2 / (2\sigma^2)\bigr)$. [sent-69, score-0.095]
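As a small illustration of how this kernel is used in practice (our own sketch and function names; the centering step reflects the empirical, centered Gram matrices that appear in the estimators below):

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def center_gram(K):
    """Empirical centering K_c = H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

# Example: centered Gram matrix of 50 points in R^3 with sigma = 1.
X = np.random.default_rng(1).standard_normal((50, 3))
K_hat = center_gram(gaussian_gram(X, sigma=1.0))
```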
46 Let (H1 , k1 ) and (H2 , k2 ) be RKHSs over measurable spaces (Ω1 , B1 ) and (Ω2 , B2 ), respectively, with k1 and k2 measurable. [sent-70, score-0.147]
47 Eq. (5) implies that the covariance of f(X) and g(Y) is given by the action of the linear operator $\Sigma_{YX}$ and the inner product. [sent-73, score-0.135]
48 Under the assumption that $E_X[k_1(X, X)]$ and $E_Y[k_2(Y, Y)]$ are finite, by using Riesz’s representation theorem, it is not difficult to see that a bounded operator $\Sigma_{YX}$ is uniquely defined by Eq. (5). [sent-74, score-0.077]
49 Cross-covariance operators provide a useful framework for discussing conditional probability and conditional independence, as shown by the following theorem and its corollary: Theorem 1. [sent-79, score-0.506]
50 Let (H1 , k1 ) and (H2 , k2 ) be RKHSs on measurable spaces Ω1 and Ω2 , respectively, with k1 and k2 measurable, and (X, Y ) be a random vector on Ω1 ×Ω2 . [sent-80, score-0.147]
51 Assume that $E_X[k_1(X, X)]$ and $E_Y[k_2(Y, Y)]$ are finite, and for all g ∈ H2 the conditional expectation $E_{Y|X}[g(Y) \mid X = \cdot]$ is an element of H1. [sent-81, score-0.199]
52 (8) This can be understood by analogy to the conditional expectation of Gaussian random variables. [sent-94, score-0.199]
53 If X and Y are Gaussian random variables, it is well-known that the conditional expectation is given by $E_{Y|X}[a^T Y \mid X = x] = x^T \Sigma_{XX}^{-1} \Sigma_{XY}\, a$ for an arbitrary vector a, where $\Sigma_{XX}$ and $\Sigma_{XY}$ are the variance-covariance matrices in the ordinary sense. [sent-95, score-0.199]
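A quick numerical sanity check of this Gaussian identity (illustration only): for zero-mean jointly Gaussian variables, the population least-squares coefficient of $a^T Y$ regressed on X equals $\Sigma_{XX}^{-1}\Sigma_{XY} a$.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy, n = 3, 2, 200_000

A = rng.standard_normal((dx + dy, dx + dy))
Sigma = A @ A.T                                  # joint covariance of (X, Y), positive definite
Z = rng.multivariate_normal(np.zeros(dx + dy), Sigma, size=n)
X, Y = Z[:, :dx], Z[:, dx:]

a = rng.standard_normal(dy)
beta_formula = np.linalg.solve(Sigma[:dx, :dx], Sigma[:dx, dx:] @ a)   # Sigma_XX^{-1} Sigma_XY a
beta_ols, *_ = np.linalg.lstsq(X, Y @ a, rcond=None)                   # regression of a^T Y on X

print(np.round(beta_formula, 3))
print(np.round(beta_ols, 3))       # agrees with beta_formula up to sampling error
```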
54 Using cross-covariance operators, we derive an objective function for characterizing conditional independence. [sent-96, score-0.205]
55 Let (H1, k1) and (H2, k2) be RKHSs on measurable spaces Ω1 and Ω2, respectively, with k1 and k2 measurable, and let U and Y be random variables on Ω1 and Ω2. [sent-97, score-0.198]
56 We define the conditional covariance operator $\Sigma_{YY|U}$ on H2 by $\Sigma_{YY|U} := \Sigma_{YY} - \Sigma_{YU}\,\Sigma_{UU}^{-1}\,\Sigma_{UY}$. (9) [sent-98, score-0.233]
57 Corollary 2 easily yields the following result on the conditional covariance of variables: Theorem 3. [sent-99, score-0.233]
58 Let (Ω, B) be a measurable space, let (H, k) be an RKHS over Ω with k measurable and bounded, and let M be the set of all probability measures on (Ω, B). [sent-107, score-0.24]
59 The following theorem can be proved using an argument similar to that used in the proof of Theorem 2 in [9]. [sent-109, score-0.065]
60 For an arbitrary σ > 0, the RKHS with Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ on $\mathbb{R}^m$ is probability-determining. [sent-111, score-0.095]
61 Recall that for two RKHSs H1 and H2 on Ω1 and Ω2 , respectively, the direct product H1 ⊗H2 is the RKHS on Ω1 ×Ω2 with the kernel k1 k2 [6]. [sent-112, score-0.095]
62 The relation between conditional independence and the conditional covariance operator is given by the following theorem: Theorem 5. [sent-113, score-0.582]
63 Let (H11 , k11 ), (H12 , k12 ), and (H2 , k2 ) be RKHSs on measurable spaces Ω11 , Ω12 , and Ω2 , respectively, with continuous and bounded kernels. [sent-114, score-0.168]
64 Taking the expectation of the well-known equality $\mathrm{Var}_{Y|U}[g(Y)|U] = E_{V|U}\bigl[\mathrm{Var}_{Y|U,V}[g(Y)|U,V]\bigr] + \mathrm{Var}_{V|U}\bigl[E_{Y|U,V}[g(Y)|U,V]\bigr]$ with respect to U, we obtain $E_U\bigl[\mathrm{Var}_{Y|U}[g(Y)|U]\bigr] - E_X\bigl[\mathrm{Var}_{Y|X}[g(Y)|X]\bigr] = E_U\bigl[\mathrm{Var}_{V|U}\bigl[E_{Y|X}[g(Y)|X]\bigr]\bigr] \ge 0$, which implies Eq. [sent-120, score-0.082]
65 From Theorem 5, for probability-determining kernel spaces, the effective subspace S can be characterized in terms of the solution to the following minimization problem: $\min_S \Sigma_{YY|U}$, where $U = \Pi_S X$. (14) [sent-126, score-0.386]
66 Kernel generalized variance for dimensionality reduction To derive a sample-based objective function from Eq. (14) [sent-128, score-0.365]
67 for a finite sample, we have to estimate the conditional covariance operator with given data, and choose a specific way to evaluate the size of self-adjoint operators. [sent-129, score-0.289]
68 With a regularization constant ε > 0, the empirical conditional covariance matrix $\hat\Sigma_{YY|U}$ is then defined by $\hat\Sigma_{YY|U} := \hat\Sigma_{YY} - \hat\Sigma_{YU}\hat\Sigma_{UU}^{-1}\hat\Sigma_{UY} = (\hat K_Y + \varepsilon I_n)^2 - \hat K_Y \hat K_U (\hat K_U + \varepsilon I_n)^{-2} \hat K_U \hat K_Y$. [sent-135, score-0.233]
69 Using the Schur decomposition, $\det(A - B C^{-1} B^T) = \det\begin{pmatrix} A & B \\ B^T & C \end{pmatrix} / \det C$, we have $\det \hat\Sigma_{YY|U} = \det \hat\Sigma_{[YU][YU]} / \det \hat\Sigma_{UU}$, where $\hat\Sigma_{[YU][YU]}$ is defined by $\hat\Sigma_{[YU][YU]} = \begin{pmatrix} \hat\Sigma_{YY} & \hat\Sigma_{YU} \\ \hat\Sigma_{UY} & \hat\Sigma_{UU} \end{pmatrix} = \begin{pmatrix} (\hat K_Y + \varepsilon I_n)^2 & \hat K_Y \hat K_U \\ \hat K_U \hat K_Y & (\hat K_U + \varepsilon I_n)^2 \end{pmatrix}$. (17) [sent-138, score-0.376]
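The determinant identity used here can be checked numerically in a few lines (a standalone illustration with random positive definite blocks, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 3
M = rng.standard_normal((p + q, p + q))
S = M @ M.T + 1e-3 * np.eye(p + q)          # positive definite (p+q) x (p+q) block matrix
A, B, C = S[:p, :p], S[:p, p:], S[p:, p:]

lhs = np.linalg.det(A - B @ np.linalg.solve(C, B.T))   # det(A - B C^{-1} B^T)
rhs = np.linalg.det(S) / np.linalg.det(C)              # det([[A, B], [B^T, C]]) / det(C)
assert np.isclose(lhs, rhs)
```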
70 We symmetrize the objective function by dividing by the constant $\det \hat\Sigma_{YY}$, which yields $\min_{B \in \mathbb{R}^{m \times r}} \dfrac{\det \hat\Sigma_{[YU][YU]}}{\det \hat\Sigma_{YY}\, \det \hat\Sigma_{UU}}$, where $U = B^T X$. (18) [sent-139, score-0.406]
71 We refer to this minimization problem with respect to the choice of subspace S or matrix B as Kernel Dimensionality Reduction (KDR). [sent-140, score-0.159]
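A minimal sketch of the resulting contrast function (our own code and names, assuming centered Gaussian Gram matrices and evaluating the ratio in Eq. (18) in log-determinant form for numerical stability; this is not the authors' implementation):

```python
import numpy as np

def _centered_gram(Z, sigma):
    """Centered Gaussian Gram matrix of the rows of Z."""
    sq = np.sum(Z**2, axis=1)
    K = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0) / (2.0 * sigma**2))
    H = np.eye(len(Z)) - np.ones((len(Z), len(Z))) / len(Z)
    return H @ K @ H

def kdr_contrast(B, X, Y, sigma=1.0, eps=0.1):
    """log det S_[YU][YU] - log det S_YY - log det S_UU (log of the ratio in Eq. (18)).
    B is m x r with U = X B; smaller values indicate a better subspace."""
    n = X.shape[0]
    Ku = _centered_gram(X @ B, sigma)
    Ky = _centered_gram(Y if Y.ndim == 2 else Y[:, None], sigma)
    I = np.eye(n)
    Syy = (Ky + eps * I) @ (Ky + eps * I)
    Suu = (Ku + eps * I) @ (Ku + eps * I)
    Syu = Ky @ Ku
    block = np.block([[Syy, Syu], [Syu.T, Suu]])
    return (np.linalg.slogdet(block)[1]
            - np.linalg.slogdet(Syy)[1]
            - np.linalg.slogdet(Suu)[1])
```

In this form the contrast can be evaluated for any candidate B and handed to a generic optimizer over matrices with orthonormal columns.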
72 In [9] the KGV was motivated as an approximation of the mutual information I(Y, U); here, with an entirely different argument, we have shown that KGV is an appropriate objective function for the dimensionality reduction problem, and that minimizing Eq. (18) yields the effective subspace. [sent-159, score-0.365]
73 Given that the numerical task that must be solved in KDR is the same as the one to be solved in kernel ICA, we can import all of the computational techniques developed in [9] for minimizing KGV. [sent-161, score-0.095]
74 To cope with local optima, we make use of an annealing technique, in which the scale parameter σ for the Gaussian kernel is decreased gradually during the iterations of optimization. [sent-163, score-0.095]
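A toy version of such an annealed optimization loop might look as follows (a sketch only: it uses a crude finite-difference gradient and QR re-orthonormalization, whereas the actual implementation follows the techniques of [9]; `contrast` can be any objective with the signature used in the sketch above, e.g. `optimize_subspace(kdr_contrast, X, Y, r=2)`):

```python
import numpy as np

def optimize_subspace(contrast, X, Y, r, sigmas=(4.0, 2.0, 1.0, 0.5),
                      n_steps=30, lr=0.1, fd_step=1e-4, seed=0):
    """Minimize contrast(B, X, Y, sigma) over m x r matrices B with orthonormal
    columns, gradually decreasing the kernel width sigma (annealing)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    B, _ = np.linalg.qr(rng.standard_normal((m, r)))     # random orthonormal start
    for sigma in sigmas:                                  # annealing schedule
        for _ in range(n_steps):
            f0 = contrast(B, X, Y, sigma)
            grad = np.zeros_like(B)
            for i in range(m):                            # finite-difference gradient
                for j in range(r):
                    Bp = B.copy()
                    Bp[i, j] += fd_step
                    grad[i, j] = (contrast(Bp, X, Y, sigma) - f0) / fd_step
            B, _ = np.linalg.qr(B - lr * grad)            # step, then re-orthonormalize
    return B
```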
75 Next, we apply the KDR method to classification problems, for which many conventional methods of dimensionality reduction are not suitable. [sent-180, score-0.38]
76 In particular, SIR requires the dimensionality of the effective subspace to be less than the number of classes, because SIR uses the average of X in slices along the variable Y . [sent-181, score-0.5]
77 CCA and PLS have a similar limitation on the dimensionality of the effective subspace. [sent-182, score-0.303]
78 We illustrate the visualization capability of the dimensionality reduction methods on the Wine dataset from the UCI repository, to see how the projection onto a low-dimensional space provides an effective description of the data. [sent-184, score-0.53]
79 The Wine data consists of 178 samples with 13 variables and a label with three classes. [sent-185, score-0.073]
80 Figure 2 shows the projection onto the 2-dimensional subspace estimated by each method. [sent-186, score-0.195]
81 These results show that KDR successfully finds an effective subspace which preserves the class information even when the dimensionality is reduced significantly. [sent-197, score-0.462]
82 4 Extension to variable selection The KDR method can be extended to variable selection, in which a subset of the given explanatory variables $\{X_1, \ldots, X_m\}$ is selected. [sent-198, score-0.236]
83 Extension of the KGV objective function to variable selection is straightforward. [sent-202, score-0.116]
84 We have only to compare the KGV values for all the subspaces spanned by combinations of a fixed number of selected variables. [sent-203, score-0.104]
85 We of course do not avoid the combinatorial problem of variable selection; the total number of combinations may be intractably large for a large number of explanatory variables m, and greedy or random search procedures are needed. [sent-204, score-0.178]
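A greedy forward-selection loop of this kind can be sketched as follows (our illustration, not the paper's code; `score` stands for any subset-quality criterion, such as a KGV-style contrast evaluated on the selected columns):

```python
import numpy as np

def greedy_variable_selection(score, X, Y, k):
    """Select k columns of X one at a time, each time adding the variable whose
    inclusion gives the smallest value of score(X[:, subset], Y)."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best_j, best_val = None, np.inf
        for j in remaining:
            val = score(X[:, selected + [j]], Y)
            if val < best_val:
                best_j, best_val = j, val
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```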
86 We first apply this kernel method to the Boston Housing data (506 samples with a 13-dimensional X), which has been used as a typical example of variable selection. [sent-205, score-0.155]
87 The selected variables are exactly the same as the ones selected by ACE [2]. [sent-207, score-0.109]
88 We select 50 effective genes to classify two types of leukemia using 38 training samples. [sent-209, score-0.206]
89 For optimization of the KGV value, we use a greedy algorithm, in which new variables are selected one by one, followed by a variant of a genetic algorithm. [sent-210, score-0.08]
90 Half of the 50 genes coincide with the 50 genes selected by [12]. [sent-211, score-0.161]
91 With the genes selected by our method, the same classifier as that used in [12] classifies correctly 32 of the 34 test samples, for which, with their 50 genes, Golub et al. [sent-212, score-0.095]
92 5 Conclusion We have presented KDR, a novel method of dimensionality reduction for supervised learning. [sent-214, score-0.387]
93 KDR is based on weaker assumptions than essentially all existing methods for dimensionality reduction in regression, including SIR, pHd, CCA, and PPR. [sent-217, score-0.335]
94 We have demonstrated promising empirical performance of KDR, showing its practical utility in data visualization and feature selection for prediction. [sent-218, score-0.098]
95 The theoretical basis of KDR lies in the nonparametric characterization of conditional independence that we have presented in this paper. [sent-220, score-0.321]
96 Extending earlier work on the kernel-based characterization of marginal independence [9], we have shown that conditional independence can be characterized in terms of covariance operators on a kernel Hilbert space. [sent-221, score-0.762]
97 While our focus has been on the problem of dimensionality reduction, it is also worth noting that there are many possible other applications of this result. [sent-222, score-0.194]
98 In particular, conditional independence plays an important role in the structural definition of graphical models, and our result may have implications for model selection and inference in graphical models. [sent-223, score-0.341]
99 Nonlinear component analysis as a kernel eigenvalue problem. [sent-288, score-0.095]
100 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-309, score-0.364]
wordName wordTfidf (topN-words)
[('kdr', 0.437), ('ey', 0.343), ('sir', 0.291), ('xx', 0.246), ('dimensionality', 0.194), ('rkhss', 0.187), ('py', 0.181), ('conditional', 0.175), ('subspace', 0.159), ('kgv', 0.146), ('ku', 0.146), ('xy', 0.145), ('reduction', 0.141), ('ky', 0.126), ('independence', 0.118), ('effective', 0.109), ('phd', 0.106), ('px', 0.101), ('cca', 0.099), ('measurable', 0.097), ('kernel', 0.095), ('det', 0.094), ('operators', 0.091), ('pls', 0.083), ('rkhs', 0.082), ('reproducing', 0.076), ('bach', 0.072), ('eu', 0.069), ('ex', 0.068), ('genes', 0.066), ('theorem', 0.065), ('mutual', 0.064), ('hilbert', 0.064), ('regression', 0.063), ('ker', 0.062), ('explanatory', 0.061), ('covariance', 0.058), ('marginal', 0.056), ('operator', 0.056), ('wine', 0.054), ('fukumizu', 0.054), ('semiparametric', 0.054), ('supervised', 0.052), ('variables', 0.051), ('visualization', 0.05), ('spaces', 0.05), ('uu', 0.049), ('selection', 0.048), ('subspaces', 0.047), ('conventional', 0.045), ('classi', 0.043), ('ace', 0.042), ('sliced', 0.042), ('varv', 0.042), ('assumptions', 0.04), ('classification', 0.039), ('variable', 0.038), ('equality', 0.037), ('projection', 0.036), ('pursuit', 0.033), ('vary', 0.033), ('leukemia', 0.031), ('suppressing', 0.031), ('ionosphere', 0.031), ('gu', 0.031), ('golub', 0.031), ('gy', 0.031), ('objective', 0.03), ('jordan', 0.03), ('selected', 0.029), ('wish', 0.029), ('optima', 0.029), ('jmlr', 0.029), ('characterization', 0.028), ('combinations', 0.028), ('rm', 0.028), ('cancer', 0.027), ('restrictions', 0.027), ('characterize', 0.026), ('gram', 0.026), ('cov', 0.026), ('ui', 0.025), ('retains', 0.025), ('hessian', 0.025), ('respectively', 0.025), ('expectation', 0.024), ('cs', 0.023), ('distinction', 0.023), ('characterized', 0.023), ('let', 0.023), ('orthogonal', 0.023), ('extension', 0.023), ('samples', 0.022), ('sketch', 0.022), ('uci', 0.022), ('corollary', 0.021), ('span', 0.021), ('bounded', 0.021), ('implies', 0.021), ('inverse', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
Author: Kenji Fukumizu, Francis R. Bach, Michael I. Jordan
Abstract: We propose a novel method of dimensionality reduction for supervised learning. Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the problem of dimensionality reduction as that of finding a low-dimensional “effective subspace” of X which retains the statistical relationship between X and Y . We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y . 1
2 0.084848791 92 nips-2003-Information Bottleneck for Gaussian Variables
Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss
Abstract: The problem of extracting the relevant aspects of data was addressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another - relevance - variable. An interesting question addressed in the current work is the extension of these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters. We give a formal definition of the general continuous IB problem and obtain an analytic solution for the optimal representation for the important case of multivariate Gaussian variables. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized correlation matrix Σx|y Σ−1 , which x is also the basis obtained in Canonical Correlation Analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides an analytic expression for the optimal tradeoff - the information curve - in terms of the eigenvalue spectrum. 1
3 0.083558962 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
Author: Ingo Steinwart
Abstract: The decision functions constructed by support vector machines (SVM’s) usually depend only on a subset of the training set—the so-called support vectors. We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM’s. In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM. 1
4 0.078255937 115 nips-2003-Linear Dependent Dimensionality Reduction
Author: Nathan Srebro, Tommi S. Jaakkola
Abstract: We formulate linear dimensionality reduction as a semi-parametric estimation problem, enabling us to study its asymptotic behavior. We generalize the problem beyond additive Gaussian noise to (unknown) nonGaussian additive noise, and to unbiased non-additive models. 1
5 0.071105689 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
Author: Xuanlong Nguyen, Michael I. Jordan
Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1
6 0.071100697 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
7 0.070752218 112 nips-2003-Learning to Find Pre-Images
8 0.070597939 124 nips-2003-Max-Margin Markov Networks
9 0.069331303 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
10 0.068000704 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
11 0.066227809 107 nips-2003-Learning Spectral Clustering
12 0.064190738 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
13 0.063982092 1 nips-2003-1-norm Support Vector Machines
14 0.063388728 51 nips-2003-Design of Experiments via Information Theory
15 0.060573466 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
16 0.06032639 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
17 0.058510631 128 nips-2003-Minimax Embeddings
18 0.057660617 126 nips-2003-Measure Based Regularization
19 0.056517798 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
20 0.055872716 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
topicId topicWeight
[(0, -0.174), (1, -0.082), (2, -0.031), (3, -0.035), (4, 0.034), (5, 0.093), (6, -0.001), (7, -0.038), (8, -0.0), (9, -0.032), (10, -0.018), (11, 0.03), (12, 0.031), (13, 0.009), (14, -0.012), (15, 0.02), (16, 0.044), (17, 0.109), (18, 0.111), (19, 0.019), (20, -0.047), (21, 0.052), (22, 0.07), (23, 0.015), (24, 0.002), (25, -0.042), (26, 0.081), (27, 0.037), (28, -0.042), (29, -0.038), (30, 0.006), (31, -0.0), (32, -0.101), (33, 0.011), (34, 0.077), (35, -0.127), (36, -0.032), (37, 0.032), (38, 0.177), (39, -0.062), (40, 0.083), (41, -0.021), (42, 0.063), (43, 0.033), (44, 0.003), (45, 0.101), (46, -0.061), (47, -0.204), (48, -0.116), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.93365759 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
Author: Kenji Fukumizu, Francis R. Bach, Michael I. Jordan
Abstract: We propose a novel method of dimensionality reduction for supervised learning. Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the problem of dimensionality reduction as that of finding a low-dimensional “effective subspace” of X which retains the statistical relationship between X and Y . We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y . 1
2 0.56865102 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
Author: Thomas R. Strohmann, Andrei Belitski, Gregory Z. Grudic, Dennis DeCoste
Abstract: The Minimax Probability Machine Classification (MPMC) framework [Lanckriet et al., 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracy bound Ω. The only assumptions that MPMC makes is that good estimates of means and covariance matrices of the classes exist. However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance. In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally adding basis functions (i.e. kernels) one at a time – greedily selecting the next one that maximizes the accuracy bound Ω. SMPMC automatically chooses both kernel parameters and feature weights without using computationally expensive cross validation. Therefore the SMPMC algorithm simultaneously addresses the problem of kernel selection and feature selection (i.e. feature weighting), based solely on maximizing the accuracy bound Ω. Experimental results indicate that we can obtain reliable bounds Ω, as well as test set accuracies that are comparable to state of the art classification algorithms.
3 0.53136498 92 nips-2003-Information Bottleneck for Gaussian Variables
Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss
Abstract: The problem of extracting the relevant aspects of data was addressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another - relevance - variable. An interesting question addressed in the current work is the extension of these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters. We give a formal definition of the general continuous IB problem and obtain an analytic solution for the optimal representation for the important case of multivariate Gaussian variables. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized correlation matrix Σx|y Σ−1 , which x is also the basis obtained in Canonical Correlation Analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides an analytic expression for the optimal tradeoff - the information curve - in terms of the eigenvalue spectrum. 1
4 0.46309412 51 nips-2003-Design of Experiments via Information Theory
Author: Liam Paninski
Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1
5 0.46088642 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
Author: Claudio Gentile
Abstract: New feature selection algorithms for linear threshold functions are described which combine backward elimination with an adaptive regularization method. This makes them particularly suitable to the classification of microarray expression data, where the goal is to obtain accurate rules depending on few genes only. Our algorithms are fast and easy to implement, since they center on an incremental (large margin) algorithm which allows us to avoid linear, quadratic or higher-order programming methods. We report on preliminary experiments with five known DNA microarray datasets. These experiments suggest that multiplicative large margin algorithms tend to outperform additive algorithms (such as SVM) on feature selection tasks. 1
6 0.43115708 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
7 0.41801029 115 nips-2003-Linear Dependent Dimensionality Reduction
8 0.41584954 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
9 0.3909207 120 nips-2003-Locality Preserving Projections
10 0.3895233 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
11 0.37965637 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
12 0.37832028 3 nips-2003-AUC Optimization vs. Error Rate Minimization
13 0.37820432 176 nips-2003-Sequential Bayesian Kernel Regression
14 0.37624493 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
15 0.37383798 128 nips-2003-Minimax Embeddings
16 0.37325704 1 nips-2003-1-norm Support Vector Machines
17 0.37268591 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
18 0.36791188 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
19 0.35860562 66 nips-2003-Extreme Components Analysis
20 0.34345943 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
topicId topicWeight
[(0, 0.035), (11, 0.012), (29, 0.013), (30, 0.028), (35, 0.055), (53, 0.149), (66, 0.315), (71, 0.076), (76, 0.056), (85, 0.057), (91, 0.095), (99, 0.019)]
simIndex simValue paperId paperTitle
1 0.95369834 195 nips-2003-When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?
Author: David Donoho, Victoria Stodden
Abstract: We interpret non-negative matrix factorization geometrically, as the problem of finding a simplicial cone which contains a cloud of data points and which is contained in the positive orthant. We show that under certain conditions, basically requiring that some of the data are spread across the faces of the positive orthant, there is a unique such simplicial cone. We give examples of synthetic image articulation databases which obey these conditions; these require separated support and factorial sampling. For such databases there is a generative model in terms of ‘parts’ and NMF correctly identifies the ‘parts’. We show that our theoretical results are predictive of the performance of published NMF code, by running the published algorithms on one of our synthetic image articulation databases. 1
2 0.83342236 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects
Author: Xuerui Wang, Rebecca Hutchinson, Tom M. Mitchell
Abstract: We consider learning to classify cognitive states of human subjects, based on their brain activity observed via functional Magnetic Resonance Imaging (fMRI). This problem is important because such classifiers constitute “virtual sensors” of hidden cognitive states, which may be useful in cognitive science research and clinical applications. In recent work, Mitchell, et al. [6,7,9] have demonstrated the feasibility of training such classifiers for individual human subjects (e.g., to distinguish whether the subject is reading an ambiguous or unambiguous sentence, or whether they are reading a noun or a verb). Here we extend that line of research, exploring how to train classifiers that can be applied across multiple human subjects, including subjects who were not involved in training the classifier. We describe the design of several machine learning approaches to training multiple-subject classifiers, and report experimental results demonstrating the success of these methods in learning cross-subject classifiers for two different fMRI data sets. 1
same-paper 3 0.82296705 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
Author: Kenji Fukumizu, Francis R. Bach, Michael I. Jordan
Abstract: We propose a novel method of dimensionality reduction for supervised learning. Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the problem of dimensionality reduction as that of finding a low-dimensional “effective subspace” of X which retains the statistical relationship between X and Y . We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y . 1
4 0.79502469 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
Author: Leonid Sigal, Michael Isard, Benjamin H. Sigelman, Michael J. Black
Abstract: The detection and pose estimation of people in images and video is made challenging by the variability of human appearance, the complexity of natural scenes, and the high dimensionality of articulated body models. To cope with these problems we represent the 3D human body as a graphical model in which the relationships between the body parts are represented by conditional probability distributions. We formulate the pose estimation problem as one of probabilistic inference over a graphical model where the random variables correspond to the individual limb parameters (position and orientation). Because the limbs are described by 6-dimensional vectors encoding pose in 3-space, discretization is impractical and the random variables in our model must be continuousvalued. To approximate belief propagation in such a graph we exploit a recently introduced generalization of the particle filter. This framework facilitates the automatic initialization of the body-model from low level cues and is robust to occlusion of body parts and scene clutter. 1
5 0.64853531 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
Author: Zach Solan, David Horn, Eytan Ruppin, Shimon Edelman
Abstract: We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be “intermediate” for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children. 1
6 0.59907222 115 nips-2003-Linear Dependent Dimensionality Reduction
7 0.59625304 107 nips-2003-Learning Spectral Clustering
8 0.57403022 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
9 0.57162714 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
10 0.56040668 190 nips-2003-Unsupervised Color Decomposition Of Histologically Stained Tissue Samples
11 0.55771816 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
12 0.55768538 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
13 0.55660391 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
14 0.55576205 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
15 0.55520505 126 nips-2003-Measure Based Regularization
16 0.54764247 30 nips-2003-Approximability of Probability Distributions
17 0.5468694 66 nips-2003-Extreme Components Analysis
18 0.54614574 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
19 0.54602695 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
20 0.54592562 113 nips-2003-Learning with Local and Global Consistency