nips nips2004 nips2004-198 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Antti Honkela, Harri Valpola
Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1
Reference: text
sentIndex sentText sentNum sentScore
1 fi/projects/bayes/ Abstract In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. [sent-9, score-0.484]
2 The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. [sent-10, score-0.326]
3 This yields an accurate approximation for cases of large posterior variance. [sent-11, score-0.271]
4 The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. [sent-12, score-0.398]
5 This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. [sent-13, score-0.582]
6 1 Introduction Linear latent variable models such as factor analysis, principal component analysis (PCA) and independent component analysis (ICA) [1] are used in many applications ranging from engineering to social sciences and psychology. [sent-14, score-0.236]
7 In many of these cases, the effect of the desired factors or sources to the observed data is, however, not linear. [sent-15, score-0.207]
8 The method presented in this paper can be used as a basis for many nonlinear latent variable models, such as nonlinear generalizations of the above models. [sent-17, score-0.541]
9 It is based on the variational Bayesian framework, which provides a solid foundation for nonlinear modeling that would otherwise be prone to overfitting [2]. [sent-18, score-0.41]
10 It also allows for easy comparison of different model structures, which is even more important for flexible nonlinear models than for simpler linear models. [sent-19, score-0.245]
11 General nonlinear generative models for data x(t) of the type x(t) = f (s(t), θ f ) + n(t) = Bφ(As(t) + a) + b + n(t) (1) often employ a multi-layer perceptron (MLP) (as in the equation) or a radial basis function (RBF) network to model the nonlinearity. [sent-20, score-0.361]
12 In context of variational Bayesian methods, RBF networks seem more popular of the two because it is easier to evaluate analytic expressions and bounds for certain key quantities [3]. [sent-22, score-0.298]
13 Most existing applications of variational Bayesian methods for nonlinear models are concerned with the supervised case where the inputs of the network are known and only the weights have to be learned [3, 5]. [sent-27, score-0.577]
14 This is easier as there are fewer parameters with related posterior variance above the nonlinear hidden layer and the distributions thus tend to be easier to handle. [sent-28, score-0.577]
15 In this paper we present a novel method for evaluating the statistics of the outputs of an MLP network in context of unsupervised variational Bayesian learning of its weights and inputs. [sent-29, score-0.419]
16 The method is demonstrated with a nonlinear factor analysis problem. [sent-30, score-0.363]
17 The nonlinearity (1) can be used as a building block of many different models depending on the model assumed for the sources S. [sent-33, score-0.208]
18 Simple Gaussian prior on S leads to a nonlinear factor analysis (NFA) model [6, 7] that is studied here because of its simplicity. [sent-34, score-0.363]
19 The method could easily be extended with a mixture-of-Gaussians prior on S [8] to get a nonlinear independent factor analysis model, but this is omitted here. [sent-35, score-0.398]
20 In many nonlinear blind source separation (BSS) problems it is enough to apply simple NFA followed by linear ICA postprocessing to achieve nonlinear BSS [6, 7]. [sent-36, score-0.519]
21 In order to deal with the flexible nonlinear models, a powerful learning paradigm resistant to overfitting is needed. [sent-38, score-0.245]
22 The variational Bayesian method of ensemble learning [2] has proven useful here. [sent-39, score-0.239]
23 Ensemble learning is based on approximating the true posterior p(S, θ|X) with a tractable approximation q(S, θ), typically a multivariate Gaussian with a diagonal covariance. [sent-40, score-0.273]
24 The approximation is fitted to minimize the cost q(S, θ) C = log = D(q(S, θ)||p(S, θ|X)) − log p(X) (2) p(S, θ, X) where · denotes expectation over q(S, θ) and D(q||p) is the Kullback-Leibler divergence between q and p. [sent-41, score-0.314]
25 The cost can be evaluated analytically for a large class of mainly linear models [10, 11] leading to simple and efficient learning algorithms. [sent-43, score-0.309]
26 1 Evaluating the cost Unfortunately, the cost (2) cannot be evaluated analytically for the nonlinear model (1). [sent-45, score-0.668]
27 t (3) The term Cx depends on the first and second moments of f (s(t), θ f ) over the posterior approximation q(S, θ), and they cannot easily be evaluated analytically. [sent-47, score-0.35]
28 If the activation functions of the MLP network were linear, the output mean and variance could be evaluated exactly using only the mean and variance of the inputs s(t) and θ f . [sent-49, score-0.702]
29 Thus a natural first approximation would be to linearize the network about the input mean using derivatives [6]. [sent-50, score-0.386]
30 Due to the local nature of the approximation, this can lead to severe underestimation of the variance, especially when the hidden neurons of the MLP network operate in the saturated region. [sent-52, score-0.328]
31 This makes the nonlinear factor analysis algorithm using this approach unstable with large number of factors because the posterior variance corresponding to the last factors is typically large. [sent-53, score-0.641]
32 To avoid this problem, we propose using a Gauss–Hermite quadrature to evaluate an effective linearization of the nonlinear activation functions φ(yi (t)). [sent-54, score-0.651]
33 The Gauss– Hermite quadrature is a method for approximating weighted integrals ∞ f (x) exp(−x2 ) dx ≈ −∞ wk f (tk ), (5) k where the weights wk and abscissas tk are selected by requiring exact result for suitable number of low-order polynomials. [sent-55, score-0.534]
34 This allows evaluating the mean and variance of φ(yi (t)) by quadratures φ(yi (t))GH = wk φ y i (t) + tk yi (t) (6) k 2 wk φ y i (t) + tk φ(yi (t))GH = yi (t) − φ(yi (t))GH , (7) k respectively. [sent-56, score-0.932]
35 Here the weights and abscissas have been scaled to take into account the Gaussian pdf weight instead of exp(−x2 ), and y i (t) and yi (t) are the mean and variance of yi (t), respectively. [sent-57, score-0.571]
36 We used a three point quadrature that yields accurate enough results but can be evaluated quickly. [sent-58, score-0.441]
37 As both of the quadratures depend on φ at the same points, they can be evaluated together easily. [sent-62, score-0.299]
38 Using the approximation formula φ(yi (t)) = φ (yi (t))2 yi (t), the resulting mean and variance can be interpreted to yield an effective linearization of φ(yi (t)) through φ(yi (t))GH . [sent-63, score-0.52]
39 (8) yi (t) The positive square root is used here because the derivative of the logistic sigmoid used as activation function is always positive. [sent-64, score-0.298]
40 (4), the exact mean and variance of the linearized model can be evaluated in a relatively straightforward manner. [sent-66, score-0.36]
41 Evaluation of the variance due to the sources requires propagating matrices through the network to track the correlations between the hidden units. [sent-67, score-0.477]
42 The same problem does not affect the network weights as each parameter only affects the value of one hidden neuron. [sent-69, score-0.214]
43 2 Details of the approximation The mean and variance of φ(yi (t)) depend on the distribution of yi (t). [sent-71, score-0.455]
44 The Gauss– Hermite quadrature assumes that yi (t) is Gaussian. [sent-72, score-0.357]
45 This is not true in our case, as the product of two independent normally distributed variables aij and sj (t) is super-Gaussian, although rather close to Gaussian if the mean of one of the variables is significantly larger in absolute value than the standard deviation. [sent-73, score-0.287]
46 Ignoring the non-Gaussianity, the quadrature depends on the mean and variance of yi (t). [sent-75, score-0.511]
47 These can be evaluated exactly because of the linearity of the mapping as 2 Aij (sj (t)2 + sj (t)) + Aij sj (t) + ai , yi,tot (t) = (9) j where θ denotes the mean and θ the variance of θ. [sent-76, score-0.486]
48 In an experiment investigating the approximation accuracy with a random MLP [12], the Taylor approximation was found to underestimate the output variance by a factor of 400, at worst. [sent-79, score-0.514]
49 The worst case result of the above approximation was underestimation by a factor of 40, which is a great improvement over the Taylor approximation, but still far from perfect. [sent-80, score-0.344]
50 The worst case behavior could be improved to underestimation by a factor of 5 by introducing another quadrature evaluated with a different variance for yi (t). [sent-81, score-0.83]
51 The difference in behavior of the two methods in more realistic cases is less drastic, but the version with two quadratures seems to provide more accurate approximations. [sent-83, score-0.188]
52 The more accurate approximation is implemented by evaluating another quadrature using the variance of yi (t) originating mainly from θ f , Aij (sj (t)2 + sj (t)) + ai , yi,weight (t) = (10) j and using the implied φ (yi (t)) in the evaluation of the effects of these variances. [sent-84, score-0.841]
53 The total variance (9) is still used in evaluation of the means and the evaluation of the effects of the variance of s(t). [sent-85, score-0.337]
54 3 Learning algorithm for nonlinear factor analysis The nonlinear factor analysis (NFA) model [6] is learned by numerically minimizing the cost C evaluated above. [sent-87, score-1.046]
55 The minimization algorithm is a combination of conjugate gradient for the means of S and θ f , fixed point iteration for the variances of S and θ f , and EM like updates for other parameters and hyperparameters. [sent-88, score-0.222]
56 The fixed point update algorithm for the variances follows from writing the cost function as a sum C = Cq + Cp = log q(S, θ) + − log p(S, θ, X) . [sent-89, score-0.272]
57 (11) A parameter θi that is assumed independent of others under q and has a Gaussian posterior approximation q(θi ) = N (θi ; θi , θi ), only affects the corresponding negentropy term −1/2 log(2πeθi ) in Cq . [sent-90, score-0.223]
58 The required partial derivatives can be evaluated analytically with simple backpropagation like computations with the MLP network. [sent-94, score-0.25]
59 The quadratures used at hidden nodes lead to analytical expressions for the means and variances of the hidden nodes and the corresponding feedback gradients are easy to derive. [sent-95, score-0.436]
60 These derivatives can then be used in a conjugate gradient algorithm to update the means of S and θ f . [sent-97, score-0.179]
61 Due to the flexibility of the MLP network and the gradient based learning algorithm, the nonlinear factor analysis method is sensitive to the initialization. [sent-98, score-0.491]
62 We have used linear PCA for initialization of the means of the sources S. [sent-99, score-0.194]
63 After this, the sources are kept fixed for 20 iterations while only the network weights are updated. [sent-101, score-0.292]
64 4 Other approximation methods Another way to get a more robust approximation for the statistics of f would be to use the deterministic sampling approach used in unscented transform [13] and consecutively in different unscented algorithms. [sent-105, score-0.695]
65 The unscented transform also ignores all the prior information on the form of the nonlinearity. [sent-107, score-0.241]
66 In an experiment of mean and log-variance approximation accuracy with a relatively large random MLP [12], the unscented transform needed over 100 % more time to achieve results with 10 times the mean squared error of the proposed approach. [sent-110, score-0.508]
67 This makes the problem easier as the inputs y(t) of the hidden neurons are Gaussian. [sent-113, score-0.241]
68 By using the cumulative Gaussian distribution or the error function erf as the activation function, the mean of the outputs of the hidden neurons and thus of the outputs of the whole network can be evaluated analytically. [sent-114, score-0.626]
69 The covariances still need to be evaluated numerically, and that is done by evaluating all the correlations of the hidden neurons separately. [sent-115, score-0.44]
70 In a network with H hidden neurons, this requires O(H 2 ) quadrature evaluations. [sent-116, score-0.37]
71 In our case the inputs of the hidden neurons are not Gaussian and hence even the error function as the activation function would not allow for exact evaluation of the means. [sent-117, score-0.346]
72 This is why we have decided to use the standard logistic sigmoid activation function in form of tanh which is more common and faster to evaluate numerically. [sent-118, score-0.177]
73 In our approach all the required means and variances can be evaluated with O(H) quadratures. [sent-119, score-0.291]
74 3 Experiments The proposed nonlinear factor analysis method was tested on natural speech data set consisting of spectrograms of 24 individual words of Finnish speech, spoken by 20 different speakers. [sent-120, score-0.496]
75 Taylor cost (nats / sample) Proposed cost (nats / sample) The data set was tested with two different learning algorithms for the NFA model, one based on the Taylor approximation introduced in [6] and another based on the proposed approximation. [sent-125, score-0.401]
76 Contrary to [6], the algorithm based on Taylor approximation used the same conjugate gradient based optimization algorithm as the new approximation. [sent-126, score-0.233]
77 This helped greatly in stabilizing the algorithm that used to be rather unstable with high source dimensionalities due to sensitivity of the Taylor approximation in regions where it is not really valid. [sent-127, score-0.175]
78 The number of hidden neurons in the MLP network was 40. [sent-129, score-0.246]
79 1 55 50 45 55 50 45 45 50 55 Reference cost (nats / sample) 45 50 55 Reference cost (nats / sample) Figure 1: The attained values of C in different simulations as evaluated by the different approximations plotted against reference values evaluated by sampling. [sent-131, score-0.689]
80 1 shows a comparison of the cost function values evaluated by the different approximations and a reference value evaluated by sampling. [sent-134, score-0.528]
81 The reference cost values were evaluated by sampling 400 points from the distribution q(S, θ f ), evaluating f (s, θ f ) at those points, and using the mean and variance of the output points in the cost function evaluation. [sent-135, score-0.689]
82 The standard deviation of the values was 5 · 10−3 nats per sample which should not show at all in the figures. [sent-137, score-0.197]
83 1 show that the proposed approximation yields consistently very 1 The Matlab code used in the experiments is available at http://www. [sent-141, score-0.205]
84 Both values are compared to reference values evaluated by sampling. [sent-148, score-0.252]
85 2, which shows the attained cost as a function of number of sources used. [sent-152, score-0.324]
86 The Taylor approximation shows a decrease in cost as the number of the sources increases even though the true cost is increasing rapidly. [sent-153, score-0.531]
87 4 Discussion The problem of estimating the statistics of a nonlinear transform of a probability distribution is also encountered in nonlinear extensions of Kalman filtering. [sent-155, score-0.557]
88 The Taylor approximation corresponds to extended Kalman filter and the new approximation can be seen as a modification of it with a more accurate linearization. [sent-156, score-0.331]
89 The proposed method is somewhat similar to unscented Kalman filtering based on the unscented transform [13], but much better suited for high dimensional MLP-like nonlinearities. [sent-158, score-0.475]
90 This is not very surprising, as worst case complexity of general Gaussian integration is exponential with respect to the dimensionality of the input [14] and unscented transform as a general method with linear complexity is bound to be less accurate in high dimensional problems. [sent-159, score-0.385]
91 In case of the MLP, the complexity of the unscented transform depends on the number of all weights, which in our case with 20 sources can be more than 2000. [sent-160, score-0.404]
92 5 Conclusions In this paper we have proposed a novel approximation method for unsupervised MLP networks in variational Bayesian learning. [sent-161, score-0.422]
93 The approximation is based on using numerical Gauss–Hermite quadratures to evaluate the global effect of the nonlinear activation function of the network to produce an effective linearization of the MLP. [sent-162, score-0.821]
94 The statistics of the outputs of the linearized network can be evaluated exactly to get accurate and reliable estimates of the statistics of the MLP outputs. [sent-163, score-0.413]
95 These can be used to evaluate the standard variational Bayesian ensemble learning cost function C and numerically minimize it using a hybrid fixed point / conjugate gradient algorithm. [sent-164, score-0.53]
96 We have demonstrated the method with a nonlinear factor analysis model and a real world speech data set. [sent-165, score-0.419]
97 The presented method can be used together with linear ICA for nonlinear BSS [7], and the approximation can be easily applied to more complex models such as nonlinear independent factor analysis [6] and nonlinear state-space models [9]. [sent-167, score-1.028]
98 An unsupervised ensemble learning method for nonlinear dynamic state-space models. [sent-230, score-0.356]
99 Approximating nonlinear transformations of probability distributions for nonlinear independent component analysis. [sent-247, score-0.525]
100 A general method for approximating nonlinear transformations of probability distributions. [sent-257, score-0.286]
wordName wordTfidf (topN-words)
[('mlp', 0.477), ('nonlinear', 0.245), ('quadrature', 0.196), ('erent', 0.185), ('unscented', 0.174), ('taylor', 0.167), ('nats', 0.167), ('variational', 0.165), ('sources', 0.163), ('evaluated', 0.162), ('yi', 0.161), ('di', 0.142), ('approximation', 0.14), ('quadratures', 0.137), ('hermite', 0.119), ('cost', 0.114), ('nfa', 0.11), ('gh', 0.109), ('gauss', 0.109), ('variance', 0.107), ('activation', 0.105), ('variances', 0.098), ('reference', 0.09), ('network', 0.089), ('factor', 0.086), ('sj', 0.085), ('hidden', 0.085), ('underestimation', 0.082), ('valpola', 0.082), ('bayesian', 0.082), ('sub', 0.081), ('ensemble', 0.074), ('neurons', 0.072), ('wk', 0.07), ('transform', 0.067), ('bss', 0.065), ('linearization', 0.065), ('kalman', 0.064), ('aij', 0.064), ('tk', 0.062), ('speech', 0.056), ('derivatives', 0.055), ('abscissas', 0.055), ('honkela', 0.055), ('linearize', 0.055), ('evaluating', 0.055), ('ective', 0.055), ('barber', 0.055), ('conjugate', 0.054), ('accurate', 0.051), ('latent', 0.051), ('cq', 0.048), ('cx', 0.048), ('posterior', 0.048), ('networks', 0.047), ('attained', 0.047), ('mean', 0.047), ('ects', 0.046), ('evaluation', 0.046), ('easier', 0.046), ('nonlinearity', 0.045), ('diagonal', 0.044), ('numerically', 0.044), ('factors', 0.044), ('spectrograms', 0.044), ('linearized', 0.044), ('karhunen', 0.044), ('gaussian', 0.042), ('approximating', 0.041), ('underestimate', 0.041), ('ica', 0.04), ('evaluate', 0.04), ('weights', 0.04), ('gradient', 0.039), ('cp', 0.038), ('inputs', 0.038), ('unsupervised', 0.037), ('worst', 0.036), ('independent', 0.035), ('unstable', 0.035), ('reliable', 0.034), ('covariances', 0.033), ('outputs', 0.033), ('analytically', 0.033), ('proposed', 0.033), ('correlations', 0.033), ('sigmoid', 0.032), ('analysis', 0.032), ('yields', 0.032), ('means', 0.031), ('sample', 0.03), ('bound', 0.03), ('log', 0.03), ('blind', 0.029), ('variables', 0.028), ('gure', 0.028), ('ect', 0.027), ('dimensional', 0.027), ('rbf', 0.027), ('perceptron', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
Author: Antti Honkela, Harri Valpola
Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1
2 0.14287841 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
Author: Changjiang Yang, Ramani Duraiswami, Larry S. Davis
Abstract: The computation and memory required for kernel machines with N training samples is at least O(N 2 ). Such a complexity is significant even for moderate size problems and is prohibitive for large datasets. We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N ). We also give an error bound for the approximation, and provide experimental results on the UCI datasets. 1
Author: Tobias Blaschke, Laurenz Wiskott
Abstract: In contrast to the equivalence of linear blind source separation and linear independent component analysis it is not possible to recover the original source signal from some unknown nonlinear transformations of the sources using only the independence assumption. Integrating the objectives of statistical independence and temporal slowness removes this indeterminacy leading to a new method for nonlinear blind source separation. The principle of temporal slowness is adopted from slow feature analysis, an unsupervised method to extract slowly varying features from a given observed vectorial signal. The performance of the algorithm is demonstrated on nonlinearly mixed speech data. 1
4 0.11257113 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
Author: Rasmus K. Olsson, Lars K. Hansen
Abstract: We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
5 0.10642462 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
Author: Elizaveta Levina, Peter J. Bickel
Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1
6 0.10593185 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
7 0.088263549 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
8 0.086532921 178 nips-2004-Support Vector Classification with Input Data Uncertainty
9 0.084415987 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
10 0.082045481 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
11 0.079310618 163 nips-2004-Semi-parametric Exponential Family PCA
12 0.077739514 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid
13 0.076949961 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference
14 0.070934139 50 nips-2004-Dependent Gaussian Processes
15 0.070458516 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons
16 0.068569794 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
17 0.06691923 82 nips-2004-Incremental Algorithms for Hierarchical Classification
18 0.06669239 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
19 0.064867973 44 nips-2004-Conditional Random Fields for Object Recognition
20 0.064112671 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
topicId topicWeight
[(0, -0.232), (1, -0.033), (2, -0.049), (3, -0.039), (4, -0.093), (5, -0.061), (6, 0.083), (7, 0.104), (8, 0.034), (9, -0.009), (10, 0.202), (11, -0.09), (12, 0.028), (13, -0.057), (14, -0.044), (15, 0.025), (16, -0.099), (17, 0.1), (18, 0.065), (19, -0.091), (20, -0.058), (21, -0.125), (22, -0.035), (23, 0.013), (24, -0.031), (25, 0.064), (26, 0.02), (27, -0.03), (28, 0.11), (29, -0.027), (30, -0.021), (31, 0.035), (32, -0.082), (33, 0.041), (34, -0.064), (35, -0.032), (36, -0.078), (37, 0.127), (38, 0.08), (39, -0.127), (40, -0.035), (41, 0.123), (42, 0.041), (43, 0.002), (44, -0.147), (45, -0.176), (46, -0.078), (47, -0.06), (48, 0.064), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.95795786 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
Author: Antti Honkela, Harri Valpola
Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1
2 0.59680611 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
Author: Changjiang Yang, Ramani Duraiswami, Larry S. Davis
Abstract: The computation and memory required for kernel machines with N training samples is at least O(N 2 ). Such a complexity is significant even for moderate size problems and is prohibitive for large datasets. We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N ). We also give an error bound for the approximation, and provide experimental results on the UCI datasets. 1
Author: Tobias Blaschke, Laurenz Wiskott
Abstract: In contrast to the equivalence of linear blind source separation and linear independent component analysis it is not possible to recover the original source signal from some unknown nonlinear transformations of the sources using only the independence assumption. Integrating the objectives of statistical independence and temporal slowness removes this indeterminacy leading to a new method for nonlinear blind source separation. The principle of temporal slowness is adopted from slow feature analysis, an unsupervised method to extract slowly varying features from a given observed vectorial signal. The performance of the algorithm is demonstrated on nonlinearly mixed speech data. 1
4 0.53464907 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes
Author: Yoshitatsu Matsuda, Kazunori Yamaguchi
Abstract: In this paper, linear multilayer ICA (LMICA) is proposed for extracting independent components from quite high-dimensional observed signals such as large-size natural scenes. There are two phases in each layer of LMICA. One is the mapping phase, where a one-dimensional mapping is formed by a stochastic gradient algorithm which makes more highlycorrelated (non-independent) signals be nearer incrementally. Another is the local-ICA phase, where each neighbor (namely, highly-correlated) pair of signals in the mapping is separated by the MaxKurt algorithm. Because LMICA separates only the highly-correlated pairs instead of all ones, it can extract independent components quite efficiently from appropriate observed signals. In addition, it is proved that LMICA always converges. Some numerical experiments verify that LMICA is quite efficient and effective in large-size natural image processing.
5 0.47853351 82 nips-2004-Incremental Algorithms for Hierarchical Classification
Author: Nicolò Cesa-bianchi, Claudio Gentile, Andrea Tironi, Luca Zaniboni
Abstract: We study the problem of hierarchical classification when labels corresponding to partial and/or multiple paths in the underlying taxonomy are allowed. We introduce a new hierarchical loss function, the H-loss, implementing the simple intuition that additional mistakes in the subtree of a mistaken class should not be charged for. Based on a probabilistic data model introduced in earlier work, we derive the Bayes-optimal classifier for the H-loss. We then empirically compare two incremental approximations of the Bayes-optimal classifier with a flat SVM classifier and with classifiers obtained by using hierarchical versions of the Perceptron and SVM algorithms. The experiments show that our simplest incremental approximation of the Bayes-optimal classifier performs, after just one training epoch, nearly as well as the hierarchical SVM classifier (which performs best). For the same incremental algorithm we also derive an H-loss bound showing, when data are generated by our probabilistic data model, exponentially fast convergence to the H-loss of the hierarchical classifier based on the true model parameters. 1 Introduction and basic definitions We study the problem of classifying data in a given taxonomy of labels, where the taxonomy is specified as a tree forest. We assume that every data instance is labelled with a (possibly empty) set of class labels called multilabel, with the only requirement that multilabels including some node i in the taxonony must also include all ancestors of i. Thus, each multilabel corresponds to the union of one or more paths in the forest, where each path must start from a root but it can terminate on an internal node (rather than a leaf). Learning algorithms for hierarchical classification have been investigated in, e.g., [8, 9, 10, 11, 12, 14, 15, 17, 20]. However, the scenario where labelling includes multiple and partial paths has received very little attention. The analysis in [5], which is mainly theoretical, shows in the multiple and partial path case a 0/1-loss bound for a hierarchical learning algorithm based on regularized least-squares estimates. In this work we extend [5] in several ways. First, we introduce a new hierarchical loss function, the H-loss, which is better suited than the 0/1-loss to analyze hierarchical classification tasks, and we derive the corresponding Bayes-optimal classifier under the parametric data model introduced in [5]. Second, considering various loss functions, including the H-loss, we empirically compare the performance of the following three incremental kernel-based ∗ This work was supported in part by the PASCAL Network of Excellence under EC grant no. 506778. This publication only reflects the authors’ views. algorithms: 1) a hierarchical version of the classical Perceptron algorithm [16]; 2) an approximation to the Bayes-optimal classifier; 3) a simplified variant of this approximation. Finally, we show that, assuming data are indeed generated according to the parametric model mentioned before, the H-loss of the algorithm in 3) converges to the H-loss of the classifier based on the true model parameters. Our incremental algorithms are based on training linear-threshold classifiers in each node of the taxonomy. A similar approach has been studied in [8], though their model does not consider multiple-path classifications as we do. Incremental algorithms are the main focus of this research, since we strongly believe that they are a key tool for coping with tasks where large quantities of data items are generated and the classification system needs to be frequently adjusted to keep up with new items. However, we found it useful to provide a reference point for our empirical results. Thus we have also included in our experiments the results achieved by nonincremental algorithms. In particular, we have chosen a flat and a hierarchical version of SVM [21, 7, 19], which are known to perform well on the textual datasets considered here. We assume data elements are encoded as real vectors x ∈ Rd which we call instances. A multilabel for an instance x is any subset of the set {1, . . . , N } of all labels/classes, including the empty set. We denote the multilabel associated with x by a vector y = (y1 , . . . , yN ) ∈ {0, 1}N , where i belongs to the multilabel of x if and only if yi = 1. A taxonomy G is a forest whose trees are defined over the set of labels. A multilabel y ∈ {0, 1}N is said to respect a taxonomy G if and only if y is the union of one or more paths in G, where each path starts from a root but need not terminate on a leaf. See Figure 1. We assume the data-generating mechanism produces examples (x, y) such that y respects some fixed underlying taxonomy G with N nodes. The set of roots in G is denoted by root(G). We use par(i) to denote the unique parent of node i, anc(i) to denote the set of ancestors of i, and sub(i) to denote the set of nodes in the subtree rooted at i (including i). Finally, given a predicate φ over a set Ω, we will use {φ} to denote both the subset of Ω where φ is true and the indicator function of this subset. 2 The H-loss Though several hierarchical losses have been proposed in the literature (e.g., in [11, 20]), no one has emerged as a standard yet. Since hierarchical losses are defined over multilabels, we start by considering two very simple functions measuring the discrepancy between multilabels y = (y1 , ..., yN ) and y = (y1 , ..., yN ): the 0/1-loss 0/1 (y, y) = {∃i : yi = yi } and the symmetric difference loss ∆ (y, y) = {y1 = y1 } + . . . + {yN = yN }. There are several ways of making these losses depend on a given taxonomy G. In this work, we follow the intuition “if a mistake is made at node i, then further mistakes made in the subtree rooted at i are unimportant”. That is, we do not require the algorithm be able to make fine-grained distinctions on tasks when it is unable to make coarse-grained ones. For example, if an algorithm failed to label a document with the class SPORTS, then the algorithm should not be charged more loss because it also failed to label the same document with the subclass SOCCER and the sub-subclass CHAMPIONS LEAGUE. A function implementing this intuition is defined by N H (y, y) = i=1 ci {yi = yi ∧ yj = yj , j ∈ anc(i)}, where c1 , . . . , cN > 0 are fixed cost coefficients. This loss, which we call H-loss, can also be described as follows: all paths in G from a root down to a leaf are examined and, whenever we encounter a node i such that yi = yi , we add ci to the loss, whereas all the loss contributions in the subtree rooted at i are discarded. Note that if c1 = . . . = cN = 1 then 0/1 ≤ H ≤ ∆ . Choices of ci depending on the structure of G are proposed in Section 4. Given a multilabel y ∈ {0, 1}N define its G-truncation as the multilabel y = (y1 , ..., yN ) ∈ {0, 1}N where, for each i = 1, . . . , N , yi = 1 iff yi = 1 and yj = 1 for all j ∈ anc(i). Note that the G-truncation of any multilabel always respects G. A graphical (a) (b) (c) (d) Figure 1: A one-tree forest (repeated four times). Each node corresponds to a class in the taxonomy G, hence in this case N = 12. Gray nodes are included in the multilabel under consideration, white nodes are not. (a) A generic multilabel which does not respect G; (b) its G-truncation. (c) A second multilabel that respects G. (d) Superposition of multilabel (b) on multilabel (c): Only the checked nodes contribute to the H-loss between (b) and (c). representation of the notions introduced so far is given in Figure 1. In the next lemma we show that whenever y respects G, then H (y, y) cannot be smaller than H (y , y). In other words, when the multilabel y to be predicted respects a taxonomy G then there is no loss of generality in restricting to predictions which respect G. Lemma 1 Let G be a taxonomy, y, y ∈ {0, 1}N be two multilabels such that y respects G, and y be the G-truncation of y. Then H (y , y) ≤ H (y, y) . Proof. For each i = 1, . . . , N we show that yi = yi and yj = yj for all j ∈ anc(i) implies yi = yi and yj = yj for all j ∈ anc(i). Pick some i and suppose yi = yi and yj = yj for all j ∈ anc(i). Now suppose yj = 0 (and thus yj = 0) for some j ∈ anc(i). Then yi = 0 since y respects G. But this implies yi = 1, contradicting the fact that the G-truncation y respects G. Therefore, it must be the case that yj = yj = 1 for all j ∈ anc(i). Hence the G-truncation of y left each node j ∈ anc(i) unchanged, implying yj = yj for all j ∈ anc(i). But, since the G-truncation of y does not change the value of a node i whose ancestors j are such that yj = 1, this also implies yi = yi . Therefore yi = yi and the proof is concluded. 3 A probabilistic data model Our learning algorithms are based on the following statistical model for the data, originally introduced in [5]. The model defines a probability distribution fG over the set of multilabels respecting a given taxonomy G by associating with each node i of G a Bernoulli random variable Yi and defining fG (y | x) = N i=1 P Yi = yi | Ypar(i) = ypar(i) , X = x . To guarantee that fG (y | x) = 0 whenever y ∈ {0, 1}N does not respect G, we set P Yi = 1 | Ypar(i) = 0, X = x = 0. Notice that this definition of fG makes the (rather simplistic) assumption that all Yk with the same parent node i (i.e., the children of i) are independent when conditioned on Yi and x. Through fG we specify an i.i.d. process {(X 1 , Y 1 ), (X 2 , Y 2 ), . . .}, where, for t = 1, 2, . . ., the multilabel Y t is distributed according to fG (· | X t ) and X t is distributed according to a fixed and unknown distribution D. Each example (xt , y t ) is thus a realization of the corresponding pair (X t , Y t ) of random variables. Our parametric model for fG is described as follows. First, we assume that the support of D is the surface of the d-dimensional unit sphere (i.e., instances x ∈ R d are such that ||x|| = 1). With each node i in the taxonomy, we associate a unit-norm weight vector ui ∈ Rd . Then, we define the conditional probabilities for a nonroot node i with parent j by P (Yi = 1 | Yj = 1, X = x) = (1 + ui x)/2. If i is a root node, the previous equation simplifies to P (Yi = 1 | X = x) = (1 + ui x)/2. 3.1 The Bayes-optimal classifier for the H-loss We now describe a classifier, called H - BAYES, that is the Bayes-optimal classifier for the H-loss. In other words, H - BAYES classifies any instance x with the multilabel y = argminy∈{0,1} E[ H (¯ , Y ) | x ]. Define pi (x) = P Yi = 1 | Ypar(i) = 1, X = x . y ¯ When no ambiguity arises, we write pi instead of pi (x). Now, fix any unit-length instance x and let y be a multilabel that respects G. For each node i in G, recursively define H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ) + k∈child(i) H k,x (y) . The classifier H - BAYES operates as follows. It starts by putting all nodes of G in a set S; nodes are then removed from S one by one. A node i can be removed only if i is a leaf or if all nodes j in the subtree rooted at i have been already removed. When i is removed, its value yi is set to 1 if and only if pi 2 − k∈child(i) H k,x (y)/ci ≥ 1 . (1) (Note that if i is a leaf then (1) is equivalent to yi = {pi ≥ 1/2}.) If yi is set to zero, then all nodes in the subtree rooted at i are set to zero. Theorem 2 For any taxonomy G and all unit-length x ∈ Rd , the multilabel generated by H - BAYES is the Bayes-optimal classification of x for the H-loss. Proof sketch. Let y be the multilabel assigned by H - BAYES and y ∗ be any multilabel minimizing the expected H-loss. Introducing the short-hand Ex [·] = E[· | x], we can write Ex H (y, Y )= N i=1 ci (pi (1 − yi ) + (1 − pi )yi ) j∈anc(i) pj {yj = 1} . Note that we can recursively decompose the expected H-loss as Ex H (y, Y )= i∈root(G) where Ex Hi (y, Y ) = ci (pi (1 − yi ) + (1 − pi )yi ) Ex Hi (y, Y ), pj {yj = 1} + j∈anc(i) Ex Hk (y, Y ) . (2) k∈child(i) ∗ Pick a node i. If i is a leaf, then the sum in the RHS of (2) disappears and yi = {pi ≥ 1/2}, ∗ which is also the minimizer of H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ), implying yi = yi . ∗ Now let i be an internal node and inductively assume yj = yj for all j ∈ sub(i). Notice ∗ that the factors j∈anc(i) pj {yj = 1} occur in both terms in the RHS of (2). Hence yi does not depend on these factors and we can equivalently minimize ci (pi (1 − yi ) + (1 − pi )yi ) + pi {yi = 1} k∈child(i) H k,x (y), (3) where we noted that, for each k ∈ child(i), Ex Hk (y, Y ) = j∈anc(i) pj {yj = 1} pi {yi = 1}H k,x (y) . ∗ Now observe that yi minimizing (3) is equivalent to the assignment produced by H - BAYES. ∗ ∗ To conclude the proof, note that whenever yi = 0, Lemma 1 requires that yj = 0 for all nodes j ∈ sub(i), which is exactly what H - BAYES does. 4 The algorithms We consider three incremental algorithms. Each one of these algorithms learns a hierarchical classifier by training a decision function gi : Rd → {0, 1} at each node i = 1, . . . , N . For a given set g1 , . . . , gN of decision functions, the hierarchical classifier generated by these algorithms classifies an instance x through a multilabel y = (y1 , ..., yN ) defined as follows: yi = gi (x) 0 if i ∈ root(G) or yj = 1 for all j ∈ anc(i) otherwise. (4) Note that y computed this way respects G. The classifiers (4) are trained incrementally. Let gi,t be the decision function at node i after training on the first t − 1 examples. When the next training example (xt , y t ) is available, the algorithms compute the multilabel y t using classifier (4) based on g1,t (xt ), . . . , gN,t (xt ). Then, the algorithms consider for an update only those decision functions sitting at nodes i satisfying either i ∈ root(G) or ypar(i),t = 1. We call such nodes eligible at time t. The decision functions of all other nodes are left unchanged. The first algorithm we consider is a simple hierarchical version of the Perceptron algorithm [16], which we call H - PERC. The decision functions at time t are defined by gi,t (xt ) = {wi,t xt ≥ 0}. In the update phase, the Perceptron rule wi,t+1 = wi,t + yi,t xt is applied to every node i eligible at time t and such that yi,t = yi,t . The second algorithm, called APPROX - H - BAYES, approximates the H - BAYES classifier of Section 3.1 by replacing the unknown quantities pi (xt ) with estimates (1+w i,t xt )/2. The weights w i,t are regularized least-squares estimates defined by (i) wi,t = (I + Si,t−1 Si,t−1 + xt xt )−1 Si,t−1 y t−1 . (5) The columns of the matrix Si,t−1 are all past instances xs that have been stored at node i; (i) the s-th component of vector y t−1 is the i-th component yi,s of the multilabel y s associated with instance xs . In the update phase, an instance xt is stored at node i, causing an update of wi,t , whenever i is eligible at time t and |w i,t xt | ≤ (5 ln t)/Ni,t , where Ni,t is the number of instances stored at node i up to time t − 1. The corresponding decision functions gi,t are of the form gi,t (xt ) = {w i,t xt ≥ τi,t }, where the threshold τi,t ≥ 0 at node i depends on the margin values w j,t xt achieved by nodes j ∈ sub(i) — recall (1). Note that gi,t is not a linear-threshold function, as xt appears in the definition of w i,t . The margin threshold (5 ln t)/Ni,t , controlling the update of node i at time t, reduces the space requirements of the classifier by keeping matrices Si,t suitably small. This threshold is motivated by the work [4] on selective sampling. The third algorithm, which we call H - RLS (Hierarchical Regularized Least Squares), is a simplified variant of APPROX - H - BAYES in which the thresholds τi,t are set to zero. That is, we have gi,t (xt ) = {w i,t xt ≥ 0} where the weights w i,t are defined as in (5) and updated as in the APPROX - H - BAYES algorithm. Details on how to run APPROX - H - BAYES 2 and H - RLS in dual variables and perform an update at node i in time O(Ni,t ) are found in [3] (where a mistake-driven version of H - RLS is analyzed). 5 Experimental results The empirical evaluation of the algorithms was carried out on two well-known datasets of free-text documents. The first dataset consists of the first (in chronological order) 100,000 newswire stories from the Reuters Corpus Volume 1, RCV1 [2]. The associated taxonomy of labels, which are the topics of the documents, has 101 nodes organized in a forest of 4 trees. The forest is shallow: the longest path has length 3 and the the distribution of nodes, sorted by increasing path length, is {0.04, 0.53, 0.42, 0.01}. For this dataset, we used the bag-of-words vectorization performed by Xerox Research Center Europe within the EC project KerMIT (see [4] for details on preprocessing). The 100,000 documents were divided into 5 equally sized groups of chronologically consecutive documents. We then used each adjacent pair of groups as training and test set in an experiment (here the fifth and first group are considered adjacent), and then averaged the test set performance over the 5 experiments. The second dataset is a specific subtree of the OHSUMED corpus of medical abstracts [1]: the subtree rooted in “Quality of Health Care” (MeSH code N05.715). After removing overlapping classes (OHSUMED is not quite a tree but a DAG), we ended up with 94 Table 1: Experimental results on two hierarchical text classification tasks under various loss functions. We report average test errors along with standard deviations (in parenthesis). In bold are the best performance figures among the incremental algorithms. RCV1 PERC H - PERC H - RLS AH - BAY SVM H - SVM OHSU. PERC H - PERC H - RLS AH - BAY SVM H - SVM 0/1-loss 0.702(±0.045) 0.655(±0.040) 0.456(±0.010) 0.550(±0.010) 0.482(±0.009) 0.440(±0.008) unif. H-loss 1.196(±0.127) 1.224(±0.114) 0.743(±0.026) 0.815(±0.028) 0.790(±0.023) 0.712(±0.021) norm. H-loss 0.100(±0.029) 0.099(±0.028) 0.057(±0.001) 0.090(±0.001) 0.057(±0.001) 0.055(±0.001) ∆-loss 1.695(±0.182) 1.861(±0.172) 1.086(±0.036) 1.465(±0.040) 1.173(±0.051) 1.050(±0.027) 0/1-loss 0.899(±0.024) 0.846(±0.024) 0.769(±0.004) 0.819(±0.004) 0.784(±0.003) 0.759(±0.002) unif. H-loss 1.938(±0.219) 1.560(±0.155) 1.200(±0.007) 1.197(±0.006) 1.206(±0.003) 1.170(±0.005) norm. H-loss 0.058(±0.005) 0.057(±0.005) 0.045(±0.000) 0.047(±0.000) 0.044(±0.000) 0.044(±0.000) ∆-loss 2.639(±0.226) 2.528(±0.251) 1.957(±0.011) 2.029(±0.009) 1.872(±0.005) 1.910(±0.007) classes and 55,503 documents. We made this choice based only on the structure of the subtree: the longest path has length 4, the distribution of nodes sorted by increasing path length is {0.26, 0.37, 0.22, 0.12, 0.03}, and there are a significant number of partial and multiple path multilabels. The vectorization of the subtree was carried out as follows: after tokenization, we removed all stopwords and also those words that did not occur at least 3 times in the corpus. Then, we vectorized the documents using the Bow library [13] with a log(1 + TF) log(IDF) encoding. We ran 5 experiments by randomly splitting the corpus in a training set of 40,000 documents and a test set of 15,503 documents. Test set performances are averages over these 5 experiments. In the training set we kept more documents than in the RCV1 splits since the OHSUMED corpus turned out to be a harder classification problem than RCV1. In both datasets instances have been normalized to unit length. We tested the hierarchical Perceptron algorithm (H - PERC), the hierarchical regularized leastsquares algorithm (H - RLS), and the approximated Bayes-optimal algorithm (APPROX - H BAYES ), all described in Section 4. The results are summarized in Table 1. APPROX - H BAYES ( AH - BAY in Table 1) was trained using cost coefficients c i chosen as follows: if i ∈ root(G) then ci = |root(G)|−1 . Otherwise, ci = cj /|child(j)|, where j is the parent of i. Note that this choice of coefficients amounts to splitting a unit cost equally among the roots and then splitting recursively each node’s cost equally among its children. Since, in this case, 0 ≤ H ≤ 1, we call the resulting loss normalized H-loss. We also tested a hierarchical version of SVM (denoted by H - SVM in Table 1) in which each node is an SVM classifier trained using a batch version of our hierarchical learning protocol. More precisely, each node i was trained only on those examples (xt , y t ) such that ypar(i),t = 1 (note that, as no conditions are imposed on yi,t , node i is actually trained on both positive and negative examples). The resulting set of linear-threshold functions was then evaluated on the test set using the hierachical classification scheme (4). We tried both the C and ν parametrizations [18] for SVM and found the setting C = 1 to work best for our data. 1 We finally tested the “flat” variants of Perceptron and SVM, denoted by PERC and SVM. In these variants, each node is trained and evaluated independently of the others, disregarding all taxonomical information. All SVM experiments were carried out using the libSVM implementation [6]. All the tested algorithms used a linear kernel. 1 It should be emphasized that this tuning of C was actually chosen in hindsight, with no crossvalidation. As far as loss functions are concerned, we considered the 0/1-loss, the H-loss with cost coefficients set to 1 (denoted by uniform H-loss), the normalized H-loss, and the symmetric difference loss (denoted by ∆-loss). Note that H - SVM performs best, but our incremental algorithms were trained for a single epoch on the training set. The good performance of SVM (the flat variant of H - SVM ) is surprising. However, with a single epoch of training H - RLS does not perform worse than SVM (except on OHSUMED under the normalized H-loss) and comes reasonably close to H - SVM. On the other hand, the performance of APPROX - H - BAYES is disappointing: on OHSUMED it is the best algorithm only for the uniform H-loss, though it was trained using the normalized H-loss; on RCV1 it never outperforms H - RLS, though it always does better than PERC and H - PERC. A possible explanation for this behavior is that APPROX - H - BAYES is very sensitive to errors in the estimates of pi (x) (recall Section 3.1). Indeed, the least-squares estimates (5), which we used to approximate H - BAYES, seem to work better in practice on simpler (and possibly more robust) algorithms, such as H - RLS. The lower values of normalized H-loss on OHSUMED (a harder corpus than RCV1) can be explained because a quarter of the 94 nodes in the OHSUMED taxonomy are roots, and thus each top-level mistake is only charged about 4/94. As a final remark, we observe that the normalized H-loss gave too small a range of values to afford fine comparisons among the best performing algorithms. 6 Regret bounds for the H-loss In this section we prove a theoretical bound on the H-loss of a slight variant of the algorithm H - RLS tested in Section 5. More precisely, we assume data are generated according to the probabilistic model introduced in Section 3 with unknown instance distribution D and unknown coefficients u1 , . . . , uN . We define the regret of a classifier assigning label y to instance X as E H (y, Y t ) − E H (y, Y ), where the expected value is with respect the random draw of (X, Y ) and y is the multilabel assigned by classifier (4) when the decision functions gi are zero-threshold functions of the form gi (x) = {ui x ≥ 0}. The theorem below shows that the regret of the classifier learned by a variant of H - RLS after t training examples, with t large enough, is exponentially small in t. In other words, H - RLS learns to classify as well as the algorithm that is given the true parameters u1 , . . . , uN of the underlying data-generating process. We have been able to prove the theorem only for the variant of H - RLS storing all instances at each node. That is, every eligible node at time t is updated, irrespective of whether |w i,t xt | ≤ (5 ln t)/Ni,t . Given the i.i.d. data-generating process (X 1 , Y 1 ), (X 2 , Y 2 ), . . ., for each node k we define the derived process X k1 , X k2 , . . . including all and only the instances X s of the original process that satisfy Ypar(k),s = 1. We call this derived process the process at node k. Note that, for each k, the process at node k is an i.i.d. process. However, its distribution might depend on k. The spectrum of the process at node k is the set of eigenvalues of the correlation matrix with entries E[Xk1 ,i Xk1 ,j ] for i, j = 1, . . . , d. We have the following theorem, whose proof is omitted due to space limitations. Theorem 3 Let G be a taxonomy with N nodes and let fG be a joint density for G parametrized by N unit-norm vectors u1 , . . . , uN ∈ Rd . Assume the instance distribution is such that there exist γ1 , . . . , γN > 0 satisfying P |ui X t | ≥ γi = 1 for i = 1, . . . , N . Then, for all t > max maxi=1,...,N E H (y t , Y t ) −E 16 µ i λ i γi , maxi=1,...,N 192d µi λ 2 i the regret H (y t , Y t ) of the modified H - RLS algorithm is at most N 2 2 µi t e−κ1 γi λi t µi + t2 e−κ2 λi t µi cj , i=1 j∈sub(i) where κ1 , κ2 are constants, µi = E j∈anc(i) (1 + uj X)/2 eigenvalue in the spectrum of the process at node i. and λi is the smallest 7 Conclusions and open problems In this work we have studied the problem of hierarchical classification of data instances in the presence of partial and multiple path labellings. We have introduced a new hierarchical loss function, the H-loss, derived the corresponding Bayes-optimal classifier, and empirically compared an incremental approximation to this classifier with some other incremental and nonincremental algorithms. Finally, we have derived a theoretical guarantee on the H-loss of a simplified variant of the approximated Bayes-optimal algorithm. Our investigation leaves several open issues. The current approximation to the Bayesoptimal classifier is not satisfying, and this could be due to a bad choice of the model, of the estimators, of the datasets, or of a combination of them. Also, the normalized H-loss is not fully satisfying, since the resulting values are often too small. From the theoretical viewpoint, we would like to analyze the regret of our algorithms with respect to the Bayesoptimal classifier, rather than with respect to a classifier that makes a suboptimal use of the true model parameters. References [1] The OHSUMED test collection. URL: medir.ohsu.edu/pub/ohsumed/. [2] Reuters corpus volume 1. URL: about.reuters.com/researchandstandards/corpus/. [3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. In Proc. 15th COLT, pages 121–137. Springer, 2002. [4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In Proc. 16th COLT, pages 373–386. Springer, 2003. [5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Regret bounds for hierarchical classification with linear-threshold functions. In Proc. 17th COLT. Springer, 2004. To appear. [6] C.-C. Chang and C.-J. Lin. Libsvm — a library for support vector machines. URL: www.csie.ntu.edu.tw/∼cjlin/libsvm/. [7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2001. [8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proc. 21st ICML. Omnipress, 2004. [9] S.T. Dumais and H. Chen. Hierarchical classification of web content. In Proc. 23rd ACM Int. Conf. RDIR, pages 256–263. ACM Press, 2000. [10] M. Granitzer. Hierarchical Text Classification using Methods from Machine Learning. PhD thesis, Graz University of Technology, 2003. [11] T. Hofmann, L. Cai, and M. Ciaramita. Learning with taxonomies: Classifying documents and words. In NIPS Workshop on Syntax, Semantics, and Statistics, 2003. [12] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proc. 14th ICML, Morgan Kaufmann, 1997. [13] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. URL: www-2.cs.cmu.edu/∼mccallum/bow/. [14] A.K. McCallum, R. Rosenfeld, T.M. Mitchell, and A.Y. Ng. Improving text classification by shrinkage in a hierarchy of classes. In Proc. 15th ICML. Morgan Kaufmann, 1998. [15] D. Mladenic. Turning yahoo into an automatic web-page classifier. In Proceedings of the 13th European Conference on Artificial Intelligence, pages 473–474, 1998. [16] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Review, 65:386–408, 1958. [17] M.E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87–118, 2002. [18] B. Sch¨ lkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. o Neural Computation, 12:1207–1245, 2000. [19] B. Sch¨ lkopf and A. Smola. Learning with kernels. MIT Press, 2002. o [20] A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proc. 2001 Int. Conf. Data Mining, pages 521–528. IEEE Press, 2001. [21] V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.
6 0.46982694 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
7 0.44325998 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
8 0.44280231 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
9 0.4404543 50 nips-2004-Dependent Gaussian Processes
10 0.43391198 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
11 0.3986575 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference
12 0.38226587 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid
13 0.37683702 163 nips-2004-Semi-parametric Exponential Family PCA
14 0.35971016 130 nips-2004-Newscast EM
15 0.35897481 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
16 0.354689 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
17 0.35317928 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
18 0.33599618 166 nips-2004-Semi-supervised Learning via Gaussian Processes
19 0.33417299 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
20 0.32946685 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
topicId topicWeight
[(13, 0.114), (15, 0.12), (26, 0.035), (31, 0.023), (33, 0.166), (35, 0.018), (39, 0.391), (50, 0.018), (76, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.87379724 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
Author: Antti Honkela, Harri Valpola
Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1
2 0.86732489 113 nips-2004-Maximum-Margin Matrix Factorization
Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola
Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them. 1
3 0.82674372 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
4 0.69105202 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
5 0.64984876 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
6 0.63699126 124 nips-2004-Multiple Alignment of Continuous Time Series
7 0.62943393 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
8 0.61345589 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
10 0.6125381 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
11 0.60436696 125 nips-2004-Multiple Relational Embedding
12 0.60392523 68 nips-2004-Face Detection --- Efficient and Rank Deficient
13 0.60310179 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
14 0.60156709 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
15 0.59609556 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
16 0.59592438 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
17 0.59197766 178 nips-2004-Support Vector Classification with Input Data Uncertainty
18 0.59140432 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
19 0.59135401 131 nips-2004-Non-Local Manifold Tangent Learning
20 0.59134674 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons