nips nips2004 nips2004-198 knowledge-graph by maker-knowledge-mining

198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models

Source: pdf

Author: Antti Honkela, Harri Valpola

Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 fi/projects/bayes/ Abstract In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. [sent-9, score-0.484]

2 The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. [sent-10, score-0.326]

3 This yields an accurate approximation for cases of large posterior variance. [sent-11, score-0.271]

4 The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. [sent-12, score-0.398]

5 This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. [sent-13, score-0.582]

6 1 Introduction Linear latent variable models such as factor analysis, principal component analysis (PCA) and independent component analysis (ICA) [1] are used in many applications ranging from engineering to social sciences and psychology. [sent-14, score-0.236]

7 In many of these cases, the eﬀect of the desired factors or sources to the observed data is, however, not linear. [sent-15, score-0.207]

8 The method presented in this paper can be used as a basis for many nonlinear latent variable models, such as nonlinear generalizations of the above models. [sent-17, score-0.541]

9 It is based on the variational Bayesian framework, which provides a solid foundation for nonlinear modeling that would otherwise be prone to overﬁtting [2]. [sent-18, score-0.41]

10 It also allows for easy comparison of diﬀerent model structures, which is even more important for ﬂexible nonlinear models than for simpler linear models. [sent-19, score-0.245]

11 General nonlinear generative models for data x(t) of the type x(t) = f (s(t), θ f ) + n(t) = Bφ(As(t) + a) + b + n(t) (1) often employ a multi-layer perceptron (MLP) (as in the equation) or a radial basis function (RBF) network to model the nonlinearity. [sent-20, score-0.361]

12 In context of variational Bayesian methods, RBF networks seem more popular of the two because it is easier to evaluate analytic expressions and bounds for certain key quantities [3]. [sent-22, score-0.298]

13 Most existing applications of variational Bayesian methods for nonlinear models are concerned with the supervised case where the inputs of the network are known and only the weights have to be learned [3, 5]. [sent-27, score-0.577]

14 This is easier as there are fewer parameters with related posterior variance above the nonlinear hidden layer and the distributions thus tend to be easier to handle. [sent-28, score-0.577]

15 In this paper we present a novel method for evaluating the statistics of the outputs of an MLP network in context of unsupervised variational Bayesian learning of its weights and inputs. [sent-29, score-0.419]

16 The method is demonstrated with a nonlinear factor analysis problem. [sent-30, score-0.363]

17 The nonlinearity (1) can be used as a building block of many diﬀerent models depending on the model assumed for the sources S. [sent-33, score-0.208]

18 Simple Gaussian prior on S leads to a nonlinear factor analysis (NFA) model [6, 7] that is studied here because of its simplicity. [sent-34, score-0.363]

19 The method could easily be extended with a mixture-of-Gaussians prior on S [8] to get a nonlinear independent factor analysis model, but this is omitted here. [sent-35, score-0.398]

20 In many nonlinear blind source separation (BSS) problems it is enough to apply simple NFA followed by linear ICA postprocessing to achieve nonlinear BSS [6, 7]. [sent-36, score-0.519]

21 In order to deal with the ﬂexible nonlinear models, a powerful learning paradigm resistant to overﬁtting is needed. [sent-38, score-0.245]

22 The variational Bayesian method of ensemble learning [2] has proven useful here. [sent-39, score-0.239]

23 Ensemble learning is based on approximating the true posterior p(S, θ|X) with a tractable approximation q(S, θ), typically a multivariate Gaussian with a diagonal covariance. [sent-40, score-0.273]

24 The approximation is ﬁtted to minimize the cost q(S, θ) C = log = D(q(S, θ)||p(S, θ|X)) − log p(X) (2) p(S, θ, X) where · denotes expectation over q(S, θ) and D(q||p) is the Kullback-Leibler divergence between q and p. [sent-41, score-0.314]

25 The cost can be evaluated analytically for a large class of mainly linear models [10, 11] leading to simple and eﬃcient learning algorithms. [sent-43, score-0.309]

26 1 Evaluating the cost Unfortunately, the cost (2) cannot be evaluated analytically for the nonlinear model (1). [sent-45, score-0.668]

27 t (3) The term Cx depends on the ﬁrst and second moments of f (s(t), θ f ) over the posterior approximation q(S, θ), and they cannot easily be evaluated analytically. [sent-47, score-0.35]

28 If the activation functions of the MLP network were linear, the output mean and variance could be evaluated exactly using only the mean and variance of the inputs s(t) and θ f . [sent-49, score-0.702]

29 Thus a natural ﬁrst approximation would be to linearize the network about the input mean using derivatives [6]. [sent-50, score-0.386]

30 Due to the local nature of the approximation, this can lead to severe underestimation of the variance, especially when the hidden neurons of the MLP network operate in the saturated region. [sent-52, score-0.328]

31 This makes the nonlinear factor analysis algorithm using this approach unstable with large number of factors because the posterior variance corresponding to the last factors is typically large. [sent-53, score-0.641]

32 To avoid this problem, we propose using a Gauss–Hermite quadrature to evaluate an eﬀective linearization of the nonlinear activation functions φ(yi (t)). [sent-54, score-0.651]

33 The Gauss– Hermite quadrature is a method for approximating weighted integrals ∞ f (x) exp(−x2 ) dx ≈ −∞ wk f (tk ), (5) k where the weights wk and abscissas tk are selected by requiring exact result for suitable number of low-order polynomials. [sent-55, score-0.534]

34 This allows evaluating the mean and variance of φ(yi (t)) by quadratures φ(yi (t))GH = wk φ y i (t) + tk yi (t) (6) k 2 wk φ y i (t) + tk φ(yi (t))GH = yi (t) − φ(yi (t))GH , (7) k respectively. [sent-56, score-0.932]

35 Here the weights and abscissas have been scaled to take into account the Gaussian pdf weight instead of exp(−x2 ), and y i (t) and yi (t) are the mean and variance of yi (t), respectively. [sent-57, score-0.571]

36 We used a three point quadrature that yields accurate enough results but can be evaluated quickly. [sent-58, score-0.441]

37 As both of the quadratures depend on φ at the same points, they can be evaluated together easily. [sent-62, score-0.299]

38 Using the approximation formula φ(yi (t)) = φ (yi (t))2 yi (t), the resulting mean and variance can be interpreted to yield an eﬀective linearization of φ(yi (t)) through φ(yi (t))GH . [sent-63, score-0.52]

39 (8) yi (t) The positive square root is used here because the derivative of the logistic sigmoid used as activation function is always positive. [sent-64, score-0.298]

40 (4), the exact mean and variance of the linearized model can be evaluated in a relatively straightforward manner. [sent-66, score-0.36]

41 Evaluation of the variance due to the sources requires propagating matrices through the network to track the correlations between the hidden units. [sent-67, score-0.477]

42 The same problem does not aﬀect the network weights as each parameter only aﬀects the value of one hidden neuron. [sent-69, score-0.214]

43 2 Details of the approximation The mean and variance of φ(yi (t)) depend on the distribution of yi (t). [sent-71, score-0.455]

44 The Gauss– Hermite quadrature assumes that yi (t) is Gaussian. [sent-72, score-0.357]

45 This is not true in our case, as the product of two independent normally distributed variables aij and sj (t) is super-Gaussian, although rather close to Gaussian if the mean of one of the variables is signiﬁcantly larger in absolute value than the standard deviation. [sent-73, score-0.287]

46 Ignoring the non-Gaussianity, the quadrature depends on the mean and variance of yi (t). [sent-75, score-0.511]

47 These can be evaluated exactly because of the linearity of the mapping as 2 Aij (sj (t)2 + sj (t)) + Aij sj (t) + ai , yi,tot (t) = (9) j where θ denotes the mean and θ the variance of θ. [sent-76, score-0.486]

48 In an experiment investigating the approximation accuracy with a random MLP [12], the Taylor approximation was found to underestimate the output variance by a factor of 400, at worst. [sent-79, score-0.514]

49 The worst case result of the above approximation was underestimation by a factor of 40, which is a great improvement over the Taylor approximation, but still far from perfect. [sent-80, score-0.344]

50 The worst case behavior could be improved to underestimation by a factor of 5 by introducing another quadrature evaluated with a diﬀerent variance for yi (t). [sent-81, score-0.83]

51 The diﬀerence in behavior of the two methods in more realistic cases is less drastic, but the version with two quadratures seems to provide more accurate approximations. [sent-83, score-0.188]

52 The more accurate approximation is implemented by evaluating another quadrature using the variance of yi (t) originating mainly from θ f , Aij (sj (t)2 + sj (t)) + ai , yi,weight (t) = (10) j and using the implied φ (yi (t)) in the evaluation of the eﬀects of these variances. [sent-84, score-0.841]

53 The total variance (9) is still used in evaluation of the means and the evaluation of the eﬀects of the variance of s(t). [sent-85, score-0.337]

54 3 Learning algorithm for nonlinear factor analysis The nonlinear factor analysis (NFA) model [6] is learned by numerically minimizing the cost C evaluated above. [sent-87, score-1.046]

55 The minimization algorithm is a combination of conjugate gradient for the means of S and θ f , ﬁxed point iteration for the variances of S and θ f , and EM like updates for other parameters and hyperparameters. [sent-88, score-0.222]

56 The ﬁxed point update algorithm for the variances follows from writing the cost function as a sum C = Cq + Cp = log q(S, θ) + − log p(S, θ, X) . [sent-89, score-0.272]

57 (11) A parameter θi that is assumed independent of others under q and has a Gaussian posterior approximation q(θi ) = N (θi ; θi , θi ), only aﬀects the corresponding negentropy term −1/2 log(2πeθi ) in Cq . [sent-90, score-0.223]

58 The required partial derivatives can be evaluated analytically with simple backpropagation like computations with the MLP network. [sent-94, score-0.25]

59 The quadratures used at hidden nodes lead to analytical expressions for the means and variances of the hidden nodes and the corresponding feedback gradients are easy to derive. [sent-95, score-0.436]

60 These derivatives can then be used in a conjugate gradient algorithm to update the means of S and θ f . [sent-97, score-0.179]

61 Due to the ﬂexibility of the MLP network and the gradient based learning algorithm, the nonlinear factor analysis method is sensitive to the initialization. [sent-98, score-0.491]

62 We have used linear PCA for initialization of the means of the sources S. [sent-99, score-0.194]

63 After this, the sources are kept ﬁxed for 20 iterations while only the network weights are updated. [sent-101, score-0.292]

64 4 Other approximation methods Another way to get a more robust approximation for the statistics of f would be to use the deterministic sampling approach used in unscented transform [13] and consecutively in diﬀerent unscented algorithms. [sent-105, score-0.695]

65 The unscented transform also ignores all the prior information on the form of the nonlinearity. [sent-107, score-0.241]

66 In an experiment of mean and log-variance approximation accuracy with a relatively large random MLP [12], the unscented transform needed over 100 % more time to achieve results with 10 times the mean squared error of the proposed approach. [sent-110, score-0.508]

67 This makes the problem easier as the inputs y(t) of the hidden neurons are Gaussian. [sent-113, score-0.241]

68 By using the cumulative Gaussian distribution or the error function erf as the activation function, the mean of the outputs of the hidden neurons and thus of the outputs of the whole network can be evaluated analytically. [sent-114, score-0.626]

69 The covariances still need to be evaluated numerically, and that is done by evaluating all the correlations of the hidden neurons separately. [sent-115, score-0.44]

70 In a network with H hidden neurons, this requires O(H 2 ) quadrature evaluations. [sent-116, score-0.37]

71 In our case the inputs of the hidden neurons are not Gaussian and hence even the error function as the activation function would not allow for exact evaluation of the means. [sent-117, score-0.346]

72 This is why we have decided to use the standard logistic sigmoid activation function in form of tanh which is more common and faster to evaluate numerically. [sent-118, score-0.177]

73 In our approach all the required means and variances can be evaluated with O(H) quadratures. [sent-119, score-0.291]

74 3 Experiments The proposed nonlinear factor analysis method was tested on natural speech data set consisting of spectrograms of 24 individual words of Finnish speech, spoken by 20 diﬀerent speakers. [sent-120, score-0.496]

75 Taylor cost (nats / sample) Proposed cost (nats / sample) The data set was tested with two diﬀerent learning algorithms for the NFA model, one based on the Taylor approximation introduced in [6] and another based on the proposed approximation. [sent-125, score-0.401]

76 Contrary to [6], the algorithm based on Taylor approximation used the same conjugate gradient based optimization algorithm as the new approximation. [sent-126, score-0.233]

77 This helped greatly in stabilizing the algorithm that used to be rather unstable with high source dimensionalities due to sensitivity of the Taylor approximation in regions where it is not really valid. [sent-127, score-0.175]

78 The number of hidden neurons in the MLP network was 40. [sent-129, score-0.246]

79 1 55 50 45 55 50 45 45 50 55 Reference cost (nats / sample) 45 50 55 Reference cost (nats / sample) Figure 1: The attained values of C in diﬀerent simulations as evaluated by the different approximations plotted against reference values evaluated by sampling. [sent-131, score-0.689]

80 1 shows a comparison of the cost function values evaluated by the diﬀerent approximations and a reference value evaluated by sampling. [sent-134, score-0.528]

81 The reference cost values were evaluated by sampling 400 points from the distribution q(S, θ f ), evaluating f (s, θ f ) at those points, and using the mean and variance of the output points in the cost function evaluation. [sent-135, score-0.689]

82 The standard deviation of the values was 5 · 10−3 nats per sample which should not show at all in the ﬁgures. [sent-137, score-0.197]

83 1 show that the proposed approximation yields consistently very 1 The Matlab code used in the experiments is available at http://www. [sent-141, score-0.205]

84 Both values are compared to reference values evaluated by sampling. [sent-148, score-0.252]

85 2, which shows the attained cost as a function of number of sources used. [sent-152, score-0.324]

86 The Taylor approximation shows a decrease in cost as the number of the sources increases even though the true cost is increasing rapidly. [sent-153, score-0.531]

87 4 Discussion The problem of estimating the statistics of a nonlinear transform of a probability distribution is also encountered in nonlinear extensions of Kalman ﬁltering. [sent-155, score-0.557]

88 The Taylor approximation corresponds to extended Kalman ﬁlter and the new approximation can be seen as a modiﬁcation of it with a more accurate linearization. [sent-156, score-0.331]

89 The proposed method is somewhat similar to unscented Kalman ﬁltering based on the unscented transform [13], but much better suited for high dimensional MLP-like nonlinearities. [sent-158, score-0.475]

90 This is not very surprising, as worst case complexity of general Gaussian integration is exponential with respect to the dimensionality of the input [14] and unscented transform as a general method with linear complexity is bound to be less accurate in high dimensional problems. [sent-159, score-0.385]

91 In case of the MLP, the complexity of the unscented transform depends on the number of all weights, which in our case with 20 sources can be more than 2000. [sent-160, score-0.404]

92 5 Conclusions In this paper we have proposed a novel approximation method for unsupervised MLP networks in variational Bayesian learning. [sent-161, score-0.422]

93 The approximation is based on using numerical Gauss–Hermite quadratures to evaluate the global eﬀect of the nonlinear activation function of the network to produce an eﬀective linearization of the MLP. [sent-162, score-0.821]

94 The statistics of the outputs of the linearized network can be evaluated exactly to get accurate and reliable estimates of the statistics of the MLP outputs. [sent-163, score-0.413]

95 These can be used to evaluate the standard variational Bayesian ensemble learning cost function C and numerically minimize it using a hybrid ﬁxed point / conjugate gradient algorithm. [sent-164, score-0.53]

96 We have demonstrated the method with a nonlinear factor analysis model and a real world speech data set. [sent-165, score-0.419]

97 The presented method can be used together with linear ICA for nonlinear BSS [7], and the approximation can be easily applied to more complex models such as nonlinear independent factor analysis [6] and nonlinear state-space models [9]. [sent-167, score-1.028]

98 An unsupervised ensemble learning method for nonlinear dynamic state-space models. [sent-230, score-0.356]

99 Approximating nonlinear transformations of probability distributions for nonlinear independent component analysis. [sent-247, score-0.525]

100 A general method for approximating nonlinear transformations of probability distributions. [sent-257, score-0.286]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mlp', 0.477), ('nonlinear', 0.245), ('quadrature', 0.196), ('erent', 0.185), ('unscented', 0.174), ('taylor', 0.167), ('nats', 0.167), ('variational', 0.165), ('sources', 0.163), ('evaluated', 0.162), ('yi', 0.161), ('di', 0.142), ('approximation', 0.14), ('quadratures', 0.137), ('hermite', 0.119), ('cost', 0.114), ('nfa', 0.11), ('gh', 0.109), ('gauss', 0.109), ('variance', 0.107), ('activation', 0.105), ('variances', 0.098), ('reference', 0.09), ('network', 0.089), ('factor', 0.086), ('sj', 0.085), ('hidden', 0.085), ('underestimation', 0.082), ('valpola', 0.082), ('bayesian', 0.082), ('sub', 0.081), ('ensemble', 0.074), ('neurons', 0.072), ('wk', 0.07), ('transform', 0.067), ('bss', 0.065), ('linearization', 0.065), ('kalman', 0.064), ('aij', 0.064), ('tk', 0.062), ('speech', 0.056), ('derivatives', 0.055), ('abscissas', 0.055), ('honkela', 0.055), ('linearize', 0.055), ('evaluating', 0.055), ('ective', 0.055), ('barber', 0.055), ('conjugate', 0.054), ('accurate', 0.051), ('latent', 0.051), ('cq', 0.048), ('cx', 0.048), ('posterior', 0.048), ('networks', 0.047), ('attained', 0.047), ('mean', 0.047), ('ects', 0.046), ('evaluation', 0.046), ('easier', 0.046), ('nonlinearity', 0.045), ('diagonal', 0.044), ('numerically', 0.044), ('factors', 0.044), ('spectrograms', 0.044), ('linearized', 0.044), ('karhunen', 0.044), ('gaussian', 0.042), ('approximating', 0.041), ('underestimate', 0.041), ('ica', 0.04), ('evaluate', 0.04), ('weights', 0.04), ('gradient', 0.039), ('cp', 0.038), ('inputs', 0.038), ('unsupervised', 0.037), ('worst', 0.036), ('independent', 0.035), ('unstable', 0.035), ('reliable', 0.034), ('covariances', 0.033), ('outputs', 0.033), ('analytically', 0.033), ('proposed', 0.033), ('correlations', 0.033), ('sigmoid', 0.032), ('analysis', 0.032), ('yields', 0.032), ('means', 0.031), ('sample', 0.03), ('bound', 0.03), ('log', 0.03), ('blind', 0.029), ('variables', 0.028), ('gure', 0.028), ('ect', 0.027), ('dimensional', 0.027), ('rbf', 0.027), ('perceptron', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models

Author: Antti Honkela, Harri Valpola

2 0.14287841 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

Author: Changjiang Yang, Ramani Duraiswami, Larry S. Davis

Abstract: The computation and memory required for kernel machines with N training samples is at least O(N 2 ). Such a complexity is signiﬁcant even for moderate size problems and is prohibitive for large datasets. We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N ). We also give an error bound for the approximation, and provide experimental results on the UCI datasets. 1

3 0.13521935 132 nips-2004-Nonlinear Blind Source Separation by Integrating Independent Component Analysis and Slow Feature Analysis

Author: Tobias Blaschke, Laurenz Wiskott

Abstract: In contrast to the equivalence of linear blind source separation and linear independent component analysis it is not possible to recover the original source signal from some unknown nonlinear transformations of the sources using only the independence assumption. Integrating the objectives of statistical independence and temporal slowness removes this indeterminacy leading to a new method for nonlinear blind source separation. The principle of temporal slowness is adopted from slow feature analysis, an unsupervised method to extract slowly varying features from a given observed vectorial signal. The performance of the algorithm is demonstrated on nonlinearly mixed speech data. 1

4 0.11257113 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech

Author: Rasmus K. Olsson, Lars K. Hansen

Abstract: We discuss an identiﬁcation framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing ﬁlters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artiﬁcial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.

5 0.10642462 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

Author: Elizaveta Levina, Peter J. Bickel

Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1

6 0.10593185 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

7 0.088263549 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

8 0.086532921 178 nips-2004-Support Vector Classification with Input Data Uncertainty

9 0.084415987 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

10 0.082045481 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

11 0.079310618 163 nips-2004-Semi-parametric Exponential Family PCA

12 0.077739514 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid

13 0.076949961 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference

14 0.070934139 50 nips-2004-Dependent Gaussian Processes

15 0.070458516 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

16 0.068569794 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

17 0.06691923 82 nips-2004-Incremental Algorithms for Hierarchical Classification

18 0.06669239 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

19 0.064867973 44 nips-2004-Conditional Random Fields for Object Recognition

20 0.064112671 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.232), (1, -0.033), (2, -0.049), (3, -0.039), (4, -0.093), (5, -0.061), (6, 0.083), (7, 0.104), (8, 0.034), (9, -0.009), (10, 0.202), (11, -0.09), (12, 0.028), (13, -0.057), (14, -0.044), (15, 0.025), (16, -0.099), (17, 0.1), (18, 0.065), (19, -0.091), (20, -0.058), (21, -0.125), (22, -0.035), (23, 0.013), (24, -0.031), (25, 0.064), (26, 0.02), (27, -0.03), (28, 0.11), (29, -0.027), (30, -0.021), (31, 0.035), (32, -0.082), (33, 0.041), (34, -0.064), (35, -0.032), (36, -0.078), (37, 0.127), (38, 0.08), (39, -0.127), (40, -0.035), (41, 0.123), (42, 0.041), (43, 0.002), (44, -0.147), (45, -0.176), (46, -0.078), (47, -0.06), (48, 0.064), (49, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95795786 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models

Author: Antti Honkela, Harri Valpola

2 0.59680611 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

Author: Changjiang Yang, Ramani Duraiswami, Larry S. Davis

3 0.57650197 132 nips-2004-Nonlinear Blind Source Separation by Integrating Independent Component Analysis and Slow Feature Analysis

Author: Tobias Blaschke, Laurenz Wiskott

4 0.53464907 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes

Author: Yoshitatsu Matsuda, Kazunori Yamaguchi

Abstract: In this paper, linear multilayer ICA (LMICA) is proposed for extracting independent components from quite high-dimensional observed signals such as large-size natural scenes. There are two phases in each layer of LMICA. One is the mapping phase, where a one-dimensional mapping is formed by a stochastic gradient algorithm which makes more highlycorrelated (non-independent) signals be nearer incrementally. Another is the local-ICA phase, where each neighbor (namely, highly-correlated) pair of signals in the mapping is separated by the MaxKurt algorithm. Because LMICA separates only the highly-correlated pairs instead of all ones, it can extract independent components quite efﬁciently from appropriate observed signals. In addition, it is proved that LMICA always converges. Some numerical experiments verify that LMICA is quite efﬁcient and effective in large-size natural image processing.

5 0.47853351 82 nips-2004-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Andrea Tironi, Luca Zaniboni

Abstract: We study the problem of hierarchical classiﬁcation when labels corresponding to partial and/or multiple paths in the underlying taxonomy are allowed. We introduce a new hierarchical loss function, the H-loss, implementing the simple intuition that additional mistakes in the subtree of a mistaken class should not be charged for. Based on a probabilistic data model introduced in earlier work, we derive the Bayes-optimal classiﬁer for the H-loss. We then empirically compare two incremental approximations of the Bayes-optimal classiﬁer with a ﬂat SVM classiﬁer and with classiﬁers obtained by using hierarchical versions of the Perceptron and SVM algorithms. The experiments show that our simplest incremental approximation of the Bayes-optimal classiﬁer performs, after just one training epoch, nearly as well as the hierarchical SVM classiﬁer (which performs best). For the same incremental algorithm we also derive an H-loss bound showing, when data are generated by our probabilistic data model, exponentially fast convergence to the H-loss of the hierarchical classiﬁer based on the true model parameters. 1 Introduction and basic deﬁnitions We study the problem of classifying data in a given taxonomy of labels, where the taxonomy is speciﬁed as a tree forest. We assume that every data instance is labelled with a (possibly empty) set of class labels called multilabel, with the only requirement that multilabels including some node i in the taxonony must also include all ancestors of i. Thus, each multilabel corresponds to the union of one or more paths in the forest, where each path must start from a root but it can terminate on an internal node (rather than a leaf). Learning algorithms for hierarchical classiﬁcation have been investigated in, e.g., [8, 9, 10, 11, 12, 14, 15, 17, 20]. However, the scenario where labelling includes multiple and partial paths has received very little attention. The analysis in [5], which is mainly theoretical, shows in the multiple and partial path case a 0/1-loss bound for a hierarchical learning algorithm based on regularized least-squares estimates. In this work we extend [5] in several ways. First, we introduce a new hierarchical loss function, the H-loss, which is better suited than the 0/1-loss to analyze hierarchical classiﬁcation tasks, and we derive the corresponding Bayes-optimal classiﬁer under the parametric data model introduced in [5]. Second, considering various loss functions, including the H-loss, we empirically compare the performance of the following three incremental kernel-based ∗ This work was supported in part by the PASCAL Network of Excellence under EC grant no. 506778. This publication only reﬂects the authors’ views. algorithms: 1) a hierarchical version of the classical Perceptron algorithm [16]; 2) an approximation to the Bayes-optimal classiﬁer; 3) a simpliﬁed variant of this approximation. Finally, we show that, assuming data are indeed generated according to the parametric model mentioned before, the H-loss of the algorithm in 3) converges to the H-loss of the classiﬁer based on the true model parameters. Our incremental algorithms are based on training linear-threshold classiﬁers in each node of the taxonomy. A similar approach has been studied in [8], though their model does not consider multiple-path classiﬁcations as we do. Incremental algorithms are the main focus of this research, since we strongly believe that they are a key tool for coping with tasks where large quantities of data items are generated and the classiﬁcation system needs to be frequently adjusted to keep up with new items. However, we found it useful to provide a reference point for our empirical results. Thus we have also included in our experiments the results achieved by nonincremental algorithms. In particular, we have chosen a ﬂat and a hierarchical version of SVM [21, 7, 19], which are known to perform well on the textual datasets considered here. We assume data elements are encoded as real vectors x ∈ Rd which we call instances. A multilabel for an instance x is any subset of the set {1, . . . , N } of all labels/classes, including the empty set. We denote the multilabel associated with x by a vector y = (y1 , . . . , yN ) ∈ {0, 1}N , where i belongs to the multilabel of x if and only if yi = 1. A taxonomy G is a forest whose trees are deﬁned over the set of labels. A multilabel y ∈ {0, 1}N is said to respect a taxonomy G if and only if y is the union of one or more paths in G, where each path starts from a root but need not terminate on a leaf. See Figure 1. We assume the data-generating mechanism produces examples (x, y) such that y respects some ﬁxed underlying taxonomy G with N nodes. The set of roots in G is denoted by root(G). We use par(i) to denote the unique parent of node i, anc(i) to denote the set of ancestors of i, and sub(i) to denote the set of nodes in the subtree rooted at i (including i). Finally, given a predicate φ over a set Ω, we will use {φ} to denote both the subset of Ω where φ is true and the indicator function of this subset. 2 The H-loss Though several hierarchical losses have been proposed in the literature (e.g., in [11, 20]), no one has emerged as a standard yet. Since hierarchical losses are deﬁned over multilabels, we start by considering two very simple functions measuring the discrepancy between multilabels y = (y1 , ..., yN ) and y = (y1 , ..., yN ): the 0/1-loss 0/1 (y, y) = {∃i : yi = yi } and the symmetric difference loss ∆ (y, y) = {y1 = y1 } + . . . + {yN = yN }. There are several ways of making these losses depend on a given taxonomy G. In this work, we follow the intuition “if a mistake is made at node i, then further mistakes made in the subtree rooted at i are unimportant”. That is, we do not require the algorithm be able to make ﬁne-grained distinctions on tasks when it is unable to make coarse-grained ones. For example, if an algorithm failed to label a document with the class SPORTS, then the algorithm should not be charged more loss because it also failed to label the same document with the subclass SOCCER and the sub-subclass CHAMPIONS LEAGUE. A function implementing this intuition is deﬁned by N H (y, y) = i=1 ci {yi = yi ∧ yj = yj , j ∈ anc(i)}, where c1 , . . . , cN > 0 are ﬁxed cost coefﬁcients. This loss, which we call H-loss, can also be described as follows: all paths in G from a root down to a leaf are examined and, whenever we encounter a node i such that yi = yi , we add ci to the loss, whereas all the loss contributions in the subtree rooted at i are discarded. Note that if c1 = . . . = cN = 1 then 0/1 ≤ H ≤ ∆ . Choices of ci depending on the structure of G are proposed in Section 4. Given a multilabel y ∈ {0, 1}N deﬁne its G-truncation as the multilabel y = (y1 , ..., yN ) ∈ {0, 1}N where, for each i = 1, . . . , N , yi = 1 iff yi = 1 and yj = 1 for all j ∈ anc(i). Note that the G-truncation of any multilabel always respects G. A graphical (a) (b) (c) (d) Figure 1: A one-tree forest (repeated four times). Each node corresponds to a class in the taxonomy G, hence in this case N = 12. Gray nodes are included in the multilabel under consideration, white nodes are not. (a) A generic multilabel which does not respect G; (b) its G-truncation. (c) A second multilabel that respects G. (d) Superposition of multilabel (b) on multilabel (c): Only the checked nodes contribute to the H-loss between (b) and (c). representation of the notions introduced so far is given in Figure 1. In the next lemma we show that whenever y respects G, then H (y, y) cannot be smaller than H (y , y). In other words, when the multilabel y to be predicted respects a taxonomy G then there is no loss of generality in restricting to predictions which respect G. Lemma 1 Let G be a taxonomy, y, y ∈ {0, 1}N be two multilabels such that y respects G, and y be the G-truncation of y. Then H (y , y) ≤ H (y, y) . Proof. For each i = 1, . . . , N we show that yi = yi and yj = yj for all j ∈ anc(i) implies yi = yi and yj = yj for all j ∈ anc(i). Pick some i and suppose yi = yi and yj = yj for all j ∈ anc(i). Now suppose yj = 0 (and thus yj = 0) for some j ∈ anc(i). Then yi = 0 since y respects G. But this implies yi = 1, contradicting the fact that the G-truncation y respects G. Therefore, it must be the case that yj = yj = 1 for all j ∈ anc(i). Hence the G-truncation of y left each node j ∈ anc(i) unchanged, implying yj = yj for all j ∈ anc(i). But, since the G-truncation of y does not change the value of a node i whose ancestors j are such that yj = 1, this also implies yi = yi . Therefore yi = yi and the proof is concluded. 3 A probabilistic data model Our learning algorithms are based on the following statistical model for the data, originally introduced in [5]. The model deﬁnes a probability distribution fG over the set of multilabels respecting a given taxonomy G by associating with each node i of G a Bernoulli random variable Yi and deﬁning fG (y | x) = N i=1 P Yi = yi | Ypar(i) = ypar(i) , X = x . To guarantee that fG (y | x) = 0 whenever y ∈ {0, 1}N does not respect G, we set P Yi = 1 | Ypar(i) = 0, X = x = 0. Notice that this deﬁnition of fG makes the (rather simplistic) assumption that all Yk with the same parent node i (i.e., the children of i) are independent when conditioned on Yi and x. Through fG we specify an i.i.d. process {(X 1 , Y 1 ), (X 2 , Y 2 ), . . .}, where, for t = 1, 2, . . ., the multilabel Y t is distributed according to fG (· | X t ) and X t is distributed according to a ﬁxed and unknown distribution D. Each example (xt , y t ) is thus a realization of the corresponding pair (X t , Y t ) of random variables. Our parametric model for fG is described as follows. First, we assume that the support of D is the surface of the d-dimensional unit sphere (i.e., instances x ∈ R d are such that ||x|| = 1). With each node i in the taxonomy, we associate a unit-norm weight vector ui ∈ Rd . Then, we deﬁne the conditional probabilities for a nonroot node i with parent j by P (Yi = 1 | Yj = 1, X = x) = (1 + ui x)/2. If i is a root node, the previous equation simpliﬁes to P (Yi = 1 | X = x) = (1 + ui x)/2. 3.1 The Bayes-optimal classiﬁer for the H-loss We now describe a classiﬁer, called H - BAYES, that is the Bayes-optimal classiﬁer for the H-loss. In other words, H - BAYES classiﬁes any instance x with the multilabel y = argminy∈{0,1} E[ H (¯ , Y ) | x ]. Deﬁne pi (x) = P Yi = 1 | Ypar(i) = 1, X = x . y ¯ When no ambiguity arises, we write pi instead of pi (x). Now, ﬁx any unit-length instance x and let y be a multilabel that respects G. For each node i in G, recursively deﬁne H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ) + k∈child(i) H k,x (y) . The classiﬁer H - BAYES operates as follows. It starts by putting all nodes of G in a set S; nodes are then removed from S one by one. A node i can be removed only if i is a leaf or if all nodes j in the subtree rooted at i have been already removed. When i is removed, its value yi is set to 1 if and only if pi 2 − k∈child(i) H k,x (y)/ci ≥ 1 . (1) (Note that if i is a leaf then (1) is equivalent to yi = {pi ≥ 1/2}.) If yi is set to zero, then all nodes in the subtree rooted at i are set to zero. Theorem 2 For any taxonomy G and all unit-length x ∈ Rd , the multilabel generated by H - BAYES is the Bayes-optimal classiﬁcation of x for the H-loss. Proof sketch. Let y be the multilabel assigned by H - BAYES and y ∗ be any multilabel minimizing the expected H-loss. Introducing the short-hand Ex [·] = E[· | x], we can write Ex H (y, Y )= N i=1 ci (pi (1 − yi ) + (1 − pi )yi ) j∈anc(i) pj {yj = 1} . Note that we can recursively decompose the expected H-loss as Ex H (y, Y )= i∈root(G) where Ex Hi (y, Y ) = ci (pi (1 − yi ) + (1 − pi )yi ) Ex Hi (y, Y ), pj {yj = 1} + j∈anc(i) Ex Hk (y, Y ) . (2) k∈child(i) ∗ Pick a node i. If i is a leaf, then the sum in the RHS of (2) disappears and yi = {pi ≥ 1/2}, ∗ which is also the minimizer of H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ), implying yi = yi . ∗ Now let i be an internal node and inductively assume yj = yj for all j ∈ sub(i). Notice ∗ that the factors j∈anc(i) pj {yj = 1} occur in both terms in the RHS of (2). Hence yi does not depend on these factors and we can equivalently minimize ci (pi (1 − yi ) + (1 − pi )yi ) + pi {yi = 1} k∈child(i) H k,x (y), (3) where we noted that, for each k ∈ child(i), Ex Hk (y, Y ) = j∈anc(i) pj {yj = 1} pi {yi = 1}H k,x (y) . ∗ Now observe that yi minimizing (3) is equivalent to the assignment produced by H - BAYES. ∗ ∗ To conclude the proof, note that whenever yi = 0, Lemma 1 requires that yj = 0 for all nodes j ∈ sub(i), which is exactly what H - BAYES does. 4 The algorithms We consider three incremental algorithms. Each one of these algorithms learns a hierarchical classiﬁer by training a decision function gi : Rd → {0, 1} at each node i = 1, . . . , N . For a given set g1 , . . . , gN of decision functions, the hierarchical classiﬁer generated by these algorithms classiﬁes an instance x through a multilabel y = (y1 , ..., yN ) deﬁned as follows: yi = gi (x) 0 if i ∈ root(G) or yj = 1 for all j ∈ anc(i) otherwise. (4) Note that y computed this way respects G. The classiﬁers (4) are trained incrementally. Let gi,t be the decision function at node i after training on the ﬁrst t − 1 examples. When the next training example (xt , y t ) is available, the algorithms compute the multilabel y t using classiﬁer (4) based on g1,t (xt ), . . . , gN,t (xt ). Then, the algorithms consider for an update only those decision functions sitting at nodes i satisfying either i ∈ root(G) or ypar(i),t = 1. We call such nodes eligible at time t. The decision functions of all other nodes are left unchanged. The ﬁrst algorithm we consider is a simple hierarchical version of the Perceptron algorithm [16], which we call H - PERC. The decision functions at time t are deﬁned by gi,t (xt ) = {wi,t xt ≥ 0}. In the update phase, the Perceptron rule wi,t+1 = wi,t + yi,t xt is applied to every node i eligible at time t and such that yi,t = yi,t . The second algorithm, called APPROX - H - BAYES, approximates the H - BAYES classiﬁer of Section 3.1 by replacing the unknown quantities pi (xt ) with estimates (1+w i,t xt )/2. The weights w i,t are regularized least-squares estimates deﬁned by (i) wi,t = (I + Si,t−1 Si,t−1 + xt xt )−1 Si,t−1 y t−1 . (5) The columns of the matrix Si,t−1 are all past instances xs that have been stored at node i; (i) the s-th component of vector y t−1 is the i-th component yi,s of the multilabel y s associated with instance xs . In the update phase, an instance xt is stored at node i, causing an update of wi,t , whenever i is eligible at time t and |w i,t xt | ≤ (5 ln t)/Ni,t , where Ni,t is the number of instances stored at node i up to time t − 1. The corresponding decision functions gi,t are of the form gi,t (xt ) = {w i,t xt ≥ τi,t }, where the threshold τi,t ≥ 0 at node i depends on the margin values w j,t xt achieved by nodes j ∈ sub(i) — recall (1). Note that gi,t is not a linear-threshold function, as xt appears in the deﬁnition of w i,t . The margin threshold (5 ln t)/Ni,t , controlling the update of node i at time t, reduces the space requirements of the classiﬁer by keeping matrices Si,t suitably small. This threshold is motivated by the work [4] on selective sampling. The third algorithm, which we call H - RLS (Hierarchical Regularized Least Squares), is a simpliﬁed variant of APPROX - H - BAYES in which the thresholds τi,t are set to zero. That is, we have gi,t (xt ) = {w i,t xt ≥ 0} where the weights w i,t are deﬁned as in (5) and updated as in the APPROX - H - BAYES algorithm. Details on how to run APPROX - H - BAYES 2 and H - RLS in dual variables and perform an update at node i in time O(Ni,t ) are found in [3] (where a mistake-driven version of H - RLS is analyzed). 5 Experimental results The empirical evaluation of the algorithms was carried out on two well-known datasets of free-text documents. The ﬁrst dataset consists of the ﬁrst (in chronological order) 100,000 newswire stories from the Reuters Corpus Volume 1, RCV1 [2]. The associated taxonomy of labels, which are the topics of the documents, has 101 nodes organized in a forest of 4 trees. The forest is shallow: the longest path has length 3 and the the distribution of nodes, sorted by increasing path length, is {0.04, 0.53, 0.42, 0.01}. For this dataset, we used the bag-of-words vectorization performed by Xerox Research Center Europe within the EC project KerMIT (see [4] for details on preprocessing). The 100,000 documents were divided into 5 equally sized groups of chronologically consecutive documents. We then used each adjacent pair of groups as training and test set in an experiment (here the ﬁfth and ﬁrst group are considered adjacent), and then averaged the test set performance over the 5 experiments. The second dataset is a speciﬁc subtree of the OHSUMED corpus of medical abstracts [1]: the subtree rooted in “Quality of Health Care” (MeSH code N05.715). After removing overlapping classes (OHSUMED is not quite a tree but a DAG), we ended up with 94 Table 1: Experimental results on two hierarchical text classiﬁcation tasks under various loss functions. We report average test errors along with standard deviations (in parenthesis). In bold are the best performance ﬁgures among the incremental algorithms. RCV1 PERC H - PERC H - RLS AH - BAY SVM H - SVM OHSU. PERC H - PERC H - RLS AH - BAY SVM H - SVM 0/1-loss 0.702(±0.045) 0.655(±0.040) 0.456(±0.010) 0.550(±0.010) 0.482(±0.009) 0.440(±0.008) unif. H-loss 1.196(±0.127) 1.224(±0.114) 0.743(±0.026) 0.815(±0.028) 0.790(±0.023) 0.712(±0.021) norm. H-loss 0.100(±0.029) 0.099(±0.028) 0.057(±0.001) 0.090(±0.001) 0.057(±0.001) 0.055(±0.001) ∆-loss 1.695(±0.182) 1.861(±0.172) 1.086(±0.036) 1.465(±0.040) 1.173(±0.051) 1.050(±0.027) 0/1-loss 0.899(±0.024) 0.846(±0.024) 0.769(±0.004) 0.819(±0.004) 0.784(±0.003) 0.759(±0.002) unif. H-loss 1.938(±0.219) 1.560(±0.155) 1.200(±0.007) 1.197(±0.006) 1.206(±0.003) 1.170(±0.005) norm. H-loss 0.058(±0.005) 0.057(±0.005) 0.045(±0.000) 0.047(±0.000) 0.044(±0.000) 0.044(±0.000) ∆-loss 2.639(±0.226) 2.528(±0.251) 1.957(±0.011) 2.029(±0.009) 1.872(±0.005) 1.910(±0.007) classes and 55,503 documents. We made this choice based only on the structure of the subtree: the longest path has length 4, the distribution of nodes sorted by increasing path length is {0.26, 0.37, 0.22, 0.12, 0.03}, and there are a signiﬁcant number of partial and multiple path multilabels. The vectorization of the subtree was carried out as follows: after tokenization, we removed all stopwords and also those words that did not occur at least 3 times in the corpus. Then, we vectorized the documents using the Bow library [13] with a log(1 + TF) log(IDF) encoding. We ran 5 experiments by randomly splitting the corpus in a training set of 40,000 documents and a test set of 15,503 documents. Test set performances are averages over these 5 experiments. In the training set we kept more documents than in the RCV1 splits since the OHSUMED corpus turned out to be a harder classiﬁcation problem than RCV1. In both datasets instances have been normalized to unit length. We tested the hierarchical Perceptron algorithm (H - PERC), the hierarchical regularized leastsquares algorithm (H - RLS), and the approximated Bayes-optimal algorithm (APPROX - H BAYES ), all described in Section 4. The results are summarized in Table 1. APPROX - H BAYES ( AH - BAY in Table 1) was trained using cost coefﬁcients c i chosen as follows: if i ∈ root(G) then ci = |root(G)|−1 . Otherwise, ci = cj /|child(j)|, where j is the parent of i. Note that this choice of coefﬁcients amounts to splitting a unit cost equally among the roots and then splitting recursively each node’s cost equally among its children. Since, in this case, 0 ≤ H ≤ 1, we call the resulting loss normalized H-loss. We also tested a hierarchical version of SVM (denoted by H - SVM in Table 1) in which each node is an SVM classiﬁer trained using a batch version of our hierarchical learning protocol. More precisely, each node i was trained only on those examples (xt , y t ) such that ypar(i),t = 1 (note that, as no conditions are imposed on yi,t , node i is actually trained on both positive and negative examples). The resulting set of linear-threshold functions was then evaluated on the test set using the hierachical classiﬁcation scheme (4). We tried both the C and ν parametrizations [18] for SVM and found the setting C = 1 to work best for our data. 1 We ﬁnally tested the “ﬂat” variants of Perceptron and SVM, denoted by PERC and SVM. In these variants, each node is trained and evaluated independently of the others, disregarding all taxonomical information. All SVM experiments were carried out using the libSVM implementation [6]. All the tested algorithms used a linear kernel. 1 It should be emphasized that this tuning of C was actually chosen in hindsight, with no crossvalidation. As far as loss functions are concerned, we considered the 0/1-loss, the H-loss with cost coefﬁcients set to 1 (denoted by uniform H-loss), the normalized H-loss, and the symmetric difference loss (denoted by ∆-loss). Note that H - SVM performs best, but our incremental algorithms were trained for a single epoch on the training set. The good performance of SVM (the ﬂat variant of H - SVM ) is surprising. However, with a single epoch of training H - RLS does not perform worse than SVM (except on OHSUMED under the normalized H-loss) and comes reasonably close to H - SVM. On the other hand, the performance of APPROX - H - BAYES is disappointing: on OHSUMED it is the best algorithm only for the uniform H-loss, though it was trained using the normalized H-loss; on RCV1 it never outperforms H - RLS, though it always does better than PERC and H - PERC. A possible explanation for this behavior is that APPROX - H - BAYES is very sensitive to errors in the estimates of pi (x) (recall Section 3.1). Indeed, the least-squares estimates (5), which we used to approximate H - BAYES, seem to work better in practice on simpler (and possibly more robust) algorithms, such as H - RLS. The lower values of normalized H-loss on OHSUMED (a harder corpus than RCV1) can be explained because a quarter of the 94 nodes in the OHSUMED taxonomy are roots, and thus each top-level mistake is only charged about 4/94. As a ﬁnal remark, we observe that the normalized H-loss gave too small a range of values to afford ﬁne comparisons among the best performing algorithms. 6 Regret bounds for the H-loss In this section we prove a theoretical bound on the H-loss of a slight variant of the algorithm H - RLS tested in Section 5. More precisely, we assume data are generated according to the probabilistic model introduced in Section 3 with unknown instance distribution D and unknown coefﬁcients u1 , . . . , uN . We deﬁne the regret of a classiﬁer assigning label y to instance X as E H (y, Y t ) − E H (y, Y ), where the expected value is with respect the random draw of (X, Y ) and y is the multilabel assigned by classiﬁer (4) when the decision functions gi are zero-threshold functions of the form gi (x) = {ui x ≥ 0}. The theorem below shows that the regret of the classiﬁer learned by a variant of H - RLS after t training examples, with t large enough, is exponentially small in t. In other words, H - RLS learns to classify as well as the algorithm that is given the true parameters u1 , . . . , uN of the underlying data-generating process. We have been able to prove the theorem only for the variant of H - RLS storing all instances at each node. That is, every eligible node at time t is updated, irrespective of whether |w i,t xt | ≤ (5 ln t)/Ni,t . Given the i.i.d. data-generating process (X 1 , Y 1 ), (X 2 , Y 2 ), . . ., for each node k we deﬁne the derived process X k1 , X k2 , . . . including all and only the instances X s of the original process that satisfy Ypar(k),s = 1. We call this derived process the process at node k. Note that, for each k, the process at node k is an i.i.d. process. However, its distribution might depend on k. The spectrum of the process at node k is the set of eigenvalues of the correlation matrix with entries E[Xk1 ,i Xk1 ,j ] for i, j = 1, . . . , d. We have the following theorem, whose proof is omitted due to space limitations. Theorem 3 Let G be a taxonomy with N nodes and let fG be a joint density for G parametrized by N unit-norm vectors u1 , . . . , uN ∈ Rd . Assume the instance distribution is such that there exist γ1 , . . . , γN > 0 satisfying P |ui X t | ≥ γi = 1 for i = 1, . . . , N . Then, for all t > max maxi=1,...,N E H (y t , Y t ) −E 16 µ i λ i γi , maxi=1,...,N 192d µi λ 2 i the regret H (y t , Y t ) of the modiﬁed H - RLS algorithm is at most N 2 2 µi t e−κ1 γi λi t µi + t2 e−κ2 λi t µi cj , i=1 j∈sub(i) where κ1 , κ2 are constants, µi = E j∈anc(i) (1 + uj X)/2 eigenvalue in the spectrum of the process at node i. and λi is the smallest 7 Conclusions and open problems In this work we have studied the problem of hierarchical classiﬁcation of data instances in the presence of partial and multiple path labellings. We have introduced a new hierarchical loss function, the H-loss, derived the corresponding Bayes-optimal classiﬁer, and empirically compared an incremental approximation to this classiﬁer with some other incremental and nonincremental algorithms. Finally, we have derived a theoretical guarantee on the H-loss of a simpliﬁed variant of the approximated Bayes-optimal algorithm. Our investigation leaves several open issues. The current approximation to the Bayesoptimal classiﬁer is not satisfying, and this could be due to a bad choice of the model, of the estimators, of the datasets, or of a combination of them. Also, the normalized H-loss is not fully satisfying, since the resulting values are often too small. From the theoretical viewpoint, we would like to analyze the regret of our algorithms with respect to the Bayesoptimal classiﬁer, rather than with respect to a classiﬁer that makes a suboptimal use of the true model parameters. References [1] The OHSUMED test collection. URL: medir.ohsu.edu/pub/ohsumed/. [2] Reuters corpus volume 1. URL: about.reuters.com/researchandstandards/corpus/. [3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. In Proc. 15th COLT, pages 121–137. Springer, 2002. [4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classiﬁers via selective sampling. In Proc. 16th COLT, pages 373–386. Springer, 2003. [5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Regret bounds for hierarchical classiﬁcation with linear-threshold functions. In Proc. 17th COLT. Springer, 2004. To appear. [6] C.-C. Chang and C.-J. Lin. Libsvm — a library for support vector machines. URL: www.csie.ntu.edu.tw/∼cjlin/libsvm/. [7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2001. [8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classiﬁcation. In Proc. 21st ICML. Omnipress, 2004. [9] S.T. Dumais and H. Chen. Hierarchical classiﬁcation of web content. In Proc. 23rd ACM Int. Conf. RDIR, pages 256–263. ACM Press, 2000. [10] M. Granitzer. Hierarchical Text Classiﬁcation using Methods from Machine Learning. PhD thesis, Graz University of Technology, 2003. [11] T. Hofmann, L. Cai, and M. Ciaramita. Learning with taxonomies: Classifying documents and words. In NIPS Workshop on Syntax, Semantics, and Statistics, 2003. [12] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proc. 14th ICML, Morgan Kaufmann, 1997. [13] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classiﬁcation and clustering. URL: www-2.cs.cmu.edu/∼mccallum/bow/. [14] A.K. McCallum, R. Rosenfeld, T.M. Mitchell, and A.Y. Ng. Improving text classiﬁcation by shrinkage in a hierarchy of classes. In Proc. 15th ICML. Morgan Kaufmann, 1998. [15] D. Mladenic. Turning yahoo into an automatic web-page classiﬁer. In Proceedings of the 13th European Conference on Artiﬁcial Intelligence, pages 473–474, 1998. [16] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Review, 65:386–408, 1958. [17] M.E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87–118, 2002. [18] B. Sch¨ lkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. o Neural Computation, 12:1207–1245, 2000. [19] B. Sch¨ lkopf and A. Smola. Learning with kernels. MIT Press, 2002. o [20] A. Sun and E.-P. Lim. Hierarchical text classiﬁcation and evaluation. In Proc. 2001 Int. Conf. Data Mining, pages 521–528. IEEE Press, 2001. [21] V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.

6 0.46982694 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech

7 0.44325998 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

8 0.44280231 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

9 0.4404543 50 nips-2004-Dependent Gaussian Processes

10 0.43391198 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

11 0.3986575 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference

12 0.38226587 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid

13 0.37683702 163 nips-2004-Semi-parametric Exponential Family PCA

14 0.35971016 130 nips-2004-Newscast EM

15 0.35897481 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

16 0.354689 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

17 0.35317928 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification

18 0.33599618 166 nips-2004-Semi-supervised Learning via Gaussian Processes

19 0.33417299 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem

20 0.32946685 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.114), (15, 0.12), (26, 0.035), (31, 0.023), (33, 0.166), (35, 0.018), (39, 0.391), (50, 0.018), (76, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87379724 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models

Author: Antti Honkela, Harri Valpola

2 0.86732489 113 nips-2004-Maximum-Margin Matrix Factorization

Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola

Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-deﬁnite program, and discuss generalization error bounds for them. 1

3 0.82674372 42 nips-2004-Computing regularization paths for learning multiple kernels

Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan

Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classiﬁcation or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1

4 0.69105202 163 nips-2004-Semi-parametric Exponential Family PCA

Author: Sajama Sajama, Alon Orlitsky

Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1

5 0.64984876 70 nips-2004-Following Curved Regularized Optimization Solution Paths

Author: Saharon Rosset

Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized ﬁtting is likely to lead to over-ﬁtted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efﬁcient and highly ﬂexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.

6 0.63699126 124 nips-2004-Multiple Alignment of Continuous Time Series

7 0.62943393 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

8 0.61345589 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

9 0.61277193 132 nips-2004-Nonlinear Blind Source Separation by Integrating Independent Component Analysis and Slow Feature Analysis

10 0.6125381 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming

11 0.60436696 125 nips-2004-Multiple Relational Embedding

12 0.60392523 68 nips-2004-Face Detection --- Efficient and Rank Deficient

13 0.60310179 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

14 0.60156709 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

15 0.59609556 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

16 0.59592438 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations

17 0.59197766 178 nips-2004-Support Vector Classification with Input Data Uncertainty

18 0.59140432 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

19 0.59135401 131 nips-2004-Non-Local Manifold Tangent Learning

20 0.59134674 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons