nips nips2001 nips2001-109 knowledge-graph by maker-knowledge-mining

109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimensions


Source: pdf

Author: Kari Torkkola

Abstract: The marriage of Renyi entropy with Parzen density estimation has been shown to be a viable tool in learning discriminative feature transforms. However, it suffers from computational complexity proportional to the square of the number of samples in the training data. This sets a practical limit to using large databases. We suggest immediate divorce of the two methods and remarriage of Renyi entropy with a semi-parametric density estimation method, such as Gaussian Mixture Models (GMMs). This allows all of the computation to take place in the low-dimensional target space, and it reduces the computational complexity to be proportional to the square of the number of components in the mixtures. Furthermore, a convenient extension to Hidden Markov Models as commonly used in speech recognition becomes possible.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The marriage of Renyi entropy with Parzen density estimation has been shown to be a viable tool in learning discriminative feature transforms. [sent-5, score-0.52]

2 However, it suffers from computational complexity proportional to the square of the number of samples in the training data. [sent-6, score-0.211]

3 We suggest immediate divorce of the two methods and remarriage of Renyi entropy with a semi-parametric density estimation method, such as Gaussian Mixture Models (GMMs). [sent-8, score-0.303]

4 This allows all of the computation to take place in the low-dimensional target space, and it reduces the computational complexity to be proportional to the square of the number of components in the mixtures. [sent-9, score-0.269]

5 Furthermore, a convenient extension to Hidden Markov Models as commonly used in speech recognition becomes possible. [sent-10, score-0.191]

6 1 Introduction Feature selection or feature transforms are important aspects of any pattern recognition system. [sent-11, score-0.39]

7 Optimal feature selection coupled with a particular classifier can be done by actually training and evaluating the classifier using all combinations of available features. [sent-12, score-0.216]

8 Obviously this wrapper strategy does not allow learning feature transforms, because all possible transforms cannot be enumerated. [sent-13, score-0.331]

9 Both feature selection and feature transforms can be learned by evaluating some criterion that reflects the “importance” of a feature or a number of features jointly. [sent-14, score-0.742]

10 This is called the filter configuration in feature selection. [sent-15, score-0.108]

11 These are usually accompanied by a parametric estimation, such as Gaussian, of the densities at hand [6, 12]. [sent-18, score-0.078]

12 Maximizing a particular criterion, the joint mutual information (MI) between the features and the class labels [1, 17, 16, 13], can be shown to minimize the lower bound of the classification error [3, 10, 15]. [sent-21, score-0.389]

13 As a remedy, Principe et al. showed in [4, 11, 10] that using Renyi’s entropy instead of Shannon’s, combined with Parzen density estimation, leads to expressions of mutual information with computational complexity of O(N²), where N is the number of samples in the training set. [sent-24, score-0.557]

14 This method can be formulated to express the mutual information between continuous variables and discrete class labels in order to learn dimension-reducing feature transforms, both linear [15] and non-linear [14], for pattern recognition. [sent-25, score-0.563]

15 One must note that regarding finding the extrema, both definitions of entropy are equivalent (see [7] pages 118,406, and [8] page 325). [sent-26, score-0.146]

16 This formulation of MI evaluates the effect of each sample on every other sample in the transformed space through the Parzen density estimation kernel. [sent-27, score-0.484]

17 Thus large databases are hard to use due to this complexity. [sent-29, score-0.087]

18 In essence, Parzen density estimation is replaced by GMMs. [sent-31, score-0.193]

19 In order to evaluate the MI, evaluating mutual interactions between mixture components of the GMMs suffices, instead of having to evaluate interactions between all pairs of samples. [sent-32, score-0.616]
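
As an illustrative (hypothetical) count, not taken from the paper: with N = 100,000 training samples the Parzen-based criterion needs on the order of N² = 10^10 pairwise kernel evaluations per evaluation of the MI, whereas with, say, 10 classes and 20 mixture components per class the GMM-based criterion needs on the order of (10 × 20)² = 40,000 component-pair interactions.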

20 An approach that maps an output space GMM back to input space and again to output space through the adaptive feature transform is taken. [sent-33, score-0.846]

21 This allows all of the computation to take place in the target low dimensional space. [sent-34, score-0.127]

22 Computational complexity is reduced to be proportional to the square of the number of components in the mixtures. [sent-35, score-0.142]

23 An introduction is given to the maximum mutual information (MMI) formulation for discriminative feature transforms using Renyi entropy and Parzen density estimation. [sent-37, score-0.827]

24 We discuss different strategies to reduce its computational complexity, and we present a formulation based on GMMs. [sent-38, score-0.066]

25 Once that is done, we can perform gradient ascent on I as follows: W ← W + η ∂I/∂W = W + η Σ_i (∂I/∂y_i)(∂y_i/∂W)   (1). [sent-44, score-0.112]

26 Thus Renyi’s quadratic entropy can be computed as a sum of local interactions as defined by the kernel, over all pairs of samples. [sent-47, score-0.248]
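
To make the pairwise structure concrete, the following minimal numpy sketch (our own illustration, not code from the paper; the function names and the kernel width sigma2 are assumptions) computes Renyi's quadratic entropy from a Parzen estimate with an isotropic Gaussian kernel; the double sum over all sample pairs is exactly the O(N²) cost discussed above.

    import numpy as np

    def gaussian_kernel(diff, var):
        # Isotropic Gaussian G(diff, var*I) evaluated for each row of diff.
        dim = diff.shape[1]
        norm = (2.0 * np.pi * var) ** (-dim / 2.0)
        return norm * np.exp(-np.sum(diff * diff, axis=1) / (2.0 * var))

    def renyi_quadratic_entropy(y, sigma2):
        # H2(Y) = -log integral p(y)^2 dy; with a Parzen estimate of width sigma2 the
        # integral reduces to a double sum of Gaussians over all pairs of samples.
        n = y.shape[0]
        diffs = (y[:, None, :] - y[None, :, :]).reshape(n * n, -1)
        information_potential = gaussian_kernel(diffs, 2.0 * sigma2).sum() / (n * n)
        return -np.log(information_potential)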

27 In order to use this convenient property, a measure of mutual information making use of quadratic functions of the densities would be desirable. [sent-48, score-0.367]

28 Expressing the densities as their Parzen estimates with kernel width σ results in expression (4) for the mutual information, each of whose terms is a double sum of Gaussians over pairs of samples. [sent-50, score-0.187]

29 It is now straightforward to derive the partial derivative of the criterion with respect to each transformed sample, which can accordingly be interpreted as an information force that the other samples exert on that sample. [sent-52, score-0.249]

30 The three components of the sum give rise to the following three components of the information force: samples within the same class attract each other, all samples regardless of class attract each other, and samples of different classes repel each other. [sent-53, score-0.669]

31 This force, coupled with the latter factor inside the sum in (1), tends to change the transform in such a way that the samples in the transformed space move in the direction of the information force, and thus increase the MI criterion. [sent-54, score-0.602]
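
A minimal sketch of the resulting update for a linear transform y = Wx (our own notation and learning rate; the rows of the array forces are the information forces ∂I/∂y_i, assumed to have been computed from the pairwise interactions described above):

    def update_transform(W, X, forces, learning_rate=0.1):
        # One gradient-ascent step on the MI criterion for a linear transform y_i = W x_i.
        # Chain rule: dI/dW = sum_i (dI/dy_i) x_i^T, so the per-sample information forces
        # are combined with the samples themselves.
        # Shapes: W (d_out, d_in), X (n, d_in), forces (n, d_out), all numpy arrays.
        grad = forces.T @ X
        return W + learning_rate * grad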

32 Each term in (4) consists of a double sum of Gaussians evaluated using the pairwise distance between the samples. [sent-57, score-0.073]

33 The first component consists of a sum of these interactions within each class, the second of all interactions regardless of class, and the third of a sum of the interactions of each class against all other samples. [sent-58, score-0.417]
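
For reference, the three-term structure described here is commonly written as follows (a sketch based on the Parzen-based formulation of [15]; normalization constants may differ from the paper's equation (4)):

    I(C, Y) ∝   Σ_p Σ_k Σ_l G(y_pk − y_pl, 2σ²I)
              + Σ_p (J_p / N)² Σ_k Σ_l G(y_k − y_l, 2σ²I)
              − 2 Σ_p (J_p / N) Σ_k Σ_l G(y_pk − y_l, 2σ²I)

where y_pk is the k-th transformed sample of class p, J_p the number of samples in class p, N the total number of samples, and G(·, 2σ²I) a Gaussian kernel of twice the Parzen width; in the first term the inner sums run over the samples of class p, in the second over all samples, and in the third one index runs over class p and the other over all samples.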

34 The bulk of computation consists of evaluating these Gaussians, and forming the sums of those. [sent-59, score-0.202]

35 The information force, the gradient of the criterion with respect to the transformed samples, makes use of the same Gaussians, in addition to pairwise differences of the samples [15]. [sent-60, score-0.245]

36 In essence, we are trying to learn a transform that minimizes the class density overlap in the output space while trying to drive each class into a singularity. [sent-64, score-0.757]

37 Since a kernel density estimate results in a sum of kernels over the samples, a divergence measure between the densities necessarily requires O(N²) operations. [sent-65, score-0.242]

38 The only alternatives to reduce this complexity are either to reduce N, the number of samples, or to form simpler density estimates. [sent-66, score-0.235]

39 In this case clustering needs to be performed in the high-dimensional input space, which may be difficult and computationally expensive itself. [sent-68, score-0.118]

40 A transform is then learned to find a representation that discriminates the cluster centers or the random samples belonging to different classes. [sent-69, score-0.38]

41 Details of the densities may be lost, more so with random sampling, but at least this might bring the problem down to a computable level. [sent-70, score-0.078]

42 A GMM is learned in the low-dimensional output space for each class, and now, instead of comparing samples against each other, comparing samples against the components of the GMMs suffices. [sent-72, score-0.562]

43 However, as the parameters of the transform are being learned iteratively, the transformed samples will change at each iteration, and the GMMs need to be estimated again. [sent-73, score-0.25]

44 There is no guarantee that the change to the transform and to the transformed samples is so small that simple re-estimation based on the previous GMMs would suffice. [sent-74, score-0.21]

45 A further step in reducing computation is to compare GMMs of different classes in the output space against each other, instead of comparing the actual samples. [sent-76, score-0.271]

46 Nothing is then being transformed by the transform from the input space to the output space, such that we could change the transform in order to increase the MI criterion. [sent-78, score-0.572]

47 Although it would be possible now to evaluate the effect of each sample on each mixture component, and the effect of each component on the MI through the double summing, we will pursue the mapping strategy outlined in the following section. [sent-79, score-0.232]

48 If the GMM is available in the high-dimensional input space, those models can be directly mapped into the output space by the transform. [sent-81, score-0.252]

49 Writing the density of class p as a GMM with N(x; μ_pk, Σ_pk) as its mixture components and w_pk as their mixture weights, we get p(x | c_p) = Σ_k w_pk N(x; μ_pk, Σ_pk)   (5). We consider now only linear transforms. [sent-83, score-0.468]

50 The transformed density in the low-dimensional output space is then simply p(y | c_p) = Σ_k w_pk N(y; W μ_pk, W Σ_pk Wᵀ)   (6), for the linear transform y = W x. [sent-84, score-0.438]
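
A small sketch of this mapping of GMM parameters under a linear transform (our own illustration; names and shapes are assumptions): each input-space component N(μ, Σ) becomes N(Wμ, WΣWᵀ) in the output space, so only the mixture parameters, never the samples, have to be pushed through the transform.

    import numpy as np

    def map_gmm_to_output_space(W, means, covs, weights):
        # Push an input-space GMM through a linear transform y = W x.
        # Shapes: W (d_out, d_in), means (K, d_in), covs (K, d_in, d_in), weights (K,).
        out_means = means @ W.T                                # each mu -> W mu
        out_covs = np.einsum('ij,kjl,ml->kim', W, covs, W)     # each Sigma -> W Sigma W^T
        return out_means, out_covs, weights                    # mixture weights are unchanged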

51 Now, the mutual information in the output space between the class labels and the densities given as transformed GMMs can be expressed as a function of W, and it will be possible to evaluate its gradient ∂I/∂W to insert into (1). [sent-85, score-0.875]

52 A great advantage of this strategy is that once the input space GMMs have been created (by the EM-algorithm, for example), the actual training data need not be touched at all during optimization! [sent-86, score-0.163]

53 This is thus a very viable approach if the GMMs are already available in the high-dimensional input space (see Section 7), or if it is not too expensive or impossible to estimate them using the EM-algorithm. [sent-87, score-0.206]

54 An alternative is to construct a GMM model for the training data in the low-dimensional output space. [sent-90, score-0.126]

55 Since getting there requires a transform, the GMM is constructed after having transformed the data using, for example, a random or an informed guess as the transform. [sent-91, score-0.11]

56 Running the EM-algorithm in the input space is now unnecessary since we know which samples belong to which mixture components. [sent-93, score-0.349]
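
A sketch of how the corresponding input-space components can then be obtained without running EM in the high-dimensional space (our own illustration; the responsibilities are the component memberships taken from the GMM fitted in the output space):

    import numpy as np

    def input_space_components(X, responsibilities):
        # Weighted means and covariances of the input-space components, reusing the
        # component memberships estimated in the low-dimensional output space.
        # Shapes: X (n, d_in), responsibilities (n, K) with rows summing to one.
        counts = responsibilities.sum(axis=0)                  # effective samples per component
        means = (responsibilities.T @ X) / counts[:, None]
        covs = np.empty((means.shape[0], X.shape[1], X.shape[1]))
        for k in range(means.shape[0]):
            centered = X - means[k]
            covs[k] = (responsibilities[:, k, None] * centered).T @ centered / counts[k]
        return means, covs, counts / counts.sum()              # means, covariances, weights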

57 A similar strategy has been used to learn GMMs in high-dimensional spaces [2]. [sent-94, score-0.134]

58 We use the same form (5) to denote this density also in the input space. [sent-96, score-0.178]

59 As a result, we have GMMs in both spaces and a transform mapping between the two. [sent-97, score-0.276]

60 The transform can be learned as in the IO-mapping, by using the equalities μ → W μ and Σ → W Σ Wᵀ between the input-space and output-space mixture parameters. [sent-98, score-0.311]

61 The biggest advantage is that we now avoid operating in the high-dimensional input space at all, not even the one time at the beginning of the procedure. [sent-100, score-0.126]

62 5 Learning the Transform through Mapped GMMs. We present now the derivation of adaptation equations for a linear transform that apply to either mapping. [sent-102, score-0.21]

63 The first step is to express the MI as a function of the GMM that is constructed in the output space. [sent-103, score-0.153]

64 This GMM is a function of the transform matrix W, through the mapping of the input-space GMM to the output-space GMM. [sent-104, score-0.573]

65 The second step is to compute its gradient and to make use of it in the first half of Equation (1). [sent-105, score-0.079]

66 5.1 Information Potential as a Function of GMMs. The GMM in the output space for each class is already expressed in (7). [sent-107, score-0.295]

67 We need the following equalities: p(c_p, y) = P_p p(y | c_p), where P_p denotes the class prior, and p(y) = Σ_p P_p p(y | c_p). [sent-108, score-0.094]

69 5.2 Gradient of the Information Potential. As each Gaussian mixture component is now a function of the corresponding input-space component and the transform matrix W, it is straightforward (albeit tedious) to write the gradient of each output-space component with respect to W. [sent-118, score-0.508]

70 Since each of the three terms in the information potential is composed of different sums of Gaussians, we need its gradient in the two forms (8) and (9). [sent-119, score-0.116]

71 Here the input-space GMM parameters are substituted through the equalities μ → W μ and Σ → W Σ Wᵀ. [sent-122, score-0.218]

72 Equation (10), with these substitutions, expresses the convolution of two mixture components in the output space. [sent-123, score-0.371]
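
The identity behind this step is that the overlap integral of two Gaussians is itself a Gaussian evaluated at the difference of the means, ∫ N(y; μ1, Σ1) N(y; μ2, Σ2) dy = N(μ1 − μ2; 0, Σ1 + Σ2). A minimal sketch (our own, not the paper's code) of the table of componentwise convolutions between two output-space mixtures:

    import numpy as np

    def gaussian_overlap(m1, S1, m2, S2):
        # Closed-form overlap of N(y; m1, S1) and N(y; m2, S2):
        # a Gaussian with covariance S1 + S2 evaluated at m1 - m2.
        d = m1 - m2
        S = S1 + S2
        dim = d.shape[0]
        norm = 1.0 / np.sqrt(((2.0 * np.pi) ** dim) * np.linalg.det(S))
        return norm * np.exp(-0.5 * d @ np.linalg.solve(S, d))

    def componentwise_convolutions(means_a, covs_a, means_b, covs_b):
        # All pairwise component overlaps between two output-space GMMs; the three terms
        # of the MI are assembled from weighted sums over this small table.
        return np.array([[gaussian_overlap(ma, Sa, mb, Sb)
                          for mb, Sb in zip(means_b, covs_b)]
                         for ma, Sa in zip(means_a, covs_a)])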

73 As we also have those components in the high-dimensional input space, the gradient expresses how this convolution in the output space changes as the transform W, which maps the mixture components to the output space, is being changed. [sent-124, score-0.763]

74 The mutual information measure is defined in terms of these convolutions, and maximizing it tends to find a transform that (crudely stated) minimizes these convolutions between classes and maximizes them within classes. [sent-125, score-0.3]

75 The desired gradient of the Gaussian with respect to the transform matrix W is given in (11). [sent-126, score-0.289]

76 The total gradient can now be obtained simply by replacing the Gaussians in (8) and (9) by the above gradient (11). [sent-127, score-0.079]

77 In evaluating the criterion, the bulk of the computation is in evaluating the componentwise convolutions. [sent-128, score-0.239]

78 In addition, the gradient requires pairwise sums and differences of the mixture parameters in the input space, but these need only be computed once. [sent-130, score-0.217]

79 6 Empirical Results. The first step in evaluating this approach is to compare its performance to the computationally more expensive MMI feature transforms that use Parzen density estimation. [sent-131, score-0.534]

80 To this end, we repeated the pattern recognition experiments of [15] using exactly the same LVQ-classifier. [sent-132, score-0.062]

81 These experiments were done using five publicly available databases that are very different in terms of the amount of data, dimension of data, and the number of training instances. [sent-133, score-0.137]

82 OIO-mapping was used with 3-5 diagonal Gaussians per class to learn a dimension-reducing linear transform. [sent-135, score-0.125]

83 As a figure of the overall performance, the average over all five databases and all reduced dimensions, which ranged from one up to the original dimension minus one, was 69. [sent-139, score-0.137]

84 For LDA this figure cannot be calculated, since some databases had a small number of classes and LDA can only produce at most one fewer features than the number of classes. [sent-143, score-0.087]

85 7 Discussion. We have presented a method to learn discriminative feature transforms using Maximum Mutual Information as the criterion. [sent-146, score-0.393]

86-88 [Results tables: classification accuracy of PCA, LDA, MMI-Parzen, and MMI-GMM as a function of output dimension on the five databases.]

89 Using Gaussian Mixture Models as a semi-parametric density estimation method allows all of the computation to take place in the low-dimensional transform space. [sent-271, score-0.466]

90 Compared to the previous formulation using Parzen density estimation, large databases now become a possibility. [sent-272, score-0.252]

91 A convenient extension to Hidden Markov Models (HMMs) as commonly used in speech recognition also becomes possible. [sent-273, score-0.191]

92 Given an HMM-based speech recognition system, the state discrimination can be enhanced by learning a linear transform from some high-dimensional collection of features to a convenient dimension. [sent-274, score-0.445]

93 Existing HMMs can be converted to these high-dimensional features using so-called single-pass retraining (compute all probabilities using the current features, but do the re-estimation using the high-dimensional set of features). [sent-275, score-0.088]

94 Now a state-discriminative transform to a lower dimension can be learned using the method presented in this paper. [sent-276, score-0.3]

95 Another round of single-pass retraining then converts existing HMMs to new discriminative features. [sent-277, score-0.112]

96 A further advantage of the method in speech recognition is that the state separation in the transformed output space is measured in terms of the separability of the data represented as Gaussian mixtures, not in terms of the data itself (actual samples). [sent-278, score-0.437]

97 This should be advantageous regarding recognition accuracies since HMMs have the same exact structure. [sent-279, score-0.062]

98 Using mutual information for selecting features in supervised neural net learning. [sent-282, score-0.234]

99 Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. [sent-323, score-0.102]

100 Minimum bayes error feature selection for continuous speech recognition. [sent-344, score-0.206]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gmm', 0.406), ('gmms', 0.357), ('renyi', 0.247), ('transform', 0.21), ('parzen', 0.196), ('mutual', 0.19), ('transforms', 0.186), ('mi', 0.186), ('lda', 0.162), ('samples', 0.13), ('density', 0.127), ('output', 0.126), ('lvq', 0.119), ('entropy', 0.11), ('transformed', 0.11), ('bt', 0.109), ('feature', 0.108), ('class', 0.094), ('mixture', 0.093), ('databases', 0.087), ('force', 0.085), ('kari', 0.082), ('principe', 0.082), ('gradient', 0.079), ('densities', 0.078), ('space', 0.075), ('evaluating', 0.074), ('mmi', 0.072), ('pca', 0.068), ('discriminative', 0.068), ('interactions', 0.067), ('estimation', 0.066), ('convenient', 0.065), ('speech', 0.064), ('classi', 0.062), ('recognition', 0.062), ('components', 0.061), ('labels', 0.061), ('equalities', 0.061), ('gaussians', 0.059), ('july', 0.058), ('hmms', 0.058), ('shannon', 0.057), ('attract', 0.055), ('bulk', 0.055), ('heteroscedastic', 0.055), ('remedy', 0.055), ('torkkola', 0.055), ('convolution', 0.055), ('complexity', 0.052), ('input', 0.051), ('rs', 0.05), ('dimension', 0.05), ('regardless', 0.048), ('convolutions', 0.048), ('features', 0.044), ('retraining', 0.044), ('ax', 0.042), ('viable', 0.041), ('fx', 0.041), ('learned', 0.04), ('accuracy', 0.04), ('criterion', 0.04), ('expensive', 0.039), ('discriminant', 0.038), ('formulation', 0.038), ('sums', 0.037), ('strategy', 0.037), ('sum', 0.037), ('fisher', 0.037), ('expresses', 0.036), ('bhattacharyya', 0.036), ('mapping', 0.036), ('computation', 0.036), ('dimensional', 0.036), ('usa', 0.036), ('pairwise', 0.036), ('pages', 0.036), ('sample', 0.034), ('quadratic', 0.034), ('classes', 0.034), ('selection', 0.034), ('ascent', 0.033), ('essence', 0.033), ('evaluate', 0.032), ('learn', 0.031), ('spaces', 0.03), ('york', 0.029), ('dimensions', 0.029), ('table', 0.029), ('square', 0.029), ('low', 0.028), ('maximizing', 0.028), ('june', 0.028), ('reduce', 0.028), ('clustering', 0.028), ('theoretic', 0.028), ('express', 0.027), ('potential', 0.027), ('place', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimensions

Author: Kari Torkkola

Abstract: The marriage of Renyi entropy with Parzen density estimation has been shown to be a viable tool in learning discriminative feature transforms. However, it suffers from computational complexity proportional to the square of the number of samples in the training data. This sets a practical limit to using large databases. We suggest immediate divorce of the two methods and remarriage of Renyi entropy with a semi-parametric density estimation method, such as Gaussian Mixture Models (GMMs). This allows all of the computation to take place in the low-dimensional target space, and it reduces the computational complexity to be proportional to the square of the number of components in the mixtures. Furthermore, a convenient extension to Hidden Markov Models as commonly used in speech recognition becomes possible.

2 0.1594955 58 nips-2001-Covariance Kernels from Bayesian Generative Models

Author: Matthias Seeger

Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1

3 0.11092149 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

4 0.097915851 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama

Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs). 1

5 0.095994644 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models

Author: John R. Hershey, Michael Casey

Abstract: It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factori ally combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information. 1

6 0.092655629 172 nips-2001-Speech Recognition using SVMs

7 0.092493325 60 nips-2001-Discriminative Direction for Kernel Classifiers

8 0.087968171 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

9 0.086137757 107 nips-2001-Latent Dirichlet Allocation

10 0.082856297 43 nips-2001-Bayesian time series classification

11 0.081191845 74 nips-2001-Face Recognition Using Kernel Methods

12 0.078657329 61 nips-2001-Distribution of Mutual Information

13 0.078636147 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

14 0.076765366 46 nips-2001-Categorization by Learning and Combining Object Parts

15 0.073598579 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

16 0.072936855 134 nips-2001-On Kernel-Target Alignment

17 0.072540306 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

18 0.07228402 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

19 0.071798056 164 nips-2001-Sampling Techniques for Kernel Methods

20 0.071715787 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.245), (1, 0.069), (2, -0.048), (3, -0.04), (4, -0.095), (5, 0.047), (6, 0.049), (7, -0.007), (8, 0.003), (9, -0.032), (10, 0.064), (11, -0.039), (12, -0.021), (13, -0.071), (14, -0.107), (15, -0.018), (16, -0.015), (17, -0.006), (18, -0.094), (19, 0.011), (20, -0.028), (21, 0.093), (22, -0.033), (23, 0.012), (24, 0.066), (25, 0.026), (26, 0.114), (27, 0.151), (28, -0.029), (29, 0.059), (30, -0.196), (31, 0.024), (32, -0.08), (33, -0.161), (34, -0.082), (35, 0.076), (36, 0.031), (37, 0.042), (38, -0.143), (39, 0.053), (40, 0.04), (41, -0.02), (42, 0.031), (43, -0.041), (44, 0.063), (45, -0.02), (46, -0.054), (47, 0.164), (48, -0.039), (49, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94788408 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimensions

Author: Kari Torkkola

Abstract: The marriage of Renyi entropy with Parzen density estimation has been shown to be a viable tool in learning discriminative feature transforms. However, it suffers from computational complexity proportional to the square of the number of samples in the training data. This sets a practical limit to using large databases. We suggest immediate divorce of the two methods and remarriage of Renyi entropy with a semi-parametric density estimation method, such as Gaussian Mixture Models (GMMs). This allows all of the computation to take place in the low-dimensional target space, and it reduces the computational complexity to be proportional to the square of the number of components in the mixtures. Furthermore, a convenient extension to Hidden Markov Models as commonly used in speech recognition becomes possible.

2 0.6159091 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

Author: Qi Zhang, Sally A. Goldman

Abstract: We present a new multiple-inst ance (MI) learning technique (EMDD) that combines EM with the diverse density (DD) algorithm. EM-DD is a general-purpose MI algorithm that can be applied with boolean or real-value labels and makes real-value predictions. On the boolean Musk benchmarks, the EM-DD algorithm without any tuning significantly outperforms all previous algorithms. EM-DD is relatively insensitive to the number of relevant attributes in the data set and scales up well to large bag sizes. Furthermore, EMDD provides a new framework for MI learning, in which the MI problem is converted to a single-instance setting by using EM to estimate the instance responsible for the label of the bag. 1

3 0.5553441 58 nips-2001-Covariance Kernels from Bayesian Generative Models

Author: Matthias Seeger

Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1

4 0.51645291 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller

Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called

5 0.50899911 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

6 0.48365441 60 nips-2001-Discriminative Direction for Kernel Classifiers

7 0.47403812 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets

8 0.45173392 155 nips-2001-Quantizing Density Estimators

9 0.43573999 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

10 0.43253925 74 nips-2001-Face Recognition Using Kernel Methods

11 0.4214614 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

12 0.40384296 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

13 0.39705798 172 nips-2001-Speech Recognition using SVMs

14 0.39396372 190 nips-2001-Thin Junction Trees

15 0.38341659 61 nips-2001-Distribution of Mutual Information

16 0.38329449 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

17 0.38021067 143 nips-2001-PAC Generalization Bounds for Co-training

18 0.37641945 43 nips-2001-Bayesian time series classification

19 0.37343603 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

20 0.36885715 111 nips-2001-Learning Lateral Interactions for Feature Binding and Sensory Segmentation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.023), (17, 0.023), (19, 0.406), (27, 0.097), (30, 0.058), (38, 0.016), (59, 0.053), (72, 0.071), (79, 0.036), (91, 0.142)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99157536 93 nips-2001-Incremental A*

Author: S. Koenig, M. Likhachev

Abstract: Incremental search techniques find optimal solutions to series of similar search tasks much faster than is possible by solving each search task from scratch. While researchers have developed incremental versions of uninformed search methods, we develop an incremental version of A*. The first search of Lifelong Planning A* is the same as that of A* but all subsequent searches are much faster because it reuses those parts of the previous search tree that are identical to the new search tree. We then present experimental results that demonstrate the advantages of Lifelong Planning A* for simple route planning tasks. 1 Overview Artificial intelligence has investigated knowledge-based search techniques that allow one to solve search tasks in large domains. Most of the research on these methods has studied how to solve one-shot search problems. However, search is often a repetitive process, where one needs to solve a series of similar search tasks, for example, because the actual situation turns out to be slightly different from the one initially assumed or because the situation changes over time. An example for route planning tasks are changing traffic conditions. Thus, one needs to replan for the new situation, for example if one always wants to display the least time-consuming route from the airport to the conference center on a web page. In these situations, most search methods replan from scratch, that is, solve the search problems independently. Incremental search techniques share with case-based planning, plan adaptation, repair-based planning, and learning search-control knowledge the property that they find solutions to series of similar search tasks much faster than is possible by solving each search task from scratch. Incremental search techniques, however, differ from the other techniques in that the quality of their solutions is guaranteed to be as good as the quality of the solutions obtained by replanning from scratch. Although incremental search methods are not widely known in artificial intelligence and control, different researchers have developed incremental search versions of uninformed search methods in the algorithms literature. An overview can be found in [FMSN00]. We, on the other hand, develop an incremental version of A*, thus combining ideas from the algorithms literature and the artificial intelligence literature. We call the algorithm Lifelong Planning A* (LPA*), in analogy to “lifelong learning” [Thr98], because it reuses £ We thank Anthony Stentz for his support. The Intelligent Decision-Making Group is partly supported by NSF awards under contracts IIS9984827, IIS-0098807, and ITR/AP-0113881. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsoring organizations and agencies or the U.S. government. information from previous searches. LPA* uses heuristics to focus the search and always finds a shortest path for the current edge costs. The first search of LPA* is the same as that of A* but all subsequent searches are much faster. LPA* produces at least the search tree that A* builds. However, it achieves a substantial speedup over A* because it reuses those parts of the previous search tree that are identical to the new search tree. 
2 The Route Planning Task Lifelong Planning A* (LPA*) solves the following search task: It applies to finite graph search problems on known graphs whose edge costs can increase or decrease over time. denotes the finite set of vertices of the graph. denotes the set of successors of vertex . Similarly, denotes the set of predecessors of vertex . denotes the cost of moving from vertex to vertex . LPA* always determines a shortest path from a given start vertex to a given goal vertex , knowing both the topology of the graph and the current edge costs. We use to denote the start distance of vertex , that is, the length of a shortest path from to .      ¨     ¨¦ £ £ ¡ ©§¥¤¢     FP HFE TSRQIGD¨  ¨¦ £ £ ¡    4 ©D¥CBA@!¨ ¨     ¨¦

2 0.97709262 124 nips-2001-Modeling the Modulatory Effect of Attention on Human Spatial Vision

Author: Laurent Itti, Jochen Braun, Christof Koch

Abstract: We present new simulation results , in which a computational model of interacting visual neurons simultaneously predicts the modulation of spatial vision thresholds by focal visual attention, for five dual-task human psychophysics experiments. This new study complements our previous findings that attention activates a winnertake-all competition among early visual neurons within one cortical hypercolumn. This

3 0.96006316 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

Author: Shun-ichi Amari, Hyeyoung Park, Tomoko Ozeki

Abstract: Singularities are ubiquitous in the parameter space of hierarchical models such as multilayer perceptrons. At singularities, the Fisher information matrix degenerates, and the Cramer-Rao paradigm does no more hold, implying that the classical model selection theory such as AIC and MDL cannot be applied. It is important to study the relation between the generalization error and the training error at singularities. The present paper demonstrates a method of analyzing these errors both for the maximum likelihood estimator and the Bayesian predictive distribution in terms of Gaussian random fields, by using simple models. 1

4 0.91043586 119 nips-2001-Means, Correlations and Bounds

Author: Martijn Leisink, Bert Kappen

Abstract: The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e. a subset of [-1 , 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results. 1

same-paper 5 0.89768153 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimensions

Author: Kari Torkkola

Abstract: The marriage of Renyi entropy with Parzen density estimation has been shown to be a viable tool in learning discriminative feature transforms. However, it suffers from computational complexity proportional to the square of the number of samples in the training data. This sets a practical limit to using large databases. We suggest immediate divorce of the two methods and remarriage of Renyi entropy with a semi-parametric density estimation method, such as Gaussian Mixture Models (GMMs). This allows all of the computation to take place in the low-dimensional target space, and it reduces the computational complexity to be proportional to the square of the number of components in the mixtures. Furthermore, a convenient extension to Hidden Markov Models as commonly used in speech recognition becomes possible.

6 0.63583434 1 nips-2001-(Not) Bounding the True Error

7 0.59227854 13 nips-2001-A Natural Policy Gradient

8 0.58880615 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning

9 0.58208072 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs

10 0.5710001 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks

11 0.56991196 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

12 0.5592677 68 nips-2001-Entropy and Inference, Revisited

13 0.55455053 155 nips-2001-Quantizing Density Estimators

14 0.55073869 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

15 0.54881364 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

16 0.54793501 114 nips-2001-Learning from Infinite Data in Finite Time

17 0.5460099 36 nips-2001-Approximate Dynamic Programming via Linear Programming

18 0.5458467 54 nips-2001-Contextual Modulation of Target Saliency

19 0.54439485 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

20 0.54351979 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data