nips nips2002 nips2002-53 knowledge-graph by maker-knowledge-mining

53 nips-2002-Clustering with the Fisher Score


Source: pdf

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) has been increasingly used as a feature extractor for classification problems. The Fisher score is the vector of parameter derivatives of the log-likelihood of a probabilistic model. This paper gives a theoretical analysis of how class information is preserved in the space of the Fisher score; it turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit the important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Recently the Fisher score (or the Fisher kernel) has been increasingly used as a feature extractor for classification problems. [sent-10, score-0.47]

2 The Fisher score is the vector of parameter derivatives of the log-likelihood of a probabilistic model. [sent-11, score-0.31]

3 This paper gives a theoretical analysis of how class information is preserved in the space of the Fisher score; it turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. [sent-12, score-1.192]

4 When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. [sent-13, score-0.31]

5 So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. [sent-14, score-0.295]

6 This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences). [sent-15, score-0.108]

7 Among them, discrete data such as biological sequences [2] are especially challenging, because efficient clustering algorithms e. [sent-17, score-0.274]

8 In such cases, one naturally considers mapping the data to a vector space and performing clustering there. [sent-20, score-0.229]

9 Recently, the Fisher score has been successfully applied as a feature extractor in supervised classification [5, 15, 14, 13, 16]. [sent-22, score-0.537]

10 The Fisher score is derived as follows: Let us assume that a probabilistic model is available. [sent-23, score-0.31]

11 Given a parameter estimate θ̂ from training samples, the Fisher score vector is obtained as f(x) = ∇_θ log p(x; θ) evaluated at θ = θ̂, i.e. the gradient of the per-sample log-likelihood at the estimated parameters. [sent-24, score-0.284]
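
As a concrete (hypothetical, not from the paper) illustration of this definition, the sketch below computes per-sample Fisher scores under a single 1-D Gaussian model, using the analytic derivatives of the log-likelihood with respect to the mean and standard deviation.

import numpy as np

def gaussian_fisher_score(x, mu, sigma):
    # Fisher score under N(mu, sigma^2): gradient of the per-sample
    # log-likelihood with respect to (mu, sigma).
    d_mu = (x - mu) / sigma**2
    d_sigma = -1.0 / sigma + (x - mu)**2 / sigma**3
    return np.column_stack([d_mu, d_sigma])          # shape (n, 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=500)
mu_hat, sigma_hat = x.mean(), x.std()                # parameter estimate from the samples
scores = gaussian_fisher_score(x, mu_hat, sigma_hat)
print(scores.shape)                                  # (500, 2): one score vector per sample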

12 For successful clustering with the Fisher score, one has to investigate how original classes are mapped into the feature space, and select a proper clustering algorithm. [sent-27, score-0.519]

13 In this paper, it will be claimed that the Fisher score has only a few dimensions which contain the class information and many unnecessary nuisance dimensions. [sent-28, score-1.079]

14 So K-Means type clustering [6] is obviously inappropriate because it takes all dimensions into account. [sent-29, score-0.452]

15 We will propose a clustering method specialized for the Fisher score, which exploits important dimensions with class information. [sent-30, score-0.495]

16 This method is learned with an efficient EM-like alternating procedure and has the favorable property that the resulting clusters are invariant under any invertible linear transformation. [sent-31, score-0.219]

17 Denote by the domain of objects (discrete or continuous) and by the set of class labels. [sent-34, score-0.058]

18 Let be the underlying joint distribution and assume that the class distributions are well separated, i. [sent-36, score-0.058]

19 First of all, let us assume that the marginal distribution is known. [sent-39, score-0.112]

20 Then the problem is how to find a good feature extractor, which can preserve class information, based on the prior knowledge of the marginal distribution. [sent-40, score-0.119]

21 This problem is by no means trivial, since it is in general hard to infer anything about the conditional distribution from the marginal without additional assumptions [12]. [sent-42, score-0.11]

22 A loss function for feature extraction: In order to investigate how the cluster structure is preserved, we first have to define what the class information is. [sent-44, score-0.268]

23 We regard the class information as completely preserved if a set of predictors in the feature space can recover the true posterior probabilities. [sent-45, score-0.198]

24 This view makes sense, because it is impossible to recover the posteriors when the classes are totally mixed up. [sent-46, score-0.111]

25 As a predictor of posterior probability in the feature space, we adopt the simplest one, i. [sent-47, score-0.061]

26 Thus the loss of the feature extractor for each class is defined by how well the corresponding posterior predictor recovers the true posterior. [sent-51, score-0.275]

27 where the expectation is taken with respect to the true marginal distribution; the overall loss is just the sum over all classes. [sent-52, score-0.112]

28 Even when the full class information is preserved, [sent-55, score-0.089]

29 clustering in the feature space may not be easy, because of nuisance dimensions which do not contribute to clustering at all. [sent-57, score-1.234]

30 The posterior predictors make use of an at most dimensional subspace out of the -dimensional Fisher score, and the complementary subspace may not have any information about classes. [sent-58, score-0.063]

31 K-means type methods [6] assume a cluster to be hyperspherical, which means that every dimension should contribute to cluster discrimination. [sent-59, score-0.244]

32 When nuisance dimensions cannot be excluded, we will need a different clustering method that is robust to nuisance dimensions. [sent-61, score-1.469]

33 First, a simple but unrealistic example is shown to achieve the minimum loss without producing nuisance dimensions at all. [sent-65, score-0.691]

34 Let us assume that the marginal distribution is determined as a mixture model of the true class distributions, p(x; α) = Σ_k α_k p(x | y = k). [sent-66, score-0.129]

35 where the true marginal distribution (2. [sent-67, score-0.112]

36 Lemma 1: When the Fisher score is derived at the true parameter, it achieves the minimum loss. (Proof) To prove the lemma, it is sufficient to show the existence of a matrix and a vector such that the predictors recover the true posteriors. [sent-75, score-0.688]
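
As a rough numerical illustration of the lemma (under the assumption, made here only for concreteness, that the differentiated parameters are the mixing weights of the class mixture), the derivative of log p(x) = log Σ_k α_k p_k(x) with respect to α_k is p_k(x)/p(x), so the true posterior P(k|x) = α_k p_k(x)/p(x) is an exactly linear function of the Fisher score. The sketch below checks this for a two-component 1-D Gaussian mixture.

import numpy as np
from scipy.stats import norm

# Two known class densities p_k and true mixing weights alpha_k.
alpha = np.array([0.3, 0.7])
comps = [norm(loc=-3.0, scale=1.0), norm(loc=3.0, scale=1.0)]

x = np.linspace(-6.0, 6.0, 201)
dens = np.column_stack([c.pdf(x) for c in comps])    # p_k(x), shape (201, 2)
marginal = dens @ alpha                              # p(x) = sum_k alpha_k p_k(x)

# Fisher score w.r.t. the mixing weights: d/d alpha_k log p(x) = p_k(x) / p(x).
score = dens / marginal[:, None]

# Posterior by Bayes' rule, and its linear recovery from the score.
posterior = alpha * dens / (alpha * dens).sum(axis=1, keepdims=True)
assert np.allclose(alpha * score, posterior)         # exact up to rounding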

37 Loose Models and Nuisance Dimensions: We assumed that the marginal distribution is known, but we still do not know the true class distributions. [sent-80, score-0.093]

38 In the following, the result of Lemma 1 is relaxed to a more general class of probability models by means of the chain rule of derivatives. [sent-82, score-0.108]

39 However, in this case, we have to pay the price: nuisance dimensions. [sent-83, score-0.549]

40 Let M denote the model manifold; the question now is how to determine a manifold such that the class information is fully extracted, which is answered by the following theorem. [sent-86, score-0.066]

41 Assume that the true distribution is contained in the model manifold and is realized at the true parameter. [sent-88, score-0.094]

42 If the tangent space of the model at the true parameter contains the tangent space of the mixture family of true class distributions at the same point (Fig. [sent-89, score-0.146]

43 1), then the Fisher score derived from the model fully extracts the class information. [sent-90, score-0.284]

44 (Proof) To prove the theorem, it is sufficient to show the existence of a matrix and a vector such that the posteriors can be recovered; when the tangent space of the mixture family is contained in that of the model, this follows from Lemma 1 by the chain rule. [sent-95, score-0.182]

45 Figure 2: Feature space constructed by the Fisher score from samples with two distinct clusters. [sent-97, score-0.317]

46 One axis corresponds to a nuisance dimension and the other to an important dimension. [sent-98, score-0.583]

47 Figure 1: Information-geometric picture of a probabilistic model whose Fisher score can fully extract the class information. [sent-100, score-0.423]

48 When the tangent space of the mixture family is contained in that of the model, the Fisher score can fully extract the class information. [sent-101, score-0.494]

49 In determining the model manifold, we face the following dilemma: for capturing the important dimensions (i.e. [sent-105, score-0.176]

50 the tangent space of the mixture family), the number of parameters should be sufficiently large. [sent-107, score-0.073]

51 But a large number of parameters leads to many nuisance dimensions, which are harmful for clustering in the feature space. [sent-108, score-0.885]

52 However, in supervised scenarios, the existence of nuisance dimensions is not a serious problem, because advanced supervised classifiers such as the support vector machine have a built-in feature selector [7]. [sent-110, score-0.936]

53 However in unsupervised scenarios without class labels, it is much more difficult to ignore nuisance dimensions. [sent-111, score-0.662]

54 Fig. 2 shows what the feature space looks like when the number of clusters is two and only one nuisance dimension is involved. [sent-113, score-0.752]

55 Projected on the important dimension, clusters will be concentrated into two distinct points. [sent-114, score-0.153]

56 However, when the Euclidean distance is adopted as in K-Means, it is difficult to recover the true clusters because the two “lines” are close to each other. [sent-115, score-0.228]
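
This failure mode is easy to reproduce on synthetic data. The sketch below (an illustration, not the paper's experiment) builds two clusters that are well separated along one important coordinate but spread along a long nuisance coordinate; K-Means on both coordinates scores a low Adjusted Rand Index, while clustering on the important coordinate alone recovers the labels.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
n = 200
labels = rng.integers(0, 2, size=n)

important = labels + 0.05 * rng.normal(size=n)       # two tight clusters near 0 and 1
nuisance = 20.0 * rng.uniform(size=n)                # long, uninformative direction
X = np.column_stack([nuisance, important])

km_all = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
km_imp = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(important.reshape(-1, 1))
print(adjusted_rand_score(labels, km_all))           # low: the nuisance axis dominates
print(adjusted_rand_score(labels, km_imp))           # close to 1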

57 3 Clustering Algorithm for the Fisher Score: In this section, we will develop a new clustering algorithm for the Fisher score. [sent-116, score-0.513]

58 Let a set of class labels be assigned to the samples. [sent-117, score-0.058]

59 As mentioned before, the purpose of clustering is to obtain these labels; in clustering with the Fisher score, it is necessary to capture the important dimensions. [sent-119, score-0.492]

60 However, from the last section's analysis, we know more than non-Gaussianity about the important dimensions of the Fisher score. [sent-123, score-0.176]

61 There are bases in the space of the Fisher score such that the samples in each cluster are projected close to 1 on the corresponding basis and the others are projected close to 0. [sent-135, score-0.261]

62 The objective function of our clustering algorithm is designed to detect such bases. [sent-136, score-0.259]

63 When the linear transformation of the feature space is set unfavorably, K-Means can end up with a false result which may not reflect the underlying structure. [sent-142, score-0.056]
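
The paper's exact objective function is not reproduced here, so the following is only a plausible sketch of an EM-like alternating procedure consistent with the description above: fit one direction per cluster by least squares so that members project near 1 and non-members near 0, then reassign each sample to the cluster with the largest projection. Because least-squares fitted values depend only on the span of the features, such a scheme is also invariant under invertible linear transformations, in the spirit of the property claimed earlier, but it should not be read as the paper's exact algorithm.

import numpy as np

def alternating_basis_clustering(F, n_clusters, n_iter=50, seed=0):
    # Sketch only: alternately (1) fit one direction per cluster by least
    # squares so that members project near 1 and non-members near 0, and
    # (2) reassign each sample to the cluster with the largest projection.
    n = F.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_clusters, size=n)
    Fb = np.column_stack([F, np.ones(n)])            # allow an affine offset
    for _ in range(n_iter):
        Y = np.eye(n_clusters)[labels]               # 0/1 cluster indicators
        W, *_ = np.linalg.lstsq(Fb, Y, rcond=None)   # one basis vector per cluster
        new_labels = (Fb @ W).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels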

64 4 Clustering Artificial Data: We will perform a clustering experiment with artificially generated data (Fig. [sent-177, score-0.229]

65 Since this data has a complicated structure, a Gaussian mixture is used as the probabilistic model for the Fisher score, where each component is a Gaussian distribution with its own mean and covariance matrix. [sent-179, score-0.095]

66 The parameters are learned with the EM algorithm and the marginal distribution is accurately estimated as shown in Fig. [sent-180, score-0.077]

67 We applied the proposed algorithm and K-Means to the Fisher score calculated by taking derivatives with respect to . [sent-182, score-0.284]
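
The sentence above does not specify which parameters are differentiated; as one hypothetical choice, the sketch below computes Fisher scores of a fitted Gaussian mixture with respect to the component means, using the analytic gradient ∂/∂μ_k log p(x) = r_k(x) Σ_k⁻¹ (x − μ_k), where r_k(x) is the responsibility of component k.

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_mean_fisher_score(gmm, X):
    # Fisher score w.r.t. the component means of a fitted GaussianMixture:
    # d/d mu_k log p(x) = r_k(x) * Sigma_k^{-1} (x - mu_k), r_k = responsibility.
    R = gmm.predict_proba(X)                         # shape (n, K)
    parts = []
    for k in range(gmm.n_components):
        prec = np.linalg.inv(gmm.covariances_[k])    # assumes covariance_type='full'
        parts.append((R[:, [k]] * (X - gmm.means_[k])) @ prec)
    return np.hstack(parts)                          # shape (n, K * d)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)   # learned with EM
print(gmm_mean_fisher_score(gmm, X).shape)           # (200, 8)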

68 As in Fig. 3, initial clusters are constructed by randomly combining these subclusters. [sent-185, score-0.144]

69 For each method, we chose the best result which achieved the minimum loss among the local minima obtained from 100 clustering experiments. [sent-186, score-0.331]

70 As a result, the proposed method obtained clearly separated clusters (Fig. [sent-187, score-0.149]

71 3, upper right), but K-Means failed to recover the “correct” clusters, which is attributed to the effect of nuisance dimensions (Fig. [sent-188, score-0.741]

72 When the covariance matrix of each cluster is allowed to be different in K-Means, it becomes invariant to normalization. [sent-197, score-0.114]

73 However this method in turn causes singularities, where a cluster shrinks to the delta distribution, and it is difficult to learn in high-dimensional spaces. [sent-198, score-0.121]

74 Standard mixture modeling methods have difficulties in modeling such complicated cluster shapes [9, 10]. [sent-206, score-0.156]

75 One straightforward way is to model each cluster as a Gaussian mixture; however, special care needs to be taken for such a “mixture of mixtures” problem. [sent-207, score-0.087]

76 One more important feature of gyrB is its capability of serving as an evolutionary and taxonomic marker, as an alternative to the popular 16S rRNA [17]. [sent-218, score-0.138]

77 Our data set consists of 55 amino acid sequences containing three clusters (9,32,14). [sent-219, score-0.346]

78 The three clusters correspond to three genera of high GC-content gram-positive bacteria, that is, Corynebacteria, Mycobacteria, Rhodococcus, respectively. [sent-220, score-0.119]

79 Each sequence is represented as a string over an alphabet of 20 characters, each of which represents an amino acid. [sent-221, score-0.161]

80 The length of each sequence varies from 408 to 442, which makes it difficult to convert a sequence into a vector of fixed dimensionality. [sent-222, score-0.062]
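
In the paper this is handled by deriving the Fisher score from a hidden Markov model fitted to the sequences (the number of HMM states is varied in the results below). As a much simpler stand-in that still shows how variable-length strings become fixed-length vectors, the sketch below uses an i.i.d. multinomial (unigram) model over the 20 amino-acid symbols: the derivative of the log-likelihood with respect to each symbol probability is the symbol count divided by that probability, giving a 20-dimensional score for every sequence regardless of its length. The toy sequences are hypothetical.

import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"                    # the 20 amino-acid symbols
IDX = {a: i for i, a in enumerate(ALPHABET)}

def unigram_fisher_score(seqs):
    # Fixed-length (20-dim) scores for variable-length sequences under an
    # i.i.d. multinomial model: d/d theta_a log p(seq) = count_a / theta_a.
    counts = np.zeros((len(seqs), len(ALPHABET)))
    for i, s in enumerate(seqs):
        for ch in s:
            counts[i, IDX[ch]] += 1
    # Laplace-smoothed ML estimate of the symbol probabilities.
    theta = (counts.sum(axis=0) + 1.0) / (counts.sum() + len(ALPHABET))
    return counts / theta

seqs = ["MKVLAT", "MKKLIAVT", "GAVLIMKK"]            # toy sequences of different lengths
print(unigram_fisher_score(seqs).shape)              # (3, 20)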

81 Let be the obtained clusters and be the ground truth clusters. [sent-224, score-0.119]

82 Also let and be the number of samples in and , respectively. [sent-226, score-0.068]

83 When the two partitions are exactly the same, ARI is 1, and the expected value of ARI over random partitions is 0 (see [4] for details). [sent-235, score-0.104]
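
For reference, the Adjusted Rand Index can be computed directly from the contingency table of the two partitions (Hubert-Arabie form); a minimal implementation, checked against scikit-learn, is sketched below with hypothetical label vectors.

import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score

def ari(labels_true, labels_pred):
    # Adjusted Rand Index from the contingency table (Hubert-Arabie form).
    _, ci = np.unique(labels_true, return_inverse=True)
    _, cj = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((ci.max() + 1, cj.max() + 1))
    np.add.at(table, (ci, cj), 1)
    sum_nij = comb(table, 2).sum()
    sum_a = comb(table.sum(axis=1), 2).sum()         # over the true classes
    sum_b = comb(table.sum(axis=0), 2).sum()         # over the predicted clusters
    expected = sum_a * sum_b / comb(len(ci), 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_nij - expected) / (max_index - expected)

truth = [0] * 9 + [1] * 32 + [2] * 14                # the three gyrB genera sizes
pred = truth[:5] + [1] * 4 + truth[9:]               # a slightly perturbed clustering
print(ari(truth, pred), adjusted_rand_score(truth, pred))   # the two values agree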

84 Notice that we did not perform any normalization to the Fisher score vectors. [sent-242, score-0.284]

85 In order to avoid local minima, we tried 1000 different initial values and chose the one which achieved the minimum loss both in K-means and our method. [sent-243, score-0.079]

86 The proposed method achieves its best result (ARI 0.754) when the number of HMM states is 3, which shows that important dimensions are successfully discovered from the “sea” of nuisance dimensions. [sent-251, score-0.75]

87 In contrast, the ARI of K-Means decreases monotonically as the number of HMM states increases, which shows that K-Means is not robust against nuisance dimensions. [sent-252, score-0.549]

88 But when there are too many nuisance dimensions (i.e. [sent-253, score-0.691]

89 ), our method was caught in false clusters which happened to appear in nuisance dimensions. [sent-255, score-0.668]

90 6 Concluding Remarks: In this paper, we illustrated how the class information is encoded in the Fisher score: most information is packed into a few dimensions and there are many nuisance dimensions. [sent-258, score-0.795]

91 Advanced supervised classifiers such as the support vector machine have a built-in feature selector [7], so they can detect important dimensions automatically. [sent-259, score-0.359]

92 However in unsupervised learning, it is not easy to detect important dimensions because of the lack of class labels. [sent-260, score-0.264]

93 We proposed a novel, very simple clustering algorithm that can ignore nuisance dimensions and tested it in experiments on artificial and real data. [sent-261, score-0.943]

94 An interesting aspect of our gyrB experiment is that the ideal scenario assumed in the theory section is not fulfilled anymore as clusters might overlap. [sent-262, score-0.119]

95 The Fisher score derives features using the prior knowledge of the marginal distribution. [sent-264, score-0.361]

96 In general, it is impossible to infer anything about the conditional distribution from the marginal [12] without any further assumptions. [sent-265, score-0.136]

97 However, when one knows the directions that the marginal distribution can move (i. [sent-266, score-0.077]

98 the model of the marginal distribution), it is possible to extract information about the conditional distribution, even though it may be corrupted by many nuisance dimensions. [sent-268, score-0.655]

99 Acknowledgement: The authors gratefully acknowledge that the bacterial gyrB amino acid sequences were offered courtesy of the Identification and Classification of Bacteria (ICB) database team [17]. [sent-273, score-0.297]

100 A new discriminative kernel from probabilistic models. [sent-405, score-0.077]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nuisance', 0.549), ('fisher', 0.469), ('score', 0.284), ('clustering', 0.229), ('ari', 0.159), ('dimensions', 0.142), ('gyrb', 0.13), ('extractor', 0.125), ('clusters', 0.119), ('amino', 0.099), ('vg', 0.088), ('cluster', 0.087), ('hmm', 0.084), ('acid', 0.083), ('ih', 0.082), ('marginal', 0.077), ('tangent', 0.073), ('preserved', 0.067), ('feature', 0.061), ('tsuda', 0.059), ('class', 0.058), ('classi', 0.052), ('partitions', 0.052), ('recover', 0.05), ('baa', 0.05), ('baab', 0.05), ('lowerright', 0.05), ('selector', 0.05), ('sonnenburg', 0.05), ('upperleft', 0.05), ('tsch', 0.048), ('lot', 0.046), ('sequences', 0.045), ('rand', 0.043), ('bacteria', 0.043), ('bacterial', 0.043), ('alternating', 0.043), ('supervised', 0.042), ('whitened', 0.04), ('icb', 0.04), ('kawanabe', 0.04), ('potsdam', 0.04), ('lemma', 0.039), ('mixture', 0.036), ('true', 0.035), ('let', 0.035), ('dif', 0.035), ('roberts', 0.035), ('important', 0.034), ('dimensional', 0.034), ('samples', 0.033), ('anything', 0.033), ('bases', 0.033), ('manifold', 0.033), ('complicated', 0.033), ('cult', 0.032), ('gc', 0.032), ('specialized', 0.032), ('scenarios', 0.032), ('fix', 0.032), ('nucleic', 0.032), ('extraction', 0.031), ('sequence', 0.031), ('transformation', 0.031), ('loss', 0.031), ('invertible', 0.03), ('separated', 0.03), ('projected', 0.03), ('cation', 0.03), ('detect', 0.03), ('dna', 0.029), ('predictors', 0.029), ('inappropriate', 0.029), ('obviously', 0.029), ('extract', 0.029), ('kernel', 0.028), ('lled', 0.027), ('database', 0.027), ('invariant', 0.027), ('probabilistic', 0.026), ('impossible', 0.026), ('existence', 0.026), ('fully', 0.026), ('arti', 0.026), ('chain', 0.025), ('successfully', 0.025), ('achieves', 0.025), ('initial', 0.025), ('result', 0.025), ('close', 0.024), ('contribute', 0.024), ('advanced', 0.024), ('contained', 0.024), ('ignore', 0.023), ('minimum', 0.023), ('minima', 0.023), ('adjusted', 0.023), ('discriminative', 0.023), ('dimension', 0.023), ('type', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) has been increasingly used as a feature extractor for classification problems. The Fisher score is the vector of parameter derivatives of the log-likelihood of a probabilistic model. This paper gives a theoretical analysis of how class information is preserved in the space of the Fisher score; it turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit the important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

2 0.21579269 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor

Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features . In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. 1

3 0.1827646 27 nips-2002-An Impossibility Theorem for Clustering

Author: Jon M. Kleinberg

Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1

4 0.17367074 98 nips-2002-Going Metric: Denoising Pairwise Data

Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods. 1

5 0.14815173 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

Author: Eleazar Eskin, Jason Weston, William S. Noble, Christina S. Leslie

Abstract: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of -length subsequences, counted with up to mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.

6 0.14559725 90 nips-2002-Feature Selection in Mixture-Based Clustering

7 0.12766883 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

8 0.12388664 26 nips-2002-An Estimation-Theoretic Framework for the Presentation of Multiple Stimuli

9 0.11917821 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

10 0.11035647 67 nips-2002-Discriminative Binaural Sound Localization

11 0.10681292 188 nips-2002-Stability-Based Model Selection

12 0.10626049 113 nips-2002-Information Diffusion Kernels

13 0.096893586 83 nips-2002-Extracting Relevant Structures with Side Information

14 0.078438252 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

15 0.075045794 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

16 0.070802703 89 nips-2002-Feature Selection by Maximum Marginal Diversity

17 0.070782162 87 nips-2002-Fast Transformation-Invariant Factor Analysis

18 0.069365121 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs

19 0.068148911 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains

20 0.068119444 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.221), (1, -0.112), (2, 0.068), (3, -0.036), (4, -0.135), (5, 0.049), (6, -0.056), (7, -0.148), (8, -0.083), (9, 0.153), (10, 0.086), (11, 0.127), (12, -0.104), (13, 0.175), (14, -0.105), (15, -0.073), (16, -0.026), (17, -0.004), (18, -0.047), (19, 0.108), (20, 0.041), (21, -0.153), (22, -0.11), (23, 0.154), (24, 0.002), (25, 0.091), (26, -0.089), (27, 0.042), (28, -0.005), (29, -0.152), (30, -0.032), (31, 0.053), (32, -0.104), (33, -0.209), (34, 0.111), (35, 0.013), (36, -0.013), (37, -0.077), (38, 0.049), (39, -0.092), (40, 0.032), (41, -0.05), (42, -0.015), (43, -0.098), (44, -0.066), (45, 0.006), (46, -0.036), (47, 0.126), (48, -0.016), (49, -0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9497515 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

2 0.61361241 27 nips-2002-An Impossibility Theorem for Clustering

Author: Jon M. Kleinberg

Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1

3 0.59206074 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor

Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features . In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. 1

4 0.55739546 98 nips-2002-Going Metric: Denoising Pairwise Data

Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods. 1

5 0.5058158 188 nips-2002-Stability-Based Model Selection

Author: Tilman Lange, Mikio L. Braun, Volker Roth, Joachim M. Buhmann

Abstract: Model selection is linked to model assessment, which is the problem of comparing different models, or model parameters, for a specific learning task. For supervised learning, the standard practical technique is crossvalidation, which is not applicable for semi-supervised and unsupervised settings. In this paper, a new model assessment scheme is introduced which is based on a notion of stability. The stability measure yields an upper bound to cross-validation in the supervised case, but extends to semi-supervised and unsupervised problems. In the experimental part, the performance of the stability measure is studied for model order selection in comparison to standard techniques in this area.

6 0.50458896 90 nips-2002-Feature Selection in Mixture-Based Clustering

7 0.48053163 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

8 0.47978604 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains

9 0.458161 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

10 0.44176024 67 nips-2002-Discriminative Binaural Sound Localization

11 0.44082198 26 nips-2002-An Estimation-Theoretic Framework for the Presentation of Multiple Stimuli

12 0.4368028 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences

13 0.38974056 89 nips-2002-Feature Selection by Maximum Marginal Diversity

14 0.36166856 113 nips-2002-Information Diffusion Kernels

15 0.33495244 83 nips-2002-Extracting Relevant Structures with Side Information

16 0.31517267 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm

17 0.31483123 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

18 0.29552153 87 nips-2002-Fast Transformation-Invariant Factor Analysis

19 0.28863442 114 nips-2002-Information Regularization with Partially Labeled Data

20 0.27620637 167 nips-2002-Rational Kernels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.016), (14, 0.017), (42, 0.079), (54, 0.152), (55, 0.045), (67, 0.016), (68, 0.029), (74, 0.124), (77, 0.157), (79, 0.025), (87, 0.04), (92, 0.06), (98, 0.141)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95176548 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences

Author: Eric P. Xing, Michael I. Jordan, Richard M. Karp, Stuart Russell

Abstract: We propose a dynamic Bayesian model for motifs in biopolymer sequences which captures rich biological prior knowledge and positional dependencies in motif structure in a principled way. Our model posits that the position-specific multinomial parameters for monomer distribution are distributed as a latent Dirichlet-mixture random variable, and the position-specific Dirichlet component is determined by a hidden Markov process. Model parameters can be fit on training motifs using a variational EM algorithm within an empirical Bayesian framework. Variational inference is also used for detecting hidden motifs. Our model improves over previous models that ignore biological priors and positional dependence. It has much higher sensitivity to motifs during detection and a notable ability to distinguish genuine motifs from false recurring patterns.

same-paper 2 0.89649141 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) has been increasingly used as a feature extractor for classification problems. The Fisher score is the vector of parameter derivatives of the log-likelihood of a probabilistic model. This paper gives a theoretical analysis of how class information is preserved in the space of the Fisher score; it turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit the important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

3 0.83020592 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

Author: Max Welling, Simon Osindero, Geoffrey E. Hinton

Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.

4 0.82891971 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1

5 0.827609 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA.
6 0.82704008 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

7 0.82673609 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

8 0.82607555 124 nips-2002-Learning Graphical Models with Mercer Kernels

9 0.82584298 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

10 0.82582211 27 nips-2002-An Impossibility Theorem for Clustering

11 0.82535976 10 nips-2002-A Model for Learning Variance Components of Natural Images

12 0.82242811 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

13 0.82156318 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

14 0.82075441 2 nips-2002-A Bilinear Model for Sparse Coding

15 0.82056469 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

16 0.81897682 188 nips-2002-Stability-Based Model Selection

17 0.81859362 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories

18 0.81759298 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

19 0.817173 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

20 0.81715584 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction