nips nips2002 nips2002-191 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features. In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. [sent-4, score-0.81]
2 The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. [sent-5, score-0.935]
3 Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. [sent-6, score-0.427]
4 We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. [sent-7, score-0.164]
5 By adjusting the parametrisation we can also influence the weighting received by the features. [sent-8, score-0.308]
6 In this way we are able to obtain a logarithmic weighting in a Fisher kernel. [sent-9, score-0.179]
7 Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. [sent-10, score-0.328]
8 1 Introduction Recently the string kernel [6] has been shown to achieve good performance on text-categorisation tasks. [sent-11, score-0.546]
9 The string kernel projects documents into a feature space indexed by all k-tuples of symbols for some fixed k. [sent-12, score-0.859]
10 The value of the feature indexed by a tuple u = (u_1, ..., u_k) for a document d is the sum over all occurrences of u as a subsequence (not necessarily contiguous) in d, where each occurrence is weighted by an exponentially decaying function of its length in d. [sent-16, score-0.253]
11 This naturally extends the idea of an n-gram feature space where the only occurrences considered are contiguous ones. [sent-17, score-0.147]
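To make the two feature maps concrete, here is a small sketch (not from the paper; the function names and the toy strings are ours): contiguous n-gram term frequencies versus gap-weighted k-subsequence features, where each occurrence i_1 < ... < i_k of a subsequence contributes lam raised to the length it spans.

```python
from collections import defaultdict
from itertools import combinations

def ngram_features(doc, k):
    """Contiguous k-gram term frequencies: the n-gram feature map."""
    phi = defaultdict(int)
    for i in range(len(doc) - k + 1):
        phi[doc[i:i + k]] += 1
    return dict(phi)

def string_kernel_features(doc, k, lam):
    """Gap-weighted k-subsequence features: each (possibly non-contiguous)
    occurrence at positions i_1 < ... < i_k contributes lam**(i_k - i_1 + 1)."""
    phi = defaultdict(float)
    for idx in combinations(range(len(doc)), k):
        u = "".join(doc[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return dict(phi)
```

This brute-force enumeration is exponential and only usable on toy inputs; it exists to make the feature spaces explicit, not to replace the dynamic-programming evaluation.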
12 The dimension of the feature space and the non-sparsity of even modestly sized documents makes a direct computation of the feature vector for the string kernel infeasible. [sent-18, score-0.732]
13 There is, however, a dynamic programming recursion that enables the semi-efficient evaluation of the kernel [6]. [sent-19, score-0.218]
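A naive version of that dynamic-programming recursion (following the form given by Lodhi et al. [6]; variable names and the O(k|s||t|^2) layout are ours, not an optimised implementation) can be sketched as:

```python
def string_kernel(s, t, k, lam):
    """Gap-weighted subsequence kernel via the standard DP recursion.
    Kp[i][p][q] holds K'_i(s[:p], t[:q]), the weight of length-i subsequence
    matches penalised to the ends of both prefixes."""
    m, n = len(s), len(t)
    Kp = [[[1.0] * (n + 1) for _ in range(m + 1)]]  # K'_0 = 1 everywhere
    for i in range(1, k):
        cur = [[0.0] * (n + 1) for _ in range(m + 1)]
        for p in range(i, m + 1):
            for q in range(i, n + 1):
                val = lam * cur[p - 1][q]
                for j in range(q):  # matches of s[p-1] inside t[:q]
                    if t[j] == s[p - 1]:
                        val += Kp[i - 1][p - 1][j] * lam ** (q - j + 1)
                cur[p][q] = val
        Kp.append(cur)
    K = 0.0
    for p in range(1, m + 1):
        for j in range(n):
            if t[j] == s[p - 1]:
                K += Kp[k - 1][p - 1][j] * lam ** 2
    return K
```

For "cat" against itself with k = 2 and decay 0.5, the features "ca", "at" (weight lam^2 each) and "ct" (lam^3) give K = 2*lam^4 + lam^6, which the recursion reproduces.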
14 String kernels are apparently making no use of the semantic prior knowledge that the structure of words can give and yet they have been used with considerable success. [sent-20, score-0.2]
15 The aim of this paper is to place the n-gram and string kernels in the context of probabilistic modelling of sequences, showing that they can be viewed as Fisher kernels of a Markov generation process. [sent-21, score-0.576]
16 Furthermore, this view also suggests extending consideration to subsequences of varying lengths in the same model. [sent-23, score-0.335]
17 The refined probabilistic model that this affords gives rise to two Fisher kernels depending on the parametrisation that is chosen, if we take the Fisher information matrix to be the identity. [sent-25, score-0.214]
18 We give experimental evidence suggesting that the new kernels are capturing useful properties of the data while overcoming the computational difficulties of the original string kernel. [sent-26, score-0.467]
19 2 The Fisher View of the n-gram and String Kernels In this section we show how the string kernel can be thought of as a type of Fisher kernel [2] where the fixed-length subsequences used as the features in the string kernel correspond to the parameters for building the model. [sent-27, score-1.711]
20 In order to give some insight into the kernel we first give a Fisher formulation of the n-gram kernel (i.e. [sent-28, score-0.528]
21 the string kernel which considers only contiguous sequences), and then extend this to the full string kernel. [sent-30, score-0.947]
22 Let us assume that we have some document d of length s which is a sequence of symbols belonging to some alphabet A, i.e. [sent-31, score-0.348]
23 We can consider document d as being generated by a k-stage Markov process. [sent-37, score-0.151]
24 According to this view, for sequences u ∈ A^{k-1} we can define the probability of observing a symbol x after a sequence u as p_{u→x}. [sent-38, score-0.256]
25 The probability of a document d being generated by the model is therefore P(d) = ∏_{j=k}^{|d|} p_{d[j-k+1 : j-1] → d_j} = ∏_{u,x} p_{u→x}^{tf(ux,d)} (1) [sent-40, score-0.177]
26 where tf(ux, d) is the term frequency of ux in d, that is, the number of times the string ux occurs in d. [sent-45, score-0.673]
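Equation (1) can be checked numerically on a toy document (a sketch with our own names; the uniform parameters in the test mirror the uniform setting discussed below): the step-by-step Markov product and the term-frequency factorisation agree.

```python
from collections import Counter
from math import prod

def doc_probability(doc, k, p):
    """P(d) two ways: as the stepwise product over positions j = k..|d|
    of p[(context, next symbol)], and as the product over transition
    types (u, x) of p[(u, x)] ** tf(ux, d)."""
    steps = [(doc[j - k:j - 1], doc[j - 1]) for j in range(k, len(doc) + 1)]
    stepwise = prod(p[ux] for ux in steps)
    tf = Counter(steps)                       # term frequency of each ux
    factored = prod(p[ux] ** c for ux, c in tf.items())
    return stepwise, factored
```

Both return values must coincide for any parameter choice, which is exactly the regrouping performed in (1).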
27 Different choices of the parameters p_{u→x} give rise to different models and hence different kernels. [sent-54, score-0.188]
28 It is perhaps surprising that the n-gram kernel is recovered (up to a constant factor) if we set p_{u→x} = |A|^{-1} for all u ∈ A^{n-1} and x ∈ A, that is the least informative parameter setting. [sent-55, score-0.253]
29 This follows since the feature vector of a document d then has entries proportional to the term frequencies tf(ux, d). We therefore recover the n-gram kernel as the Fisher kernel of a model which uses a uniform distribution for generating documents. [sent-56, score-0.733]
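At the uniform point p_{u→x} = |A|^{-1}, the Fisher score ∂ ln P(d)/∂p_{u→x} = tf(ux, d)/p_{u→x} = |A| tf(ux, d), so the score vector is the n-gram feature map scaled by |A|. A minimal sketch (our own names):

```python
from collections import Counter

def fisher_features_uniform(doc, k, alphabet):
    """Fisher score of ln P(d) w.r.t. p_{u->x} at the uniform point:
    tf(ux, d) / (1/|A|) = |A| * tf(ux, d) -- the n-gram map up to a constant."""
    tf = Counter(doc[i:i + k] for i in range(len(doc) - k + 1))
    return {ux: len(alphabet) * c for ux, c in tf.items()}
```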
30 Before considering how the p_{u→x} might be chosen non-uniformly we turn our attention briefly to the string kernel. [sent-57, score-0.328]
31 We have shown that we can view the n-gram kernel as a Fisher kernel. [sent-58, score-0.246]
32 A little more work is needed in order to place the full string kernel (which considers noncontiguous subsequences) in the same framework. [sent-59, score-0.546]
33 First we define an index set S_{k-1,q} over all (possibly non-contiguous) subsequences of length k which finish in position q, S_{k-1,q} = {i : 1 ≤ i_1 < i_2 < ... < i_{k-1} < q}. [sent-60, score-0.304]
34 We now define a probability distribution P_{S_{k-1,q}} over S_{k-1,q} by weighting sequence i by λ^{l(i)}. [sent-64, score-0.285]
35 We will refer to this model as the Generalised k-stage Markov model with decay factor λ. [sent-73, score-0.219]
36 The relevant quantity takes the form Σ_i λ^{l(i)} χ_{ux}(d[i]), where χ_{ux} is the indicator function for string ux. [sent-78, score-0.438]
37 It follows that the corresponding Fisher features will be the weighted sum over all subsequences with decay factor λ. [sent-79, score-0.305]
38 Proposition 1 The Fisher kernel of the generalised k-stage Markov model with decay factor λ and constant p_{u→x} is the string kernel of length k and decay factor λ. [sent-81, score-1.34]
39 3 The Finite State Machine Model Viewing the n-gram and string kernels as Fisher kernels of Markov models means we can view the different sequences of k - 1 symbols as defining states with the next symbol controlling the transition to the next state. [sent-82, score-1.006]
40 We therefore arrive at a finite state automaton with states indexed by A^{k-1} and transitions labelled by the elements of A. [sent-83, score-0.599]
41 Hence, if u ∈ A^{k-1} the symbol x ∈ A causes the transition to state v[2 : k], where v = ux. [sent-84, score-0.277]
42 One drawback of the string kernel is that the value of k has to be chosen a priori and is then fixed. [sent-85, score-0.546]
43 A more flexible approach would be to consider different length subsequences as features, depending on their frequency. [sent-86, score-0.29]
44 Subsequences that occur very frequently should be given a low weighting, as they do not contain much information in the same way that stop words are often removed from the bag of words representation. [sent-87, score-0.238]
45 Hence, the 3-gram 'com' could be very frequent and hence not a useful discriminator. [sent-89, score-0.195]
46 By extending it either backwards or forwards we would arrive at subsequences that are less frequent and so potentially carry useful information. [sent-90, score-0.449]
47 Clearly, extending a sequence will always reduce its frequency since the extension could have been made in many distinct ways all of which contribute to the frequency of the root n-gram. [sent-91, score-0.198]
48 As this derivation follows more naturally from the analysis of the n-gram kernel described in Section 2 we will only consider contiguous subsequences also known as substrings. [sent-92, score-0.499]
49 Definition 2 An FSM model is a triple F = (Σ, δ, p), where: 1. the non-empty set Σ of states is a finite subset of A* closed under taking substrings; [sent-95, score-0.307]
50 2. the transition function δ : Σ × A → Σ is defined by δ(u, x) = v[j : l(v)], where v = ux and j = min{j : v[j : l(v)] ∈ Σ}, if the minimum is defined, and otherwise by the empty sequence ε; [sent-96, score-0.287]
51 3. for each state u ∈ Σ the function p gives a function p_u, which is either a distribution over next symbols p_u(x) or the all-one function p_u(x) = 1, for x ∈ A. [sent-97, score-0.302]
52 Given an FSM model F = (Σ, δ, p), to process a document d we start at the state corresponding to the empty sequence ε (guaranteed to be in Σ as Σ is non-empty and closed under taking substrings) and follow the transitions dictated by the symbols of the document. [sent-98, score-0.666]
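The transition function of Definition 2 and the document-processing loop can be sketched as follows (our own names; states are strings over the alphabet, with "" standing for the empty sequence ε, which closure under substrings guarantees is a state):

```python
from collections import Counter

def delta(state, x, states):
    """Transition of Definition 2: from state u on symbol x, move to the
    longest suffix of v = u + x that is itself a state; "" is always a
    state, so the loop always terminates with a valid answer."""
    v = state + x
    for j in range(len(v) + 1):
        if v[j:] in states:
            return v[j:]

def process(doc, states):
    """Run doc through the FSM, counting each (state, symbol) transition used."""
    counts, u = Counter(), ""
    for x in doc:
        counts[(u, x)] += 1
        u = delta(u, x, states)
    return counts
```

With the fixed-k states A^{k-1} this reduces to the k-stage Markov automaton above; the transition counts are exactly the sparse Fisher feature vector described next.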
53 The probability of a document in the model is the product of the values on all of the transitions used: P(d) = ∏_{j=1}^{|d|} p_{u_{j-1}}(d_j), where u_{j-1} is the state reached after processing d[1 : j-1]. [sent-99, score-0.36]
54 Note that requiring the set Σ to be closed under taking substrings ensures that the minimum in the definition of δ is always defined and that d[i_j : j] does indeed define the state at stage j (this follows from a simple inductive argument on the sequence of states). [sent-101, score-0.436]
55 Hence, given an FSM model we can construct the corresponding Fisher kernel feature vector by simply processing the document through the FSM and recording the counts for each transition. [sent-103, score-0.489]
56 The corresponding feature vector will be sparse relative to the dimension of the feature space (the total number of transitions in the FSM) since only those transitions actually used will have non-zero entries. [sent-104, score-0.458]
57 Hence, as for the bag of words we can create feature vectors by listing the indices of transitions used followed by their frequency. [sent-105, score-0.435]
58 The number of non-zero features will be at most equal to the number of symbols in the document. [sent-106, score-0.135]
59 A problem that we have observed when experimenting with the n-gram model is that, if we estimate the frequencies of transitions from the corpus, certain transitions can become very frequent while others from the same state occur only rarely. [sent-109, score-0.768]
60 In such cases the rare states will receive a very high weighting in the Fisher score vector. [sent-110, score-0.282]
61 One would like to use the strategy adopted for the idf weighting for the bag of words kernel, which is often taken to be ln(m/m_i), where m is the number of documents and m_i the number containing term i. [sent-111, score-0.72]
62 The ln ensures that the contrast in weighting is controlled. [sent-112, score-0.179]
63 We can obtain this effect in the Fisher kernel if we reparametrise the transition probabilities as follows: p_u(x) = exp(-exp(-t_u(x))), where t_u(x) is the new parameter. [sent-113, score-0.352]
64 With this parametrisation the derivative of the ln probabilities becomes ∂ ln p_u(x) / ∂t_u(x) = exp(-t_u(x)) = -ln p_u(x), as required. [sent-114, score-0.139]
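This identity is easy to verify numerically (a sketch with our own names, using a central finite difference): since p(t) = exp(-exp(-t)), we have ln p(t) = -exp(-t), whose t-derivative is exp(-t) = -ln p(t).

```python
import math

def p_of_t(t):
    """Reparametrised transition probability p = exp(-exp(-t))."""
    return math.exp(-math.exp(-t))

def dlnp_dt(t, h=1e-6):
    """Central finite difference of ln p(t)."""
    return (math.log(p_of_t(t + h)) - math.log(p_of_t(t - h))) / (2 * h)
```

The finite difference agrees with the analytic claim -ln p(t) to within discretisation error, confirming that the Fisher score under this parametrisation carries the logarithmic weighting.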
65 Although this improves performance, the problem of frequent substrings being uninformative remains. [sent-115, score-0.223]
66 We now consider the idea outlined above of moving to longer subsequences in order to ensure that transitions are informative. [sent-116, score-0.391]
67 We select all substrings that have occurred at least t times in the document corpus, where t is a small but statistically visible number. [sent-123, score-0.269]
68 Hence, given a corpus S we create the FSM model F_t(S) with Σ_t(S) = {u ∈ A* : u occurs at least t times in the corpus S}. [sent-125, score-0.436]
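Building Σ_t(S) can be sketched by brute-force substring counting (our own names; fine for illustration, though a suffix-tree construction would be needed at scale). Note that closure under taking substrings holds automatically: every substring of a string occurring t times itself occurs at least t times.

```python
from collections import Counter

def frequent_substrings(corpus, t):
    """Sigma_t(S): all substrings occurring at least t times in the corpus,
    plus the empty string.  Automatically closed under taking substrings."""
    counts = Counter()
    for doc in corpus:
        for i in range(len(doc)):
            for j in range(i + 1, len(doc) + 1):
                counts[doc[i:j]] += 1
    return {""} | {u for u, c in counts.items() if c >= t}
```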
69 Taking this definition of Σ_t(S) we construct the corresponding finite state machine model as described in Definition 2. [sent-126, score-0.301]
70 We will refer to the model F_t(S) as the frequent set FSM at threshold t. [sent-127, score-0.172]
71 We now construct the transition probabilities by processing the corpus through F_t(S), keeping a tally of the number of times each transition is actually used. [sent-128, score-0.419]
72 Typically we initialise the counts to some constant value c and convert the resulting counts into probabilities for the model. [sent-129, score-0.132]
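The count-to-probability conversion with the smoothing constant c can be sketched as follows (our own names; the per-state normalisation is an assumption consistent with p_u being a distribution over next symbols):

```python
def transition_probs(counts, states, alphabet, c=1.0):
    """Initialise every count at c, add the observed tallies, and
    normalise within each state to obtain p_u(x)."""
    probs = {}
    for u in states:
        total = sum(c + counts.get((u, x), 0) for x in alphabet)
        for x in alphabet:
            probs[(u, x)] = (c + counts.get((u, x), 0)) / total
    return probs
```

The constant c keeps every p_u(x) strictly positive, which is what bounds the logarithmic Fisher weights in Corollary 4.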
73 The following proposition demonstrates that the model has the desired frequency properties for the transitions. [sent-132, score-0.184]
74 We use the notation u ~ v to indicate the transition from state u to state v on processing symbol x. [sent-133, score-0.353]
75 Proposition 3 Given a corpus S, the FSM model F_t(S) satisfies the following property. [sent-134, score-0.464]
76 Ignoring transitions from states indexed by d[1 : i] for some document d of the corpus, the frequency counts f_{u,x} for transitions u →^x v in the corpus S satisfy Σ_{x∈A} f_{u,x} < t|A| for all u ∈ Σ_t(S). [sent-135, score-1.21]
77 Suppose that for some state u ∈ Σ_t(S), Σ_{x∈A} f_{u,x} ≥ t|A|. (3) This implies that the string u has occurred at least t|A| times at the head of a transition, not at the beginning of a document. [sent-137, score-0.576]
78 Hence, by the pigeonhole principle there is a y ∈ A such that y has occurred t times immediately before one of the transitions in the sum of (3). [sent-138, score-0.257]
79 Note that this also implies that yu occurs at least t times in the corpus and therefore will be in Σ_t(S). [sent-139, score-0.282]
80 Consider one of the transitions that occurs after yu on some symbol x. [sent-140, score-0.381]
81 This transition will not be of the form u →^x v but rather yu →^x v, contradicting its inclusion in the sum (3). [sent-141, score-0.153]
82 • Note that the proposition implies that no individual transition can be more frequent than the full sum. [sent-143, score-0.35]
83 The proposition also has useful consequences for the maximum weighting for any Fisher score entries as the next corollary demonstrates. [sent-144, score-0.362]
84 Corollary 4 Given a corpus S, if we construct the FSM model F_t(S) and compute the probabilities by counting transitions, ignoring those from states indexed by d[1 : i] for some document d of the corpus, the probabilities on the transitions will satisfy p_u(x) ≥ c / ((c + t)|A|) for all u ∈ Σ_t(S) and x ∈ A. [sent-145, score-1.544]
85 • The proposition and corollary demonstrate that the choice of Ft(S) as an FSM model has the desirable property that all of the states are meaningfully frequent while none of the transitions is too frequent and furthermore the Fisher weighting cannot grow too large for any individual transition. [sent-147, score-0.892]
86 In the next section we will present experimental results testing the kernels we have introduced, using the standard and logarithmic weightings. [sent-148, score-0.139]
87 The baseline for the experiments will always be the bag of words kernel using the TFIDF weighting scheme. [sent-149, score-0.574]
88 It is perhaps worth noting that though the IDF weighting appears similar to those described above it makes critical use of the distribution of terms across documents, something that is incompatible with the Fisher approach that we have adopted. [sent-150, score-0.179]
89 We compared the standard n-gram kernel with uniform, non-uniform and ln weighting schemes, and the variable-length FSM model described in Section 4, both with a uniform weighting and a ln weighting scheme. [sent-153, score-0.823]
90 In order to keep the comparison fair, the n-gram kernel features were also pruned from the feature vector if they occurred less than 10 times. [sent-155, score-0.348]
91 The standard bag of words model using the normal tfidf weighting scheme is used as a baseline. [sent-157, score-0.479]
92 If all positive documents in the corpus are ranked higher than any negative documents, then the average precision is 100%. [sent-163, score-0.293]
93 Average precision incorporates both precision and recall measures and is highly sensitive to document ranking, so therefore can be used to obtain a fair comparison between methods. [sent-164, score-0.269]
94 As can be seen from the table, the variable-length subsequence method performs as well as or better than all other methods and achieves a perfect ranking for documents in one of the categories. [sent-166, score-0.161]
95 6 Discussion In this paper we have shown how the string kernel can be thought of as a k-stage Markov process, and as a result interpreted as a Fisher kernel. [sent-228, score-0.578]
96 Using this new insight we have shown how the features of a Fisher kernel can be constructed using a Finite State Model parameterisation which reflects the statistics of the frequency of occurrence of features within the corpus. [sent-229, score-0.445]
97 Experimental results have shown that this model outperforms the standard tfidf bag of words model on a well known data set. [sent-232, score-0.326]
98 Although the experiments in this paper are not extensive, they show that the approach of using a Finite-State-Model to generate a Fisher kernel gives new insights and more flexibility over the string kernel, and performs well. [sent-233, score-0.546]
99 Using the Fisher kernel method to detect remote protein homologies. [sent-243, score-0.581]
100 In Proceedings of the Nineteenth International Conference on Machine Learning (ICML '02), 2002. [sent-265, score-0.142]
wordName wordTfidf (topN-words)
[('fisher', 0.363), ('string', 0.328), ('fsm', 0.319), ('kernel', 0.218), ('subsequences', 0.208), ('transitions', 0.183), ('weighting', 0.179), ('pu', 0.164), ('corpus', 0.154), ('document', 0.151), ('frequent', 0.146), ('th', 0.142), ('idl', 0.131), ('tf', 0.128), ('finite', 0.119), ('bag', 0.116), ('kernels', 0.11), ('ux', 0.11), ('proposition', 0.106), ('pd', 0.104), ('symbol', 0.103), ('transition', 0.098), ('tfidf', 0.097), ('documents', 0.094), ('symbols', 0.084), ('definition', 0.08), ('ctor', 0.078), ('parametrisation', 0.078), ('xux', 0.078), ('substrings', 0.077), ('state', 0.076), ('contiguous', 0.073), ('states', 0.066), ('sk', 0.062), ('indexed', 0.061), ('words', 0.061), ('define', 0.059), ('yu', 0.055), ('docum', 0.052), ('idf', 0.052), ('tij', 0.052), ('frequency', 0.052), ('features', 0.051), ('hence', 0.049), ('counts', 0.048), ('arrive', 0.048), ('sequence', 0.047), ('sequences', 0.047), ('extending', 0.047), ('feature', 0.046), ('dj', 0.046), ('decay', 0.046), ('alexei', 0.046), ('saunders', 0.046), ('automaton', 0.046), ('precision', 0.045), ('markov', 0.045), ('flexible', 0.045), ('fa', 0.043), ('uniform', 0.042), ('occurred', 0.041), ('occurs', 0.04), ('corollary', 0.04), ('reflects', 0.039), ('craig', 0.039), ('score', 0.037), ('length', 0.037), ('subsequence', 0.037), ('generalised', 0.037), ('ent', 0.037), ('probabilities', 0.036), ('informative', 0.035), ('fu', 0.035), ('lodhi', 0.035), ('insight', 0.034), ('closed', 0.034), ('times', 0.033), ('taking', 0.033), ('thought', 0.032), ('defining', 0.032), ('empty', 0.032), ('recover', 0.032), ('defined', 0.03), ('holloway', 0.03), ('leave', 0.03), ('ranking', 0.03), ('exp', 0.029), ('create', 0.029), ('give', 0.029), ('alphabet', 0.029), ('fixed', 0.028), ('generation', 0.028), ('occurrences', 0.028), ('fair', 0.028), ('view', 0.028), ('varying', 0.027), ('model', 0.026), ('suggests', 0.025), ('uc', 0.025), ('derivative', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features . In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. 1
2 0.24088037 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
Author: Eleazar Eskin, Jason Weston, William S. Noble, Christina S. Leslie
Abstract: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of -length subsequences, counted with up to mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings. ¡ ¢
3 0.21579269 53 nips-2002-Clustering with the Fisher Score
Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller
Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).
4 0.21185796 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
5 0.18183447 119 nips-2002-Kernel Dependency Estimation
Author: Jason Weston, Olivier Chapelle, Vladimir Vapnik, André Elisseeff, Bernhard Schölkopf
Abstract: We consider the learning problem of finding a dependency between a general class of objects and another, possibly different, general class of objects. The objects can be for example: vectors, images, strings, trees or graphs. Such a task is made possible by employing similarity measures in both input and output spaces using kernel functions, thus embedding the objects into vector spaces. We experimentally validate our approach on several tasks: mapping strings to strings, pattern recognition, and reconstruction from partial images. 1
6 0.15197147 120 nips-2002-Kernel Design Using Boosting
7 0.14339896 113 nips-2002-Information Diffusion Kernels
8 0.13675284 85 nips-2002-Fast Kernels for String and Tree Matching
9 0.1366125 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
10 0.13280003 156 nips-2002-On the Complexity of Learning the Kernel Matrix
11 0.12677115 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.11606202 167 nips-2002-Rational Kernels
13 0.10610409 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
14 0.10590336 67 nips-2002-Discriminative Binaural Sound Localization
15 0.10435411 106 nips-2002-Hyperkernels
16 0.099620663 26 nips-2002-An Estimation-Theoretic Framework for the Presentation of Multiple Stimuli
17 0.099547356 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
18 0.099037595 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
19 0.092081599 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
20 0.085101269 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
topicId topicWeight
[(0, -0.229), (1, -0.134), (2, 0.101), (3, -0.159), (4, -0.291), (5, -0.034), (6, 0.13), (7, 0.025), (8, 0.086), (9, -0.087), (10, -0.08), (11, 0.06), (12, -0.045), (13, 0.087), (14, -0.072), (15, -0.035), (16, 0.037), (17, 0.114), (18, -0.053), (19, -0.016), (20, 0.056), (21, -0.118), (22, -0.163), (23, 0.095), (24, 0.016), (25, 0.047), (26, -0.094), (27, -0.028), (28, 0.0), (29, -0.119), (30, 0.047), (31, 0.023), (32, -0.176), (33, -0.189), (34, 0.024), (35, 0.02), (36, -0.08), (37, 0.021), (38, 0.113), (39, -0.051), (40, 0.035), (41, -0.023), (42, 0.01), (43, -0.026), (44, -0.156), (45, -0.037), (46, -0.092), (47, 0.053), (48, -0.076), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.96392882 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features . In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. 1
2 0.6522699 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
Author: Eleazar Eskin, Jason Weston, William S. Noble, Christina S. Leslie
Abstract: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of -length subsequences, counted with up to mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings. ¡ ¢
3 0.62325168 167 nips-2002-Rational Kernels
Author: Corinna Cortes, Patrick Haffner, Mehryar Mohri
Abstract: We introduce a general family of kernels based on weighted transducers or rational relations, rational kernels, that can be used for analysis of variable-length sequences or more generally weighted automata, in applications such as computational biology or speech recognition. We show that rational kernels can be computed efficiently using a general algorithm of composition of weighted transducers and a general single-source shortest-distance algorithm. We also describe several general families of positive definite symmetric rational kernels. These general kernels can be combined with Support Vector Machines to form efficient and powerful techniques for spoken-dialog classification: highly complex kernels become easy to design and implement and lead to substantial improvements in the classification accuracy. We also show that the string kernels considered in applications to computational biology are all specific instances of rational kernels.
4 0.59521687 113 nips-2002-Information Diffusion Kernels
Author: Guy Lebanon, John D. Lafferty
Abstract: A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with non-parametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.
5 0.58584875 53 nips-2002-Clustering with the Fisher Score
Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller
Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).
6 0.55451429 85 nips-2002-Fast Kernels for String and Tree Matching
7 0.54118091 125 nips-2002-Learning Semantic Similarity
8 0.50993299 119 nips-2002-Kernel Dependency Estimation
9 0.44191548 120 nips-2002-Kernel Design Using Boosting
10 0.42303878 67 nips-2002-Discriminative Binaural Sound Localization
11 0.40772802 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
12 0.3949194 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
13 0.38343686 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
14 0.38229215 156 nips-2002-On the Complexity of Learning the Kernel Matrix
15 0.37978896 26 nips-2002-An Estimation-Theoretic Framework for the Presentation of Multiple Stimuli
16 0.36615759 106 nips-2002-Hyperkernels
17 0.34649068 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
18 0.34462196 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
19 0.32415897 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
20 0.30326691 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
topicId topicWeight
[(11, 0.061), (23, 0.038), (42, 0.098), (51, 0.281), (54, 0.169), (55, 0.032), (67, 0.022), (68, 0.019), (74, 0.069), (87, 0.019), (92, 0.023), (98, 0.082)]
simIndex simValue paperId paperTitle
1 0.87675989 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA
Author: Kwokleung Chan, Te-Won Lee, Terrence J. Sejnowski
Abstract: Missing data is common in real-world datasets and is a problem for many estimation techniques. We have developed a variational Bayesian method to perform Independent Component Analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixture of Gaussians allows sources to be estimated with different kurtosis and skewness. The variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. This allows direct probability estimation of missing values in the high dimensional space and avoids dimension reduction preprocessing which is not feasible with missing data.
same-paper 2 0.84341735 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features . In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. 1
3 0.79365158 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: We present the bump mixture model, a statistical model for analog data where the probabilistic semantics, inference, and learning rules derive from low-level transistor behavior. The bump mixture model relies on translinear circuits to perform probabilistic inference, and floating-gate devices to perform adaptation. This system is low power, asynchronous, and fully parallel, and supports various on-chip learning algorithms. In addition, the mixture model can perform several tasks such as probability estimation, vector quantization, classification, and clustering. We tested a fabricated system on clustering, quantization, and classification of handwritten digits and show performance comparable to the E-M algorithm on mixtures of Gaussians.

1 Introduction

Many system-on-a-chip applications, such as data compression and signal processing, use online adaptation to improve or tune performance. These applications can benefit from the low-power compact design that analog VLSI learning systems can offer. Analog VLSI learning systems can benefit immensely from flexible learning algorithms that take advantage of silicon device physics for compact layout, and that are capable of a variety of learning tasks. One learning paradigm that encompasses a wide variety of learning tasks is density estimation, learning the probability distribution over the input data. A silicon density estimator can provide a basic template for VLSI systems for feature extraction, classification, adaptive vector quantization, and more. In this paper, we describe the bump mixture model, a statistical model that describes the probability distribution function of analog variables using low-level transistor equations. We intend the bump mixture model to be the silicon version of the mixture of Gaussians [1], one of the most widely used statistical methods for modeling the probability distribution of a collection of data.
Mixtures of Gaussians appear in many contexts, from radial basis functions [1] to hidden Markov models [2]. In the bump mixture model, probability computations derive from translinear circuits [3] and learning derives from floating-gate device equations [4]. The bump mixture model can perform different functions such as quantization, probability estimation, and classification. In addition, this VLSI mixture model can implement multiple learning algorithms using different peripheral circuitry. Because the equations for system operation and learning derive from natural transistor behavior, we can build large bump mixture models with millions of parameters on a single chip. We have fabricated a bump mixture model, and tested it on clustering, classification, and vector quantization of handwritten digits. The results show that the fabricated system performs comparably to mixtures of Gaussians trained with the E-M algorithm [1]. Our work builds upon several trends of research in the VLSI community. The results in this paper complement recent work on probability propagation in analog VLSI [5-7]. These previous systems, intended for decoding applications in communication systems, model special forms of probability distributions over discrete variables, and do not incorporate learning. In contrast, the bump mixture model performs inference and learning on probability distributions over continuous variables. The bump mixture model significantly extends previous results on floating-gate circuits [4]. Our system is a fully realized floating-gate learning algorithm that can be used for vector quantization, probability estimation, clustering, and classification. Finally, the mixture model's architecture is similar to many previous VLSI vector quantizers [8, 9]. We can view the bump mixture model as a VLSI vector quantizer with well-defined probabilistic semantics.
Computations such as probability estimation and maximum-likelihood classification have a natural statistical interpretation under the mixture model. In addition, because we rely on floating-gate devices, the mixture model does not require a refresh mechanism, unlike previous learning VLSI quantizers.

2 The adaptive bump circuit

The adaptive bump circuit [4], depicted in Fig.1(a-b), forms the basis of the bump mixture model. This circuit is slightly different from previous versions reported in the literature. Nevertheless, the high-level functionality remains the same; the adaptive bump circuit computes the similarity between a stored variable and an input, and adapts to increase the similarity between the stored variable and the input. Fig.1(a) shows the computation portion of the circuit. The bump circuit takes as input a differential voltage signal (+Vin, −Vin) around a DC bias, and computes the similarity between Vin and a stored value, µ. We represent the stored memory µ as a voltage:

µ = (Vw− − Vw+) / 2    (1)

where Vw+ and Vw− are the gate-offset voltages stored on capacitors C1 and C2. Because C1 and C2 isolate the gates of transistors M1 and M2 respectively, these transistors are floating-gate devices. Consequently, the stored voltages Vw+ and Vw− are nonvolatile. We can express the floating-gate voltages Vfg1 and Vfg2 as Vfg1 = Vin + Vw+ and Vfg2 = Vw− − Vin, and the output of the bump circuit as [10]:

Iout = Ib / cosh^2((4κ / (S Ut)) (Vfg1 − Vfg2)) = Ib / cosh^2((8κ / (S Ut)) (Vin − µ))    (2)

where Ib is the bias current, κ is the gate-coupling coefficient, Ut is the thermal voltage, and S depends on the transistor sizes. Fig.1(c) shows Iout for three different stored values of µ. As the data show, different µ's shift the location of the peak response of the circuit.
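The 1/cosh^2-shaped response in eq. (2) is easy to sanity-check in software. The device constants below (κ, Ut, S, Ib) are illustrative placeholders, not values from the fabricated chip:

```python
import numpy as np

def bump_response(v_in, mu, i_b=10e-9, kappa=0.7, u_t=0.0258, s=1.0):
    # Eq. (2): output current of the bump circuit, a 1/cosh^2
    # similarity between the input voltage and the stored mean mu.
    return i_b / np.cosh((8 * kappa / (s * u_t)) * (v_in - mu)) ** 2
```

The response peaks at v_in = mu with value i_b and falls off symmetrically on either side, which is the Gaussian-like shape the text later equates with P(Vin | µ).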
[Figure 1 here: circuit schematics and measured transfer function; only the caption is reproduced.] Figure 1. (a-b) The adaptive bump circuit. (a) The original bump circuit augmented by capacitors C1 and C2, and cascode transistors (driven by Vcasc). (b) The adaptation subcircuit. M3 and M4 control injection on the floating-gates and M5 and M6 control tunneling. (c) Measured output current of a bump circuit for three programmed memories.

Fig.1(b) shows the circuit that implements learning in the adaptive bump circuit. We implement learning through Fowler-Nordheim tunneling [11] on tunneling junctions M5-M6 and hot-electron injection [12] on the floating-gate transistors M3-M4. Transistors M3 and M5 control injection and tunneling on M1's floating-gate. Transistors M4 and M6 control injection and tunneling on M2's floating-gate. We activate tunneling and injection by a high Vtun and a low Vinj respectively. In the adaptive bump circuit, both processes increase the similarity between Vin and µ. In addition, the magnitude of the update does not depend on the sign of (Vin − µ) because the differential input provides common-mode rejection to the input differential pair. The similarity function, as seen in Fig.1(c), has a Gaussian-like shape. Consequently, we can equate the output current of the bump circuit with the probability of the input under a distribution parameterized by the mean µ:

P(Vin | µ) = Iout    (3)

In addition, increasing the similarity between Vin and µ is equivalent to increasing P(Vin | µ). Consequently, the adaptive bump circuit adapts to maximize the likelihood of the present input under the circuit's probability distribution.

3 The bump mixture model

We now describe the computations and learning rule implemented by the bump mixture model.
A mixture model is a general class of statistical models that approximates the probability of an analog input as the weighted sum of the probabilities of the input under several simple distributions. The bump mixture model comprises a set of Gaussian-like probability density functions, each parameterized by a mean vector, µi. Denoting the jth dimension of the mean of the ith density as µij, we express the probability of an input vector x as:

P(x) = (1/N) Σi P(x | i) = (1/N) Σi Πj P(xj | µij)    (4)

where N is the number of densities in the model and i denotes the ith density. P(x|i) is the product of one-dimensional densities P(xj|µij) that depend on the jth dimension of the ith mean, µij. We derive each one-dimensional probability distribution from the output current of a single bump circuit. The bump mixture model makes two assumptions: (1) the component densities are equally likely, and (2) within each component density, the input dimensions are independent and have equal variance. Despite these restrictions, this mixture model can, in principle, approximate any probability density function [1]. The bump mixture model adapts all µi to maximize the likelihood of the training data. Learning in the bump mixture model is based on the E-M algorithm, the standard algorithm for training Gaussian mixture models. The E-M algorithm comprises two steps. The E-step computes the conditional probability of each density given the input, P(i|x). The M-step updates the parameters of each distribution to increase the likelihood of the data, using P(i|x) to scale the magnitude of each parameter update. In the online setting, the learning rule is:

∆µij = η P(i | x) ∂ log P(xj | µij) / ∂µij = η [P(x | i) / Σk P(x | k)] ∂ log P(xj | µij) / ∂µij    (5)

where η is a learning rate and k indexes the component densities.
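A numerical sketch of eqs. (4)-(5), assuming the one-dimensional densities are Gaussian bumps with a shared variance; σ, η and the data below are illustrative values, not chip parameters:

```python
import numpy as np

def mixture_update(x, mus, sigma=0.5, eta=0.1):
    # One online E-M-style step for an equal-weight, shared-variance
    # Gaussian mixture. mus is an (N, d) array of component means.
    # log P(x|i) = sum_j log P(x_j | mu_ij)  (Gaussian bumps)
    log_p_xi = -np.sum((x - mus) ** 2, axis=1) / (2 * sigma**2)
    log_p_xi -= mus.shape[1] * np.log(np.sqrt(2 * np.pi) * sigma)
    p_xi = np.exp(log_p_xi)
    p_x = p_xi.mean()                 # eq. (4): equal priors 1/N
    resp = p_xi / p_xi.sum()          # E-step: P(i|x) as in eq. (5)
    # M-step: move each mean toward x, scaled by its responsibility
    new_mus = mus + eta * resp[:, None] * (x - mus)
    return p_x, new_mus
```

Each mean drifts toward the input in proportion to P(i|x), which is the modulation the circuit realizes by gating injection and tunneling with the conditional probability.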
Because the adaptive bump circuit already adapts to increase the likelihood of the present input, we approximate E-M by modulating injection and tunneling in the adaptive bump circuit by the conditional probability:

∆µij = η P(i | x) f(xj − µij)    (6)

where f() is the parameter update implemented by the bump circuit. We can modulate the learning update in (6) with other competitive factors instead of the conditional probability to implement a variety of learning rules such as online K-means.

4 Silicon implementation

We now describe a VLSI system that implements the silicon mixture model. The high-level organization of the system, detailed in Fig.2, is similar to VLSI vector quantization systems. The heart of the mixture model is a matrix of adaptive bump circuits where the ith row of bump circuits corresponds to the ith component density. In addition, the periphery of the matrix comprises a set of inhibitory circuits for performing probability estimation, inference, quantization, and generating feedback for learning. We send each dimension of an input x down a single column. Unity-gain inverting amplifiers (not pictured) at the boundary of the matrix convert each single-ended voltage input into a differential signal. Each bump circuit computes a current that represents (P(xj|µij))^σ, where σ is the common variance of the one-dimensional densities. The mixture model computes P(x|i) along the ith row and inhibitory circuits perform inference, estimation, or quantization. We utilize translinear devices [3] to perform all of these computations. Translinear devices, such as the subthreshold MOSFET and the bipolar transistor, exhibit an exponential relationship between the gate voltage and source current. This property allows us to establish a power-law relationship between currents and probabilities (i.e. a linear relationship between gate voltages and log-probabilities).
[Figure 2 here: architecture schematic; only the caption is reproduced.] Figure 2. Bump mixture model architecture. The system comprises a matrix of adaptive bump circuits where each row computes the probability P(x|µi). Inhibitory circuits transform the output of each row into system outputs. Spike generators also transform inhibitory circuit outputs into rate-coded feedback for learning.

We compute the multiplication of the probabilities in each row of Fig.2 as addition in the log domain using the circuit in Fig.3(a). This circuit first converts each bump circuit's current into a voltage using a diode (e.g. M1). M2's capacitive divider computes Vavg as the average of the scalar log probabilities, log P(xj|µij):

Vavg = (σ / N) Σj log P(xj | µij)    (7)

where σ is the variance, N is the number of input dimensions, and voltages are in units of κ/Ut (Ut is the thermal voltage and κ is the transistor-gate coupling coefficient). Transistors M2-M5 mirror Vavg to the gate of M5. We define the drain voltage of M5 as log P(x|i) (up to an additive constant) and compute:

log P(x | i) = ((C1 + C2) / C1) Vavg = ((C1 + C2) σ / (C1 N)) Σj log P(xj | µij) + k    (8)

where k is a constant dependent on Vg (the control gate voltage on M5), and C1 and C2 are capacitances. From eq.8 we can derive the variance as:

σ = N C1 / (C1 + C2)    (9)

The system computes different output functions and feedback signals for learning by operating on the log probabilities of eq.8. Fig.3(b) demonstrates a circuit that computes P(i|x) for each distribution. The circuit is a k-input differential pair where the bias transistor M0 normalizes the currents representing the probabilities P(x|i) at the ith leg. Fig.3(c) demonstrates a circuit that computes P(x). The ith transistor exponentiates log P(x|i), and a single wire sums the currents.
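In software form, the peripheral computations of eqs. (7)-(8) and Fig.3 amount to a log-domain sum followed by normalisation. This is a functional sketch of what the circuits compute, not a circuit model:

```python
import numpy as np

def row_outputs(log_p):
    # log_p: (N, d) array of per-dimension log-probabilities
    # log P(x_j | mu_ij), one row per component density.
    log_p_xi = log_p.sum(axis=1)         # eq. (8), up to a constant
    p_xi = np.exp(log_p_xi)
    p_i_given_x = p_xi / p_xi.sum()      # Fig. 3(b): normalising differential pair
    p_x = p_xi.mean()                    # Fig. 3(c): summed currents, equal priors
    winner = int(np.argmax(log_p_xi))    # WTA output used for quantization
    return p_i_given_x, p_x, winner
```

The winner index corresponds to the WTA circuit's output, which the fabricated chip uses both for vector quantization and as feedback for the K-means learning rule.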
We can also apply other inhibitory circuits to the log probabilities, such as winner-take-all (WTA) circuits [13] and resistive networks [14]. In our fabricated chip, we implemented probability estimation, conditional probability computation, and WTA. The WTA outputs the index of the most likely component distribution for the present input, and can be used to implement vector quantization and to produce feedback for an online K-means learning rule. At each synapse, the system combines a feedback signal, such as the conditional probability P(i|x) computed at the matrix periphery, with the adaptive bump circuit to implement learning. We trigger adaptation at each bump circuit by a rate-coded spike signal generated from the inhibitory circuit's current outputs. We generate this spike train with a current-to-spike converter based on Lazzaro's low-powered spiking neuron [15]. This rate-coded signal toggles Vtun and Vinj at each bump circuit. Consequently, adaptation is proportional to the frequency of the spike train, which is in turn a linear function of the inhibitory feedback signal. The alternative to the rate code would be to transform the inhibitory circuit's output directly into analog Vtun and Vinj signals. Because injection and tunneling are highly nonlinear functions of Vinj and Vtun respectively, implementing updates that are linear in the inhibitory feedback signal is quite difficult using this approach.

[Figure 3 here: circuit schematics; only the caption is reproduced.] Figure 3. (a) Circuit for computing log P(x|i). (b) Circuit for computing P(i|x). The current through the ith leg represents P(i|x). (c) Circuit for computing P(x).
5 Experimental results and conclusions

We fabricated an 8 x 8 mixture model (8 probability distribution functions with 8 dimensions each) in a TSMC 0.35µm CMOS process available through MOSIS, and tested the chip on synthetic data and a handwritten digits dataset. In our tests, we found that due to a design error, one of the input dimensions coupled to the other inputs. Consequently, we held that input fixed throughout the tests, effectively reducing the input to 7 dimensions. In addition, we found that the learning rule in eq.6 produced poor performance because the variance of the bump distributions was too large. Consequently, in our learning experiments, we used the hard winner-take-all circuit to control adaptation, resulting in a K-means learning rule. We trained the chip to perform different tasks on handwritten digits from the MNIST dataset [16]. To prepare the data, we first perform PCA to reduce the 784-pixel images to seven-dimensional vectors, and then send the data on-chip. We first tested the circuit on clustering handwritten digits. We trained the chip on 1000 examples of each of the digits 1-8. Fig.4(a) shows reconstructions of the eight means before and after training. We compute each reconstruction by multiplying the means by the seven principal eigenvectors of the dataset. The data show that the means diverge to associate with different digits. The chip learns to associate most digits with a single probability distribution. The lone exception is digit 5, which doesn't clearly associate with one distribution. We speculate that the reason is that 3's, 5's, and 8's are very similar in our training data's seven-dimensional representation. Gaussian mixture models trained with the E-M algorithm also demonstrate similar results, recovering only seven out of the eight digits. We next evaluated the same learned means on vector quantization of a set of test digits (4400 examples of each digit).
We compare the chip's learned means with means learned by the batch E-M algorithm on mixtures of Gaussians (with σ=0.01), a mismatch E-M algorithm that models chip nonidealities, and a non-adaptive baseline quantizer. The purpose of the mismatch E-M algorithm was to assess the effect of nonuniform injection and tunneling strengths in floating-gate transistors. Because tunneling and injection magnitudes can vary by a large amount on different floating-gate transistors, the adaptive bump circuits can learn a mean that is somewhat off-center. We measured the offset of each bump circuit when adapting to a constant input and constructed the mismatch E-M algorithm by altering the learned means during the M-step by the measured offset. We constructed the baseline quantizer by selecting, at random, an example of each digit for the quantizer codebook. For each quantizer, we computed the reconstruction error on the digit's seven-dimensional representation when we represent each test digit by the closest mean.

[Figure 4 here: experimental plots; only the caption is reproduced.] Figure 4. (a) Reconstruction of chip means before and after training with handwritten digits. (b) Comparison of average quantization error on unseen handwritten digits, for the chip's learned means and mixture models trained by standard algorithms. (c) Plot of probability of unseen examples of 7's and 9's under two bump mixture models trained solely on each digit.

The results in Fig.4(b) show that for most of the digits the chip's learned means perform as well as the E-M algorithm, and better than the baseline quantizer in all cases. The one digit where the chip's performance is far from the E-M algorithm is the digit "1".
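The quantization comparison in Fig.4(b) reduces to a simple computation: represent each test vector by its closest codebook mean and average the squared error. A sketch, with illustrative names and data:

```python
import numpy as np

def avg_quantization_error(codebook, data):
    # Squared distance from every data vector to every codebook mean,
    # then the nearest-mean error averaged over the data set.
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()
```

Running this with the chip's learned means, the E-M means, and a random codebook would reproduce the style of comparison reported in the text.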
Upon examination of the E-M algorithm’s results, we found that it associated two means with the digit “1”, where the chip allocated two means for the digit “3”. Over all the digits, the E-M algorithm exhibited a quantization error of 9.98, mismatch E-M gives a quantization error of 10.9, the chip’s error was 11.6, and the baseline quantizer’s error was 15.97. The data show that mismatch is a significant factor in the difference between the bump mixture model’s performance and the E-M algorithm’s performance in quantization tasks. Finally, we use the mixture model to classify handwritten digits. If we train a separate mixture model for each class of data, we can classify an input by comparing the probabilities of the input under each model. In our experiment, we train two separate mixture models: one on examples of the digit 7, and the other on examples of the digit 9. We then apply both mixtures to a set of unseen examples of digits 7 and 9, and record the probability score of each unseen example under each mixture model. We plot the resulting data in Fig.4(c). Each axis represents the probability under a different class. The data show that the model probabilities provide a good metric for classification. Assigning each test example to the class model that outputs the highest probability results in an accuracy of 87% on 2000 unseen digits. Additional software experiments show that mixtures of Gaussians (σ=0.01) trained by the batch E-M algorithm provide an accuracy of 92.39% on this task. Our test results show that the bump mixture model’s performance on several learning tasks is comparable to standard mixtures of Gaussians trained by E-M. These experiments give further evidence that floating-gate circuits can be used to build effective learning systems even though their learning rules derive from silicon physics instead of statistical methods. 
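The 7-vs-9 experiment classifies by comparing the likelihood of an input under two separately trained mixtures. A software sketch assuming equal-weight Gaussian bumps with a shared variance; σ and the toy means are illustrative:

```python
import numpy as np

def log_likelihood(x, mus, sigma=0.5):
    # log P(x) under an equal-weight mixture with means mus,
    # computed with a log-sum-exp for numerical stability.
    log_p_xi = -((x - mus) ** 2).sum(axis=1) / (2 * sigma**2)
    m = log_p_xi.max()
    return m + np.log(np.exp(log_p_xi - m).mean())

def classify(x, model_a, model_b):
    # Assign x to whichever model gives it the higher probability,
    # mirroring the comparison plotted in Fig. 4(c).
    return 'a' if log_likelihood(x, model_a) >= log_likelihood(x, model_b) else 'b'
```

Thresholding the two scores against each other corresponds to assigning each test example to the class model with the highest output probability, as in the reported 87% accuracy figure.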
The bump mixture model also represents a basic building block that we can use to build more complex silicon probability models over analog variables. This work can be extended in several ways. We can build distributions that have parameterized covariances in addition to means. In addition, we can build more complex, adaptive probability distributions in silicon by combining the bump mixture model with silicon probability models over discrete variables [5-7] and spike-based floating-gate learning circuits [4].

Acknowledgments

This work was supported by NSF under grants BES 9720353 and ECS 9733425, and by Packard Foundation and Sloan Fellowships.

References

[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Clarendon Press, 1995.
[2] L. R. Rabiner,
4 0.68441743 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping's relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions, and thereby the computational complexity, is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(10^3 - 10^4) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database, is realistic.
5 0.63965553 119 nips-2002-Kernel Dependency Estimation
Author: Jason Weston, Olivier Chapelle, Vladimir Vapnik, André Elisseeff, Bernhard Schölkopf
Abstract: We consider the learning problem of finding a dependency between a general class of objects and another, possibly different, general class of objects. The objects can be for example: vectors, images, strings, trees or graphs. Such a task is made possible by employing similarity measures in both input and output spaces using kernel functions, thus embedding the objects into vector spaces. We experimentally validate our approach on several tasks: mapping strings to strings, pattern recognition, and reconstruction from partial images. 1
6 0.62327588 85 nips-2002-Fast Kernels for String and Tree Matching
7 0.62153715 125 nips-2002-Learning Semantic Similarity
8 0.62053561 190 nips-2002-Stochastic Neighbor Embedding
9 0.61972606 169 nips-2002-Real-Time Particle Filters
10 0.61960161 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
11 0.61629909 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.61430883 3 nips-2002-A Convergent Form of Approximate Policy Iteration
13 0.61275536 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
14 0.61125338 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
15 0.60919011 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
16 0.60881305 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
17 0.607557 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
18 0.60458755 46 nips-2002-Boosting Density Estimation
19 0.60348076 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
20 0.60314274 124 nips-2002-Learning Graphical Models with Mercer Kernels