nips nips2002 nips2002-125 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. [sent-5, score-1.321]
2 Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. [sent-6, score-1.241]
3 In this paper we propose two methods for inferring such similarity from a corpus. [sent-7, score-0.232]
4 The first one defines word-similarity based on document-similarity and vice versa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. [sent-8, score-0.807]
5 The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. [sent-9, score-1.106]
6 Both approaches produce valid kernel functions parametrised by a real number. [sent-10, score-0.349]
7 The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. [sent-11, score-0.213]
8 This matches very naturally the standard representation used in text retrieval, known as the 'vector space model', where the similarity of two documents is given by the inner product between high dimensional vectors indexed by all the terms present in the corpus. [sent-14, score-0.671]
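As a concrete illustration of the vector space model described above, here is a minimal numpy sketch; the tiny term-document matrix and its entries are invented purely for illustration and are not taken from the paper.

import numpy as np

# Rows of D are documents, columns are terms, entries are (weighted) term counts.
D = np.array([
    [2, 1, 0, 0],   # doc 1 uses terms 0 and 1
    [0, 1, 3, 0],   # doc 2 uses terms 1 and 2
    [0, 0, 0, 4],   # doc 3 uses only term 3
], dtype=float)

# Bag-of-words kernel: the similarity of two documents is the inner
# product of their term vectors, i.e. K = D D'.
K = D @ D.T
print(K)   # doc 1 and doc 3 share no terms, so K[0, 2] == 0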
9 However, such an approach suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms: documents sharing terms that are different but semantically related will be considered as unrelated. [sent-16, score-1.186]
10 A number of attempts have been made to incorporate semantic knowledge into the vector space representation. [sent-17, score-0.504]
11 Semantic networks have been considered [9], whilst others use co-occurrence analysis where a semantic relation is assumed between terms whose occurrence patterns in the documents of the corpus are correlated [3]. [sent-18, score-0.913]
12 Such methods are also limited in their flexibility, and the question of how to infer semantic relations between terms or documents from a corpus remains an open issue. [sent-19, score-0.963]
13 Section 2 provides an introduction to how semantic similarity can be introduced into the vector space model. [sent-22, score-0.677]
14 Section 3 derives a parametrised class of semantic proximity matrices from a recursive definition of similarity of terms and documents. [sent-23, score-1.186]
15 A further parametrised class of kernels based on alternative similarity measures inspired by considering diffusion on a weighted graph of documents is given in Section 4. [sent-24, score-1.043]
16 In Section 5 we show how the recently introduced alignment measure [2] can be used to perform model selection over the classes of kernels we have defined. [sent-25, score-0.335]
17 In the vector space model, a document is represented by a vector indexed by the terms of the corpus. [sent-28, score-0.143]
18 Two documents that use semantically related but distinct words will therefore show no similarity. [sent-30, score-0.392]
19 The aim of a semantic proximity matrix [3] is to correct for this by indicating the strength of the relationship between terms that even though distinct are semantically related. [sent-31, score-0.847]
20 The semantic proximity matrix P is indexed by pairs of terms a and b, with the entry Pab = Pba giving the strength of their semantic similarity. [sent-32, score-1.418]
21 If the vectors corresponding to two documents are d_i, d_j, their inner product is now evaluated through the kernel k(d_i, d_j) = d_i' P d_j, where x' denotes the transpose of the vector or matrix x. [sent-33, score-0.633]
22 The symmetry of P ensures that the kernel is symmetric. [sent-34, score-0.285]
23 In this case we can decompose P = R'R for some matrix R, so that we can view the semantic similarity as a projection into a semantic space, φ: d ↦ Rd, since k(d_i, d_j) = d_i' P d_j = ⟨Rd_i, Rd_j⟩. [sent-36, score-1.29]
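The following Python sketch illustrates the proximity-matrix kernel k(d_i, d_j) = d_i' P d_j and the factorisation P = R'R discussed above; the 4-term vocabulary and the 0.7 weight linking two terms are hypothetical values chosen only for illustration.

import numpy as np

# Hypothetical proximity matrix: terms 2 and 3 are semantically related.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7],
    [0.0, 0.0, 0.7, 1.0],
])

d_i = np.array([0.0, 0.0, 1.0, 0.0])   # document using term 2
d_j = np.array([0.0, 0.0, 0.0, 1.0])   # document using term 3

# The plain bag-of-words inner product sees no similarity ...
print(d_i @ d_j)          # 0.0
# ... the proximity kernel k(d_i, d_j) = d_i' P d_j does.
print(d_i @ P @ d_j)      # 0.7

# If P is positive definite we can write P = R'R and view the kernel
# as an inner product in a "semantic space" d -> R d.
R = np.linalg.cholesky(P).T     # one choice of R with R'R = P
print((R @ d_i) @ (R @ d_j))    # same value, 0.7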
24 The purpose of this paper is to infer (or refine) the similarity measure between examples by taking into account higher order correlations, thereby performing unsupervised learning of the proximity matrix from a given corpus. [sent-37, score-0.444]
25 The first method exploits the fact that the standard representation of text documents as bags of words gives rise to an interesting duality: while documents can be seen as bags of words, simultaneously terms can be viewed as bags of documents - the documents that contain them. [sent-39, score-1.559]
26 In such a model, two documents that have highly correlated term-vectors are considered as having similar content. [sent-40, score-0.308]
27 Similarly, two terms whose document-vectors are correlated will have a semantic relation. [sent-41, score-0.582]
28 We show that it is possible to define term-similarity based on document-similarity, and vice versa, to obtain a system of equations that can be solved in order to obtain a semantic proximity matrix P. [sent-43, score-0.826]
29 The second method exploits the representation of a lexicon (the set of all words in a given corpus) as a graph, where the nodes are indexed by words and where co-occurrence is used to establish links between nodes. [sent-44, score-0.405]
30 We consider the idea that higher order correlations between terms can affect their semantic relations as a diffusion process on such a graph. [sent-46, score-1.029]
31 Although there can be exponentially many paths connecting two given nodes in the graph, the use of diffusion kernels [7] enables us to obtain the level of semantic relation between any two nodes efficiently, so inferring the semantic proximity matrix from data. [sent-47, score-1.989]
32 Here the aim is to create recursive equations for the relations between documents and between terms. [sent-49, score-0.407]
33 Let X be the feature-example matrix (term-document in the case of text data), possibly in a kernel-defined feature space, so that X'X gives the kernel matrix K and XX' gives the correlations between different features over the training set. [sent-50, score-0.592]
34 Consider the similarity matrices defined recursively by K̂ = λ X'ĜX + K and Ĝ = λ XK̂X' + G. (1) [sent-52, score-0.233]
35 We can interpret this as augmenting the similarity given by K through indirect similarities measured by G, and vice versa. [sent-54, score-0.274]
36 Clearly, by the symmetry of the definition, the expression for Ĝ also satisfies an analogous recurrence. [sent-88, score-0.223]
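A small numpy sketch of the coupled equations (1), assuming they are simply iterated to their equilibrium point; the random matrix X and the choice of λ (kept below 1/||K|| so that the iteration settles) are illustrative only, not the authors' setup.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 8))                 # features x examples (terms x documents)
K = X.T @ X                            # document kernel K = X'X
G = X @ X.T                            # feature correlations G = XX'

lam = 0.5 / np.linalg.norm(K, 2)       # keep lam * ||K|| < 1 so the updates converge

K_hat, G_hat = K.copy(), G.copy()
for _ in range(200):
    K_prev = K_hat
    K_hat = lam * X.T @ G_hat @ X + K  # first equation of (1)
    G_hat = lam * X @ K_hat @ X.T + G  # second equation of (1)

print(np.linalg.norm(K_hat - K_prev))  # ~0: the coupled system has reached its equilibrium point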
37 In view of the form of the solution we introduce the following definition. Definition 2 von Neumann Kernel Given a kernel K, the derived kernel is K̂(λ) = K(I − λK)^{-1}. [sent-89, score-0.632]
38 Note that we can view K̂(λ) as a kernel based on the semantic proximity matrix P = I + λĜ, since at the equilibrium point K̂ = λX'ĜX + K = X'(I + λĜ)X. [sent-92, score-1.032]
39 Hence, the solution defines a refined similarity between terms/features. [sent-99, score-0.207]
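Assuming the closed form K̂(λ) = K(I − λK)^{-1} reconstructed in Definition 2 above, the following sketch checks it against the geometric series Σ_{t≥1} λ^{t-1} K^t on a random kernel matrix; it is an illustration under those assumptions, not the authors' code.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 8))                   # features x examples
K = X.T @ X                              # base kernel, X'X
lam = 0.5 / np.linalg.norm(K, 2)         # need lam * ||K|| < 1 for the series to converge

# von Neumann kernel in closed form.
n = K.shape[0]
K_vn = K @ np.linalg.inv(np.eye(n) - lam * K)

# Equivalently, the truncated geometric series sum_{t>=1} lam^(t-1) K^t.
K_series = sum(lam ** (t - 1) * np.linalg.matrix_power(K, t) for t in range(1, 60))
print(np.allclose(K_vn, K_series, atol=1e-6))   # True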
40 In the next section, we will consider the second method of introducing semantic similarity derived from viewing the terms and documents as vertices of a weighted graph. [sent-100, score-0.98]
41 In the case of language, the topological structure of a lexicon graph has recently been analyzed [4]. [sent-102, score-0.165]
42 Such a graph has nodes indexed by all the terms in the corpus, and the edges are given by the co-occurrence between terms in documents of the corpus. [sent-103, score-0.561]
43 A diffusion process on the graph can also be considered as a model of semantic relations existing between indirectly connected terms. [sent-105, score-1.049]
44 Although the number of possible paths between any two given nodes can grow exponentially, results from spectral graph theory have been recently used by [7] to show that it is possible to compute the similarity between any two given nodes efficiently without examining all possible paths. [sent-106, score-0.511]
45 It is also possible to show that the similarity measure obtained in this way is a valid kernel function. [sent-107, score-0.491]
46 The exponentiation operation used in the definition naturally yields the Mercer conditions required for valid kernel functions. [sent-108, score-0.288]
47 An alternative insight into semantic similarity, to that presented in Section 2, is afforded if we multiply out the expression for the derived kernel in powers of the kernel matrix K. [sent-109, score-0.530]
48 The (i, j) entry of K^t is the sum over all vertex sequences u_1 = i, u_2, ..., u_{t+1} = j of the products Π_{ℓ=1}^{t} K_{u_ℓ u_{ℓ+1}}, that is, the sum of the products of the weights over all paths of length t that start at vertex i and finish at vertex j in the weighted graph on the examples. [sent-121, score-0.313]
49 If the entries are all positive and, for each vertex, the sum of the connections is 1, we can view this entry as the probability that a random walk beginning at vertex i is at vertex j after t steps. [sent-123, score-0.295]
50 It is for these reasons that the kernels defined using these combinations of powers of the kernel matrix have been termed diffusion kernels [7]. [sent-124, score-0.972]
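The path-counting interpretation above can be checked numerically: on a tiny 4-node weighted graph, brute-force enumeration of all length-t paths gives the same number as the (i, j) entry of K^t. The graph weights below are random and purely illustrative.

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
K = rng.random((4, 4))
K = (K + K.T) / 2                 # symmetric weight matrix on a 4-node graph

t, i, j = 3, 0, 2                 # paths of length 3 from node 0 to node 2

# Brute force: sum the products of edge weights over every path
# i = u_1, ..., u_{t+1} = j (exponentially many in general).
brute = 0.0
for mid in product(range(4), repeat=t - 1):
    path = (i, *mid, j)
    w = 1.0
    for a, b in zip(path[:-1], path[1:]):
        w *= K[a, b]
    brute += w

# Matrix powers give the same number without enumerating the paths.
print(np.allclose(brute, np.linalg.matrix_power(K, t)[i, j]))   # True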
51 Hence, examples that both lie in a cluster of similar examples become more strongly related, and similar features that occur in a cluster of related features are drawn together in the semantic proximity matrix P. [sent-126, score-0.745]
52 We should stress that the emphasis of this work is not on its diffusion connections, but on its relation to semantic proximity. [sent-127, score-0.861]
53 The kernel K̄(λ) combines these indirect link kernels with an exponentially decaying weight, K̄(λ) = Σ_{t=1}^∞ (λ^{t-1}/(t-1)!) K^t = K exp(λK). [sent-129, score-0.445]
54 The next proposition gives the semantic proximity matrix corresponding to K̄(λ). [sent-139, score-0.848]
55 Then K̄(λ) corresponds to the semantic proximity matrix P = exp(λG). Proof: Let X = UΣV' be the singular value decomposition of X, so that K = VΛV' is the eigenvalue decomposition of K, where Λ = Σ'Σ. [sent-145, score-0.504]
56 The above leads to the definition of the second kernel that we consider. [sent-153, score-0.432]
57 Definition 4 Given a kernel K, the derived kernels K̄(λ) = K exp(λK) are referred to as the exponential kernels. [sent-154, score-0.382]
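A short sketch of the exponential kernel of Definition 4, computed both with scipy's matrix exponential and through the eigendecomposition of K; the base kernel and the value λ = 0.1 are arbitrary choices for illustration.

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
X = rng.random((5, 8))
K = X.T @ X                       # base kernel
lam = 0.1

# Exponential ("diffusion") kernel: K exp(lam K).
K_exp = K @ expm(lam * K)

# Same thing via the eigendecomposition K = V diag(mu) V':
# eigenvalues are mapped mu -> mu * exp(lam * mu), so positivity
# (the Mercer condition) is preserved automatically.
mu, V = np.linalg.eigh(K)
K_exp2 = (V * (mu * np.exp(lam * mu))) @ V.T
print(np.allclose(K_exp, K_exp2))   # True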
58 5 Experimental Methods In the previous sections we have introduced two new kernel adaptations, in both cases parameterized by a positive real parameter λ. [sent-158, score-0.260]
59 In order to apply these kernels on real text data, we need to develop a method of choosing the parameter λ. [sent-162, score-0.218]
60 Rather than adopt this expensive methodology we will use a quantitative measure of agreement between the diffusion kernels and the learning task known as alignment, which measures the degree of agreement between a kernel and target [2]. [sent-167, score-0.850]
61 Definition 5 Alignment The (empirical) alignment of a kernel k_1 with a kernel k_2 with respect to the sample S is the quantity A(S, k_1, k_2) = ⟨K_1, K_2⟩_F / √(⟨K_1, K_1⟩_F ⟨K_2, K_2⟩_F), [sent-168, score-0.703]
62 where K_i is the kernel matrix for the sample S using kernel k_i. [sent-169, score-0.602]
63 where we use the following definition of the inner product between Gram matrices, ⟨K_1, K_2⟩_F = Σ_{i,j=1}^m K_1(x_i, x_j) K_2(x_i, x_j). [sent-170, score-0.277]
64 Taking K_2 = yy' for the label vector y gives the alignment to the target, A(S, K, yy') = ⟨K, yy'⟩_F / √(⟨K, K⟩_F ⟨yy', yy'⟩_F) = y'Ky / (m ||K||_F). (3) The alignment has been shown to possess several convenient properties [2]. [sent-175, score-0.273]
65 Most notably it can be efficiently computed before any training of the kernel machine takes place, and based only on training data information; and since it is sharply concentrated around its expected value, its empirical value is stable with respect to different splits of the data. [sent-176, score-0.424]
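The alignment of Definition 5, and its specialisation (3) to the target yy', can be computed in a few lines of numpy; the random kernel and labels below are placeholders for real data.

import numpy as np

def alignment(K1, K2):
    """Empirical alignment between two Gram matrices (Definition 5)."""
    num = np.sum(K1 * K2)                               # <K1, K2>_F
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den

rng = np.random.default_rng(0)
X = rng.random((5, 20))
K = X.T @ X
y = rng.choice([-1.0, 1.0], size=20)

# Alignment to the target kernel yy' reduces to y'Ky / (m ||K||_F), as in (3).
A = alignment(K, np.outer(y, y))
print(np.isclose(A, (y @ K @ y) / (len(y) * np.linalg.norm(K, 'fro'))))  # True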
66 These properties motivate choosing λ to optimize the alignment of the resulting matrix K(λ). [sent-180, score-0.265]
67 We seek the optimal value using a line search over the range of possible values of λ, for the value at which the derivative of the alignment with respect to λ is zero. [sent-188, score-0.240]
68 Proposition 6 If λ* is the solution of λ* = argmax_λ A(S, K̄(λ), yy') and v_i, λ_i are the eigenvector/eigenvalue pairs of the kernel matrix K, then Σ_{i=1}^m λ_i^2 exp(λ*λ_i)⟨v_i, y⟩^2 Σ_{i=1}^m λ_i^2 exp(2λ*λ_i) = Σ_{i=1}^m λ_i exp(λ*λ_i)⟨v_i, y⟩^2 Σ_{i=1}^m λ_i^3 exp(2λ*λ_i). Proof: First observe that K̄(λ) = VMV' = Σ_{i=1}^m λ_i exp(λλ_i) v_i v_i'. [sent-190, score-0.342]
69 We can then express the alignment of K̄(λ) as A(S, K̄(λ), yy') = Σ_i λ_i exp(λλ_i)⟨v_i, y⟩^2 / (m √(Σ_i λ_i^2 exp(2λλ_i))), and setting its derivative with respect to λ to zero gives the stated condition. [sent-192, score-0.183]
70 Definition 8 Line Search Optimization of the alignment can take place by using a line search over the values of λ to find a maximum point of the alignment, seeking points at which the equations given in Propositions 6 and 7 hold. [sent-200, score-0.492]
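As a simple stand-in for the derivative-based line search of Proposition 6 and Definition 8, the sketch below evaluates the alignment of K exp(λK) to yy' on a grid of λ values using the eigendecomposition of K and keeps the maximiser; the data and the grid range are synthetic assumptions.

import numpy as np

def exp_kernel_alignment(lam, mu, vy2, m):
    """Alignment of K exp(lam K) to yy', written in the eigenbasis of K.

    mu  : eigenvalues of K
    vy2 : squared projections <v_i, y>^2 of the labels on the eigenvectors
    """
    w = mu * np.exp(lam * mu)                 # eigenvalues of K exp(lam K)
    return np.dot(w, vy2) / (m * np.sqrt(np.sum(w ** 2)))

rng = np.random.default_rng(0)
X = rng.random((5, 20))
K = X.T @ X
y = rng.choice([-1.0, 1.0], size=20)

mu, V = np.linalg.eigh(K)
vy2 = (V.T @ y) ** 2
m = len(y)

# Grid evaluation in place of the paper's line search on the derivative.
grid = np.linspace(0.0, 1.0 / np.abs(mu).max(), 200)
scores = [exp_kernel_alignment(l, mu, vy2, m) for l in grid]
lam_star = grid[int(np.argmax(scores))]
print(lam_star)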
71 1 Results To demonstrate the performance of the proposed algorithm for text data, the Medline1033 dataset commonly used in text processing [3] was used. [sent-202, score-0.237]
72 This dataset contains 1033 documents and 30 queries obtained from the national library of medicine. [sent-203, score-0.297]
73 Stop words and punctuation were removed from the documents and the Porter stemmer was applied to the words. [sent-206, score-0.341]
74 The terms in the documents were weighted according to a variant of the tfidf scheme. [sent-207, score-0.303]
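A rough approximation of this preprocessing pipeline using scikit-learn's tf-idf vectoriser; the actual Medline documents, stop-word list, Porter-stemming step and tf-idf variant are not reproduced, and the three toy documents below are invented.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "effects of the drug on blood pressure",
    "blood pressure measurement in patients",
    "gene expression in tumour cells",
]

# One variant of tf-idf weighting with English stop words removed;
# stemming would need an extra tokenizer hooked into the vectoriser.
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs).toarray()      # documents x terms

K = X @ X.T                                # bag-of-words kernel matrix
print(np.round(K, 2))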
75 A support vector classifier (SVC) was used to assess the performance of the derived kernels on the Medline dataset. [sent-209, score-0.149]
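Any of the derived Gram matrices can be plugged into an SVM as a precomputed kernel; the sketch below does this with scikit-learn's SVC on synthetic data, which stands in for the Medline documents, labels and train/test split.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((40, 10))                      # 40 "documents", 10 features
y = rng.choice([0, 1], size=40)
lam = 0.05

K = X @ X.T
mu, V = np.linalg.eigh(K)
K_exp = (V * (mu * np.exp(lam * mu))) @ V.T   # exponential kernel on all examples

train, test = np.arange(0, 30), np.arange(30, 40)

# Precomputed-kernel SVM: rows index the examples being classified,
# columns index the training examples.
clf = SVC(kernel="precomputed")
clf.fit(K_exp[np.ix_(train, train)], y[train])
pred = clf.predict(K_exp[np.ix_(test, train)])
print(np.mean(pred == y[test]))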
76 Table 1: Medline dataset - mean and associated standard deviation of the alignment, F1 and SVC error values for an SVC trained using the Bag of Words kernel (B) and the exponential kernel (K). [sent-255, score-0.951]
77 Table 2: Medline dataset - mean and associated standard deviation of the alignment, F1 and SVC error values for an SVC trained using the Bag of Words kernel (B) and the von Neumann kernel (K). [sent-299, score-0.748]
78 Here P represents precision, i.e. a measure of the proportion of selected items that the system classified correctly, and R represents recall, i.e. the proportion of relevant items that the system selected; the two are combined in the F1 measure F1 = 2PR/(P + R). [sent-304, score-0.142]
79 The line search procedure was applied to find the optimal value of λ for the diffusion kernels. [sent-307, score-0.444]
80 Table 1 shows the results of using the Bag of Words kernel matrix (B) and the exponential kernel matrix (K). [sent-309, score-0.712]
81 Table 2 presents the results of using the von Neumann kernel matrix (K) together with the Bag of Words kernel matrix for different sizes of the training data. [sent-310, score-0.834]
82 The first column of Tables 1 and 2 shows the alignments of the Gram matrices to the rank-1 labels matrix yy' for different sizes of training data. [sent-312, score-0.203]
83 In both cases the results presented indicate that the alignment of the diffusion kernels to the labels is greater than that of the Bag of Words kernel matrix by more than the sum of the standard deviations across all sizes of training data. [sent-313, score-1.069]
84 The second column of the tables reports the support vector classifier (SVC) error obtained using the diffusion Gram matrices and the Bag of Words Gram matrix. [sent-314, score-0.449]
85 The SVC error for the diffusion kernels shows a decrease with increasing alignment value. [sent-315, score-0.841]
86 F1 values are also shown and in all instances show an improvement for the diffusion kernel matrices. [sent-316, score-0.617]
87 An interesting observation can be made regarding the F1 value for the von Neumann kernel matrix trained using 20% training data (K20). [sent-317, score-0.463]
88 Despite an increase in alignment value and a reduction of SVC error, the F1 value does not increase as much as that for the exponential kernel trained using the same proportion of the data (K20). [sent-318, score-0.688]
89 This observation implies that the diffusion kernel needs more data to be effective. [sent-319, score-0.617]
90 6 Conclusions We have proposed and compared two different methods to model the notion of semantic similarity between documents, by implicitly defining a proximity matrix P in a way that exploits high order correlations between terms. [sent-321, score-1.02]
91 In one view, we propose a recursive definition of document similarity that depends on term similarity and vice versa. [sent-323, score-0.627]
92 By solving the resulting system of kernel equations, we effectively learn the parameters of the model (P), and construct a kernel function for use in kernel based learning methods. [sent-324, score-0.78]
93 In the other approach, we model semantic relations as a diffusion process in a graph whose nodes are the documents and edges incorporate first-order similarity. [sent-325, score-1.341]
94 Diffusion efficiently takes into account all possible paths connecting two nodes, and propagates the 'similarity' between two remote documents that share 'similar terms'. [sent-326, score-0.374]
95 The kernel resulting from this model is known in the literature as the 'diffusion kernel'. [sent-327, score-0.26]
96 We have also shown how to choose the parameter λ in the kernels by optimising their 'alignment' to the target on the training set. [sent-330, score-0.189]
97 For the dataset partitions substantial improvements in performance over the traditional Bag of Words kernel matrix were obtained using the diffusion kernels and the line search method. [sent-331, score-0.923]
98 Despite this success, for large imbalanced datasets such as those encountered in text classification tasks the computational complexity of constructing the diffusion kernels may become prohibitive. [sent-332, score-0.575]
99 Faster kernel construction methods are being investigated for this regime. [sent-333, score-0.285]
100 Support vector machines based on a semantic kernel for text categorization. [sent-372, score-0.86]
wordName wordTfidf (topN-words)
[('semantic', 0.504), ('diffusion', 0.357), ('kernel', 0.26), ('documents', 0.252), ('alignment', 0.183), ('sve', 0.179), ('similarity', 0.173), ('definition', 0.172), ('proximity', 0.159), ('bag', 0.13), ('kernels', 0.122), ('neumann', 0.111), ('ai', 0.104), ('proposition', 0.103), ('svc', 0.102), ('text', 0.096), ('yy', 0.09), ('words', 0.089), ('von', 0.085), ('matrix', 0.082), ('bags', 0.081), ('relations', 0.081), ('graph', 0.078), ('medline', 0.077), ('vertex', 0.07), ('nodes', 0.069), ('gram', 0.068), ('jaz', 0.067), ('exp', 0.065), ('efficiently', 0.062), ('parametrised', 0.061), ('paths', 0.06), ('indexed', 0.06), ('lexicon', 0.057), ('mii', 0.051), ('pdj', 0.051), ('semantically', 0.051), ('terms', 0.051), ('corpus', 0.05), ('nello', 0.049), ('cristianini', 0.048), ('dataset', 0.045), ('retrieval', 0.043), ('vice', 0.042), ('exploits', 0.041), ('inability', 0.041), ('items', 0.04), ('percentage', 0.04), ('inner', 0.039), ('equations', 0.039), ('rd', 0.039), ('align', 0.038), ('proportion', 0.038), ('vi', 0.038), ('correlations', 0.036), ('training', 0.036), ('tf', 0.036), ('recursive', 0.035), ('products', 0.035), ('latent', 0.035), ('inferring', 0.034), ('defines', 0.034), ('indirect', 0.034), ('decay', 0.034), ('proof', 0.034), ('represents', 0.034), ('john', 0.034), ('suffers', 0.033), ('document', 0.032), ('entry', 0.032), ('matrices', 0.031), ('target', 0.031), ('rise', 0.031), ('search', 0.031), ('splits', 0.03), ('topological', 0.03), ('find', 0.03), ('measure', 0.03), ('exponentially', 0.029), ('defined', 0.029), ('df', 0.029), ('sizes', 0.029), ('considered', 0.029), ('exponential', 0.028), ('valid', 0.028), ('mercer', 0.027), ('classifier', 0.027), ('view', 0.027), ('correlated', 0.027), ('expression', 0.026), ('exploit', 0.026), ('giving', 0.026), ('entries', 0.026), ('mostly', 0.026), ('line', 0.026), ('similarities', 0.025), ('agreement', 0.025), ('symmetry', 0.025), ('methods', 0.025), ('table', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
2 0.31200531 113 nips-2002-Information Diffusion Kernels
Author: Guy Lebanon, John D. Lafferty
Abstract: A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with non-parametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.
3 0.30703244 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
Author: Alexei Vinokourov, Nello Cristianini, John Shawe-Taylor
Abstract: The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. Then we use such representation both in cross-language and in single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
4 0.2219605 163 nips-2002-Prediction and Semantic Association
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
5 0.21185796 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features . In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. 1
6 0.20953088 120 nips-2002-Kernel Design Using Boosting
7 0.1908783 156 nips-2002-On the Complexity of Learning the Kernel Matrix
8 0.16854346 176 nips-2002-Replay, Repair and Consolidation
9 0.16410361 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
10 0.16048867 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
11 0.15246451 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
12 0.143914 119 nips-2002-Kernel Dependency Estimation
13 0.13963388 106 nips-2002-Hyperkernels
14 0.13598892 115 nips-2002-Informed Projections
15 0.131008 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
16 0.12706454 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
17 0.12554422 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
18 0.11015797 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
19 0.10757072 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
20 0.10478497 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
topicId topicWeight
[(0, -0.254), (1, -0.174), (2, 0.146), (3, -0.166), (4, -0.422), (5, 0.017), (6, 0.141), (7, -0.005), (8, 0.078), (9, -0.177), (10, -0.215), (11, -0.035), (12, 0.051), (13, -0.08), (14, 0.066), (15, 0.084), (16, -0.004), (17, -0.006), (18, -0.014), (19, 0.002), (20, -0.047), (21, 0.03), (22, 0.06), (23, -0.001), (24, -0.012), (25, 0.059), (26, 0.055), (27, 0.088), (28, -0.049), (29, 0.071), (30, -0.012), (31, -0.065), (32, -0.002), (33, -0.038), (34, 0.057), (35, -0.129), (36, 0.117), (37, -0.057), (38, -0.022), (39, -0.052), (40, -0.063), (41, -0.075), (42, -0.063), (43, 0.09), (44, -0.024), (45, -0.001), (46, -0.062), (47, -0.011), (48, -0.028), (49, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.9588896 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
2 0.78665394 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
Author: Alexei Vinokourov, Nello Cristianini, John Shawe-Taylor
Abstract: The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. Then we use such representation both in cross-language and in single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
3 0.69340289 113 nips-2002-Information Diffusion Kernels
Author: Guy Lebanon, John D. Lafferty
Abstract: A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with non-parametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.
4 0.68716997 163 nips-2002-Prediction and Semantic Association
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
5 0.56820488 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features . In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. 1
6 0.54486042 176 nips-2002-Replay, Repair and Consolidation
7 0.53513414 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
8 0.53249085 120 nips-2002-Kernel Design Using Boosting
9 0.53099769 156 nips-2002-On the Complexity of Learning the Kernel Matrix
10 0.49491766 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
11 0.49401677 106 nips-2002-Hyperkernels
12 0.44423196 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
13 0.44324842 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
14 0.44322321 167 nips-2002-Rational Kernels
15 0.4394452 119 nips-2002-Kernel Dependency Estimation
16 0.41395843 115 nips-2002-Informed Projections
17 0.41143629 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
18 0.39758688 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
19 0.38405678 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
20 0.37698552 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory
topicId topicWeight
[(11, 0.338), (23, 0.016), (42, 0.096), (51, 0.011), (54, 0.158), (55, 0.034), (57, 0.012), (67, 0.026), (68, 0.032), (74, 0.083), (92, 0.027), (98, 0.08)]
simIndex simValue paperId paperTitle
1 0.97018248 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch
Author: Lawrence K. Saul, Daniel D. Lee, Charles L. Isbell, Yann L. Cun
Abstract: We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user’s pitch scrolling across the screen as he or she sings into the computer.
2 0.90739673 174 nips-2002-Regularized Greedy Importance Sampling
Author: Finnegan Southey, Dale Schuurmans, Ali Ghodsi
Abstract: Greedy importance sampling is an unbiased estimation technique that reduces the variance of standard importance sampling by explicitly searching for modes in the estimation objective. Previous work has demonstrated the feasibility of implementing this method and proved that the technique is unbiased in both discrete and continuous domains. In this paper we present a reformulation of greedy importance sampling that eliminates the free parameters from the original estimator, and introduces a new regularization strategy that further reduces variance without compromising unbiasedness. The resulting estimator is shown to be effective for difficult estimation problems arising in Markov random field inference. In particular, improvements are achieved over standard MCMC estimators when the distribution has multiple peaked modes.
3 0.86736751 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations
Author: Elzbieta Pekalska, David Tax, Robert Duin
Abstract: Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. Finally, a comparison to a comparable one-class classifier by Campbell and Bennett is given.
same-paper 4 0.86461699 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
5 0.71025336 147 nips-2002-Monaural Speech Separation
Author: Guoning Hu, Deliang Wang
Abstract: Monaural speech separation has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with speech in the highfrequency range. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. Motivated by this, we propose a model for monaural separation that deals with low-frequency and highfrequency signals differently. For resolved harmonics, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the model generates segments based on amplitude modulation (AM) in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. Underlying the separation process is a pitch contour obtained according to psychoacoustic constraints. Our model is systematically evaluated, and it yields substantially better performance than previous systems, especially in the high-frequency range. 1 In t rod u ct i on In a natural environment, speech usually occurs simultaneously with acoustic interference. An effective system for attenuating acoustic interference would greatly facilitate many applications, including automatic speech recognition (ASR) and speaker identification. Blind source separation using independent component analysis [10] or sensor arrays for spatial filtering require multiple sensors. In many situations, such as telecommunication and audio retrieval, a monaural (one microphone) solution is required, in which intrinsic properties of speech or interference must be considered. Various algorithms have been proposed for monaural speech enhancement [14]. These methods assume certain properties of interference and have difficulty in dealing with general acoustic interference. Monaural separation has also been studied using phasebased decomposition [3] and statistical learning [17], but with only limited evaluation. While speech enhancement remains a challenge, the auditory system shows a remarkable capacity for monaural speech separation. According to Bregman [1], the auditory system separates the acoustic signal into streams, corresponding to different sources, based on auditory scene analysis (ASA) principles. Research in ASA has inspired considerable work to build computational auditory scene analysis (CASA) systems for sound separation [19] [4] [7] [18]. Such systems generally approach speech separation in two main stages: segmentation (analysis) and grouping (synthesis). In segmentation, the acoustic input is decomposed into sensory segments, each of which is likely to originate from a single source. In grouping, those segments that likely come from the same source are grouped together, based mostly on periodicity. In a recent CASA model by Wang and Brown [18], segments are formed on the basis of similarity between adjacent filter responses (cross-channel correlation) and temporal continuity, while grouping among segments is performed according to the global pitch extracted within each time frame. In most situations, the model is able to remove intrusions and recover low-frequency (below 1 kHz) energy of target speech. However, this model cannot handle high-frequency (above 1 kHz) signals well, and it loses much of target speech in the high-frequency range. In fact, the inability to deal with speech in the high-frequency range is a common problem for CASA systems. 
We study monaural speech separation with particular emphasis on the high-frequency problem in CASA. For voiced speech, we note that the auditory system can resolve the first few harmonics in the low-frequency range [16]. It has been suggested that different perceptual mechanisms are used to handle resolved and unresolved harmonics [2]. Consequently, our model employs different methods to segregate resolved and unresolved harmonics of target speech. More specifically, our model generates segments for resolved harmonics based on temporal continuity and cross-channel correlation, and these segments are grouped according to common periodicity. For unresolved harmonics, it is well known that the corresponding filter responses are strongly amplitude-modulated and the response envelopes fluctuate at the fundamental frequency (F0) of target speech [8]. Therefore, our model generates segments for unresolved harmonics based on common AM in addition to temporal continuity. The segments are grouped according to AM repetition rates. We calculate AM repetition rates via sinusoidal modeling, which is guided by target pitch estimated according to characteristics of natural speech. Section 2 describes the overall system. In section 3, systematic results and a comparison with the Wang-Brown system are given. Section 4 concludes the paper. 2 M od el d escri p t i on Our model is a multistage system, as shown in Fig. 1. Description for each stage is given below. 2.1 I n i t i a l p r oc e s s i n g First, an acoustic input is analyzed by a standard cochlear filtering model with a bank of 128 gammatone filters [15] and subsequent hair cell transduction [12]. This peripheral processing is done in time frames of 20 ms long with 10 ms overlap between consecutive frames. As a result, the input signal is decomposed into a group of timefrequency (T-F) units. Each T-F unit contains the response from a certain channel at a certain frame. The envelope of the response is obtained by a lowpass filter with Segregated Speech Mixture Peripheral and Initial Pitch mid-level segregation tracking processing Unit Final Resynthesis labeling segregation Figure 1. Schematic diagram of the proposed multistage system. passband [0, 1 kHz] and a Kaiser window of 18.25 ms. Mid-level processing is performed by computing a correlogram (autocorrelation function) of the individual responses and their envelopes. These autocorrelation functions reveal response periodicities as well as AM repetition rates. The global pitch is obtained from the summary correlogram. For clean speech, the autocorrelations generally have peaks consistent with the pitch and their summation shows a dominant peak corresponding to the pitch period. With acoustic interference, a global pitch may not be an accurate description of the target pitch, but it is reasonably close. Because a harmonic extends for a period of time and its frequency changes smoothly, target speech likely activates contiguous T-F units. This is an instance of the temporal continuity principle. In addition, since the passbands of adjacent channels overlap, a resolved harmonic usually activates adjacent channels, which leads to high crosschannel correlations. Hence, in initial segregation, the model first forms segments by merging T-F units based on temporal continuity and cross-channel correlation. Then the segments are grouped into a foreground stream and a background stream by comparing the periodicities of unit responses with global pitch. A similar process is described in [18]. Fig. 
2(a) and Fig. 2(b) illustrate the segments and the foreground stream. The input is a mixture of a voiced utterance and a cocktail party noise (see Sect. 3). Since the intrusion is not strongly structured, most segments correspond to target speech. In addition, most segments are in the low-frequency range. The initial foreground stream successfully groups most of the major segments. 2.2 P i t c h tr a c k i n g In the presence of acoustic interference, the global pitch estimated in mid-level processing is generally not an accurate description of target pitch. To obtain accurate pitch information, target pitch is first estimated from the foreground stream. At each frame, the autocorrelation functions of T-F units in the foreground stream are summated. The pitch period is the lag corresponding to the maximum of the summation in the plausible pitch range: [2 ms, 12.5 ms]. Then we employ the following two constraints to check its reliability. First, an accurate pitch period at a frame should be consistent with the periodicity of the T-F units at this frame in the foreground stream. At frame j, let τ ( j) represent the estimated pitch period, and A(i, j,τ ) the autocorrelation function of uij, the unit in channel i. uij agrees with τ ( j) if A(i , j , τ ( j )) / A(i, j ,τ m ) > θ d (1) (a) (b) Frequency (Hz) 5000 5000 2335 2335 1028 1028 387 387 80 0 0.5 1 Time (Sec) 1.5 80 0 0.5 1 Time (Sec) 1.5 Figure 2. Results of initial segregation for a speech and cocktail-party mixture. (a) Segments formed. Each segment corresponds to a contiguous black region. (b) Foreground stream. Here, θd = 0.95, the same threshold used in [18], and τ m is the lag corresponding to the maximum of A(i, j,τ ) within [2 ms, 12.5 ms]. τ ( j) is considered reliable if more than half of the units in the foreground stream at frame j agree with it. Second, pitch periods in natural speech vary smoothly in time [11]. We stipulate the difference between reliable pitch periods at consecutive frames be smaller than 20% of the pitch period, justified from pitch statistics. Unreliable pitch periods are replaced by new values extrapolated from reliable pitch points using temporal continuity. As an example, suppose at two consecutive frames j and j+1 that τ ( j) is reliable while τ ( j+1) is not. All the channels corresponding to the T-F units agreeing with τ ( j) are selected. τ ( j+1) is then obtained from the summation of the autocorrelations for the units at frame j+1 in those selected channels. Then the re-estimated pitch is further verified with the second constraint. For more details, see [9]. Fig. 3 illustrates the estimated pitch periods from the speech and cocktail-party mixture, which match the pitch periods obtained from clean speech very well. 2.3 U n i t l a be l i n g With estimated pitch periods, (1) provides a criterion to label T-F units according to whether target speech dominates the unit responses or not. This criterion compares an estimated pitch period with the periodicity of the unit response. It is referred as the periodicity criterion. It works well for resolved harmonics, and is used to label the units of the segments generated in initial segregation. However, the periodicity criterion is not suitable for units responding to multiple harmonics because unit responses are amplitude-modulated. As shown in Fig. 4, for a filter response that is strongly amplitude-modulated (Fig. 
4(a)), the target pitch corresponds to a local maximum, indicated by the vertical line, in the autocorrelation instead of the global maximum (Fig. 4(b)). Observe that for a filter responding to multiple harmonics of a harmonic source, the response envelope fluctuates at the rate of F0 [8]. Hence, we propose a new criterion for labeling the T-F units corresponding to unresolved harmonics by comparing AM repetition rates with estimated pitch. This criterion is referred as the AM criterion. To obtain an AM repetition rate, the entire response of a gammatone filter is half-wave rectified and then band-pass filtered to remove the DC component and other possible 14 Pitch Period (ms) 12 (a) 10 180 185 190 195 200 Time (ms) 2 4 6 8 Lag (ms) 205 210 8 6 4 0 (b) 0.5 1 Time (Sec) Figure 3. Estimated target pitch for the speech and cocktail-party mixture, marked by “x”. The solid line indicates the pitch contour obtained from clean speech. 0 10 12 Figure 4. AM effects. (a) Response of a filter with center frequency 2.6 kHz. (b) Corresponding autocorrelation. The vertical line marks the position corresponding to the pitch period of target speech. harmonics except for the F0 component. The rectified and filtered signal is then normalized by its envelope to remove the intensity fluctuations of the original signal, where the envelope is obtained via the Hilbert Transform. Because the pitch of natural speech does not change noticeably within a single frame, we model the corresponding normalized signal within a T-F unit by a single sinusoid to obtain the AM repetition rate. Specifically, f ,φ f ij , φ ij = arg min M ˆ [r (i, jT − k ) − sin(2π k f / f S + φ )]2 , for f ∈[80 Hz, 500 Hz], (2) k =1 ˆ where a square error measure is used. r (i , t ) is the normalized filter response, fS is the sampling frequency, M spans a frame, and T= 10 ms is the progressing period from one frame to the next. In the above equation, fij gives the AM repetition rate for unit uij. Note that in the discrete case, a single sinusoid with a sufficiently high frequency can always match these samples perfectly. However, we are interested in finding a frequency within the plausible pitch range. Hence, the solution does not reduce to a degenerate case. With appropriately chosen initial values, this optimization problem can be solved effectively using iterative gradient descent (see [9]). The AM criterion is used to label T-F units that do not belong to any segments generated in initial segregation; such segments, as discussed earlier, tend to miss unresolved harmonics. Specifically, unit uij is labeled as target speech if the final square error is less than half of the total energy of the corresponding signal and the AM repetition rate is close to the estimated target pitch: | f ijτ ( j ) − 1 | < θ f . (3) Psychoacoustic evidence suggests that to separate sounds with overlapping spectra requires 6-12% difference in F0 [6]. Accordingly, we choose θf to be 0.12. 2.4 F i n a l s e gr e g a t i on a n d r e s y n t he s i s For adjacent channels responding to unresolved harmonics, although their responses may be quite different, they exhibit similar AM patterns and their response envelopes are highly correlated. Therefore, for T-F units labeled as target speech, segments are generated based on cross-channel envelope correlation in addition to temporal continuity. 
The spectra of target speech and intrusion often overlap and, as a result, some segments generated in initial segregation contain both units where target speech dominates and those where intrusion dominates. Given unit labels generated in the last stage, we further divide the segments in the foreground stream, SF, so that all the units in a segment have the same label. Then the streams are adjusted as follows. First, since segments for speech usually are at least 50 ms long, segments with the target label are retained in SF only if they are no shorter than 50 ms. Second, segments with the intrusion label are added to the background stream, SB, if they are no shorter than 50 ms. The remaining segments are removed from SF, becoming undecided. Finally, other units are grouped into the two streams by temporal and spectral continuity. First, SB expands iteratively to include undecided segments in its neighborhood. Then, all the remaining undecided segments are added back to SF. For individual units that do not belong to either stream, they are grouped into SF iteratively if the units are labeled as target speech as well as in the neighborhood of SF. The resulting SF is the final segregated stream of target speech. Fig. 5(a) shows the new segments generated in this process for the speech and cocktailparty mixture. Fig. 5(b) illustrates the segregated stream from the same mixture. Fig. 5(c) shows all the units where target speech is stronger than intrusion. The foreground stream generated by our algorithm contains most of the units where target speech is stronger. In addition, only a small number of units where intrusion is stronger are incorrectly grouped into it. A speech waveform is resynthesized from the final foreground stream. Here, the foreground stream works as a binary mask. It is used to retain the acoustic energy from the mixture that corresponds to 1’s and reject the mixture energy corresponding to 0’s. For more details, see [19]. 3 Evalu at i on an d comp ari son Our model is evaluated with a corpus of 100 mixtures composed of 10 voiced utterances mixed with 10 intrusions collected by Cooke [4]. The intrusions have a considerable variety. Specifically, they are: N0 - 1 kHz pure tone, N1 - white noise, N2 - noise bursts, N3 - “cocktail party” noise, N4 - rock music, N5 - siren, N6 - trill telephone, N7 - female speech, N8 - male speech, and N9 - female speech. Given our decomposition of an input signal into T-F units, we suggest the use of an ideal binary mask as the ground truth for target speech. The ideal binary mask is constructed as follows: a T-F unit is assigned one if the target energy in the corresponding unit is greater than the intrusion energy and zero otherwise. Theoretically speaking, an ideal binary mask gives a performance ceiling for all binary masks. Figure 5(c) illustrates the ideal mask for the speech and cocktail-party mixture. Ideal masks also suit well the situations where more than one target need to be segregated or the target changes dynamically. The use of ideal masks is supported by the auditory masking phenomenon: within a critical band, a weaker signal is masked by a stronger one [13]. In addition, an ideal mask gives excellent resynthesis for a variety of sounds and is similar to a prior mask used in a recent ASR study that yields excellent recognition performance [5]. The speech waveform resynthesized from the final foreground stream is used for evaluation, and it is denoted by S(t). 
The speech waveform resynthesized from the ideal binary mask is denoted by I(t). Furthermore, let e1(t) denote the signal present in I(t) but missing from S(t), and e2(t) the signal present in S(t) but missing from I(t). Then the relative energy loss, R_EL, and the relative noise residue, R_NR, are calculated as follows:

R_{EL} = \frac{\sum_t e_1^2(t)}{\sum_t I^2(t)},  (4a)

R_{NR} = \frac{\sum_t e_2^2(t)}{\sum_t S^2(t)}.  (4b)

Figure 5. Results of final segregation for the speech and cocktail-party mixture. (a) New segments formed in the final segregation. (b) Final foreground stream. (c) Units where target speech is stronger than the intrusion.

Table 1: REL and RNR

             Proposed model         Wang-Brown model
Intrusion    REL (%)   RNR (%)      REL (%)   RNR (%)
N0            2.12      0.02         6.99      0
N1            4.66      3.55        28.96      1.61
N2            1.38      1.30         5.77      0.71
N3            3.83      2.72        21.92      1.92
N4            4.00      2.27        10.22      1.41
N5            2.83      0.10         7.47      0
N6            1.61      0.30         5.99      0.48
N7            3.21      2.18         8.61      4.23
N8            1.82      1.48         7.27      0.48
N9            8.57     19.33        15.81     33.03
Average       3.40      3.32        11.91      4.39

Figure 6. SNR results for segregated speech. White bars show the results from the proposed model, gray bars those from the Wang-Brown system, and black bars those of the mixtures.

The results from our model are shown in Table 1. Each value represents the average over one intrusion mixed with the 10 voiced utterances; a further average across all intrusions is also shown in the table. On average, our system retains 96.60% of the target speech energy, and the relative residual noise is kept at 3.32%. As a comparison, Table 1 also shows the results from the Wang-Brown model [18], whose performance is representative of current CASA systems. As shown in the table, our model reduces REL significantly, and REL and RNR are balanced in our system.

Finally, to compare waveforms directly, we measure a form of signal-to-noise ratio (SNR) in decibels, using the signal resynthesized from the ideal binary mask as ground truth:

SNR = 10 \log_{10} \frac{\sum_t I^2(t)}{\sum_t (I(t) - S(t))^2}.  (5)

The SNR for each intrusion, averaged across the 10 target utterances, is shown in Fig. 6, together with the results from the Wang-Brown system and the SNR of the original mixtures. Our model achieves an average SNR gain of around 12 dB over the original mixtures and about a 5 dB improvement over the Wang-Brown model. A code sketch of these evaluation metrics is given below.

4 Discussion

The main feature of our model lies in using different mechanisms to deal with resolved and unresolved harmonics. As a result, our model is able to recover target speech and reduce noise interference in the high-frequency range where harmonics of target speech are unresolved. The proposed system considers the pitch contour of the target source only. However, it is possible to track the pitch contour of the intrusion as well if it has a harmonic structure. With two pitch contours, one could label a T-F unit more accurately by checking whether its periodicity is more consistent with one contour or the other. Such a method is expected to lead to better performance in the two-speaker situation, e.g. intrusions N7 through N9; as indicated in Fig. 6, the performance gain of our system for such intrusions is relatively limited.

Our model is limited to the separation of voiced speech. In our view, unvoiced speech poses the biggest challenge for monaural speech separation. Other grouping cues, such as onset, offset, and timbre, have been demonstrated to be effective for human ASA [1], and may play a role in grouping unvoiced speech.
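The evaluation metrics of Eqs. (4)-(5) amount to a few energy ratios. The sketch below assumes that e1(t), e2(t), S(t), and I(t) are already available as equal-length waveforms (in the system itself, e1 and e2 come from resynthesizing the mask-difference regions); the function names are our own.

import numpy as np

def rel_rnr(e1, e2, I, S):
    """Relative energy loss and relative noise residue (Eqs. 4a-4b).
    e1: signal present in I(t) but missing from S(t);
    e2: signal present in S(t) but missing from I(t)."""
    rel = np.sum(e1 ** 2) / np.sum(I ** 2)
    rnr = np.sum(e2 ** 2) / np.sum(S ** 2)
    return rel, rnr

def snr_db(S, I):
    """SNR in decibels (Eq. 5), with the ideal-mask resynthesis I(t) as
    ground truth and the segregated output S(t) as the estimate."""
    return 10.0 * np.log10(np.sum(I ** 2) / np.sum((I - S) ** 2))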
In addition, one should consider the acoustic and phonetic characteristics of individual unvoiced consonants. We plan to investigate these issues in future work.

Acknowledgments

We thank G. J. Brown and M. Wu for helpful comments. Preliminary versions of this work were presented at the 2001 IEEE WASPAA and the 2002 IEEE ICASSP. This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F4962001-1-0027).

References

[1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.
[2] R. P. Carlyon and T. M. Shackleton, "Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms?" J. Acoust. Soc. Am., Vol. 95, pp. 3541-3554, 1994.
[3] G. Cauwenberghs, "Monaural separation of independent acoustical components," in Proc. IEEE Symp. Circuits & Systems, 1999.
[4] M. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993.
[5] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Comm., Vol. 34, pp. 267-285, 2001.
[6] C. J. Darwin and R. P. Carlyon, "Auditory grouping," in Hearing, B. C. J. Moore, Ed., San Diego, CA: Academic Press, 1995.
[7] D. P. W. Ellis, Prediction-Driven Computational Auditory Scene Analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science, 1996.
[8] H. Helmholtz, On the Sensations of Tone, Braunschweig: Vieweg & Son, 1863. (A. J. Ellis, English Trans., Dover, 1954.)
[9] G. Hu and D. L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," Technical Report TR6, Ohio State University Department of Computer and Information Science, 2002. (Available at www.cis.ohio-state.edu/~hu)
[10] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, New York: Wiley, 2001.
[11] W. J. M. Levelt, Speaking: From Intention to Articulation, Cambridge, MA: MIT Press, 1989.
[12] R. Meddis, "Simulation of auditory-neural transduction: further studies," J. Acoust. Soc. Am., Vol. 83, pp. 1056-1063, 1988.
[13] B. C. J. Moore, An Introduction to the Psychology of Hearing, 4th Ed., San Diego, CA: Academic Press, 1997.
[14] D. O'Shaughnessy, Speech Communications: Human and Machine, 2nd Ed., New York: IEEE Press, 2000.
[15] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, "An efficient auditory filterbank based on the gammatone function," APU Report 2341, MRC Applied Psychology Unit, Cambridge, U.K., 1988.
[16] R. Plomp and A. M. Mimpen, "The ear as a frequency analyzer II," J. Acoust. Soc. Am., Vol. 43, pp. 764-767, 1968.
[17] S. Roweis, "One microphone source separation," in Advances in Neural Information Processing Systems 13 (NIPS'00), 2001.
[18] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Trans. Neural Networks, Vol. 10, pp. 684-697, 1999.
[19] M. Weintraub, A Theory and Computational Model of Auditory Monaural Sound Separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering, 1985.
6 0.6885587 41 nips-2002-Bayesian Monte Carlo
7 0.6863122 163 nips-2002-Prediction and Semantic Association
8 0.64635426 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
9 0.64252156 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
10 0.63690752 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
11 0.63373291 169 nips-2002-Real-Time Particle Filters
12 0.62589842 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
13 0.61910164 167 nips-2002-Rational Kernels
14 0.61786306 85 nips-2002-Fast Kernels for String and Tree Matching
15 0.60783041 190 nips-2002-Stochastic Neighbor Embedding
16 0.6076259 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
17 0.60750622 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
18 0.60451901 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering
19 0.59949219 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
20 0.59837252 168 nips-2002-Real-Time Monitoring of Complex Industrial Processes with Particle Filters