nips nips2002 nips2002-158 knowledge-graph by maker-knowledge-mining

158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations


Source: pdf

Author: Elzbieta Pekalska, David Tax, Robert Duin

Abstract: Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. Finally, a comparison to a comparable one-class classifier by Campbell and Bennett is given.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. [sent-12, score-0.062]

2 These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. [sent-13, score-0.158]

3 In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. [sent-14, score-0.471]

4 By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. [sent-15, score-0.102]

5 This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. [sent-16, score-0.186]

6 The area of interest covers all the problems, where the specified targets have to be recognized and the anomalies or outlier instances have to be detected. [sent-19, score-0.118]

7 Those might be examples of any type of fault detection, abnormal behavior, rare illnesses, etc. [sent-20, score-0.152]

8 One possible approach to class description problems is to construct oneclass classifiers (OCCs) [13]. [sent-21, score-0.068]

9 An efficient OCC built in a feature space can be found by determining a minimal volume hypersphere around the data [14, 13] or by determining a hyperplane such that it separates the data from the origin as well as possible [11, 12]. [sent-25, score-0.181]

10 To avoid the definition of an explicit feature space, we have already proposed to address kernels as general proximity measures [10] and not only as symmetric, (conditionally) positive definite functions of two variables [2]. [sent-33, score-0.09]

11 Such a proximity should directly arise from an application; see e. [sent-34, score-0.071]

12 Therefore, our reasoning starts not from a feature space, like in case of the other methods [15, 11, 12, 14], but from a given proximity representation. [sent-37, score-0.095]

13 The identification procedure is realized by a proximity function equipped with a threshold, determining whether an instance is a class member or not. [sent-40, score-0.126]

14 Dissimilarity representations are natural for OCCs, since the proximity function can be directly built on them. [sent-46, score-0.091]

15 In this paper, we propose a simple and efficient OCC for general dissimilarity representations, discussed in Section 2, found by the use of linear programming (LP). [sent-47, score-0.415]

16 Section 3 presents our method together with a dissimilarity transformation to make it more robust against objects with large dissimilarities. [sent-48, score-0.585]

17 2 Dissimilarity representations. Although a dissimilarity measure D provides a flexible way to represent the data, there are some constraints. [sent-51, score-0.405]

18 We do not require that D is a strict metric, since non-metric dissimilarities may naturally be found when shapes or objects in images are compared e. [sent-54, score-0.296]

19 A dissimilarity representation can now be seen as a dissimilarity kernel based on the representation set R = {p1 , . . . , pN }. [sent-58, score-0.791]

20 R controls the dimensionality of a dissimilarity space D(·, R). [sent-64, score-0.353]
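
To make the dissimilarity space D(·, R) concrete, the following sketch builds the |X| × |R| matrix of dissimilarities between a set of objects X and a representation set R. The function name and the default choice of the Euclidean distance are illustrative assumptions only; any measure D supplied by the application can be plugged in.

    import numpy as np

    def dissimilarity_matrix(X, R, measure=None):
        # Returns the |X| x |R| matrix D(X, R); row i is the vector D(z_i, R)
        # that represents object z_i in the dissimilarity space spanned by R.
        if measure is None:
            # Euclidean distance, assuming the objects happen to be vectors;
            # in general D can be any (possibly non-metric) dissimilarity.
            measure = lambda x, r: float(np.linalg.norm(np.asarray(x) - np.asarray(r)))
        D = np.empty((len(X), len(R)))
        for i, x in enumerate(X):
            for j, r in enumerate(R):
                D[i, j] = measure(x, r)
        return D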

21 Therefore, to emphasize that objects, like z or pi , might not be feature vectors, they will not be printed in bold. [sent-66, score-0.063]

22 The compactness hypothesis (CH) states that similar objects are close in their representations. [sent-68, score-0.206]

23 For a dissimilarity measure D, this means that D(r, s) is small if objects r and s are similar. [sent-69, score-0.559]

24 If we demand that D(r, s) = 0 if and only if the objects r and s are identical, this implies that they belong to the same class. [sent-70, score-0.206]

25 This can be extended by assuming that all objects s such that D(r, s) < ε, for a sufficiently small ε, are so similar to r that they are members of the same class. [sent-71, score-0.206]

26 Consequently, D(r, t) ≈ D(s, t) for other objects t. [sent-72, score-0.206]

27 Therefore, for dissimilarity representations satisfying the above continuity, the reverse of the CH holds: objects similar in their representations are similar in reality and belong, thereby, to the same class [6, 10]. [sent-73, score-0.691]

28 When the set R contains objects from the class of interest, then objects z with large D(z, R) are outliers and should be remote from the origin in this dissimilarity space. [sent-75, score-0.974]

29 If the dissimilarity measure D is a metric, then all vectors D(z, R) lie in an open prism (unbounded from above), bounded from below by a hyperplane on which the objects from R lie. [sent-77, score-0.799]

30 In principle, z may be placed anywhere in the dissimilarity space D(·, R) only if the triangle inequality is completely violated. [sent-78, score-0.385]

31 the triangle inequalities are only somewhat violated) and, thereby, D(z, R) will still lie either in the prism or in its close neighbourhood. [sent-83, score-0.172]

32 A natural extension is to minimize the volume of a simplex with the main vertex being the origin and the other vertices vj resulting from the intersection of P and the axes of the dissimilarity space (vj is a vector of all zero elements except for vji = ρ/wi , given that wi ≠ 0). [sent-86, score-0.49]

33 Assume now that there are M non-zero weights of the hyperplane P , so, effectively, P is constructed in R^M . [sent-87, score-0.075]

34 From geometry we know that the volume V of such a simplex can be expressed as V = (VBase /M !) · (ρ/||w||2 ), [sent-88, score-0.062]

35 where VBase is the volume of the base, defined by the vertices vj . [sent-89, score-0.049]

36 Maximizing ρ/||w||2 , i.e. the Euclidean distance from the origin to P , is then related to the minimization of V . [sent-92, score-0.076]

37 Let {D(pi , R)}i=1..N , N = |R|, be a dissimilarity representation, bounded by a hyperplane P , i.e. [sent-93, score-0.453]

38 wT D(pi , R) ≤ ρ for i = 1, . . . , N , such that the Lq distance to the origin dq (0, P ) = ρ/||w||p is the smallest (i.e. [sent-98, score-0.076]

39 The mathematical programming formulation of such a problem is [9, 1]: min ρ, s.t. … (1) [sent-103, score-0.126]
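
The hard-margin program (1) is garbled in this extract; a plausible reconstruction, written here in LaTeX and consistent with the decision rule (2) and the soft-margin version (3) quoted below, is the following (an inferred form, not a verbatim quote from the paper):

    \begin{aligned}
    \min_{\mathbf{w},\,\rho} \quad & \rho \\
    \text{s.t.} \quad & \sum_{j} w_j\, D(p_i, p_j) \le \rho, \qquad i = 1,\dots,N, \\
    & \sum_{j} w_j = 1, \qquad w_j \ge 0.
    \end{aligned}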

40 Therefore, by minimizing d∞ (0, P ) = ρ (with ||w||1 = 1), h will be bounded, and so will the volume of the simplex considered. [sent-111, score-0.087]

41 By the above reasoning and (1), a class represented by dissimilarities can be characterized by a linear proximity function with the weights w and the threshold ρ. [sent-112, score-0.213]

42 Our one-class classifier CLPDD , Linear Programming Dissimilarity-data Description, is then defined as: CLPDD (D(z, ·)) = I( Σ_{wj ≠ 0} wj D(z, pj ) ≤ ρ ), (2) where I is the indicator function. [sent-113, score-0.172]

43 The proximity function is found as the solution to a soft margin formulation (which is a straightforward extension of the hard margin case) with ν ∈ (0, 1] being the upper bound on the outlier fraction for the target class: min ρ + (1/(νN)) Σ_{i=1}^{N} ξi , s.t. … (3) [sent-114, score-0.333]
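
As an illustration of how the soft-margin program (3) and the decision rule (2) fit together, here is a minimal sketch using scipy's linprog. The exact constraint set is inferred from the text, and the default ν and helper names such as fit_lpdd are our own, so treat this as an assumption-laden sketch rather than the authors' implementation.

    import numpy as np
    from scipy.optimize import linprog

    def fit_lpdd(D, nu=0.1):
        # Soft-margin LPDD on an N x M dissimilarity matrix D = D(R, R_red)
        # (M == N when the full representation set is used):
        #   min  rho + 1/(nu*N) * sum_i xi_i
        #   s.t. sum_j w_j D(p_i, p_j) <= rho + xi_i   (i = 1..N)
        #        sum_j w_j = 1,  w_j >= 0,  xi_i >= 0
        N, M = D.shape
        # variable vector: [w_1..w_M, rho, xi_1..xi_N]
        c = np.concatenate([np.zeros(M), [1.0], np.full(N, 1.0 / (nu * N))])
        A_ub = np.hstack([D, -np.ones((N, 1)), -np.eye(N)])   # D w - rho - xi <= 0
        b_ub = np.zeros(N)
        A_eq = np.concatenate([np.ones(M), [0.0], np.zeros(N)])[None, :]
        b_eq = np.array([1.0])
        bounds = [(0, None)] * (M + 1 + N)                    # all variables nonnegative
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        w, rho = res.x[:M], res.x[M]
        return w, rho

    def lpdd_predict(D_test, w, rho, tol=1e-8):
        # Decision rule (2): accept z when the sum over support objects
        # (w_j != 0) of w_j * D(z, p_j) is at most rho.
        support = w > tol
        scores = D_test[:, support] @ w[support]
        return scores <= rho

Only the columns belonging to support objects are needed at test time, which is what makes the resulting description cheap to evaluate.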

44 Objects corresponding to such non-zero weights will be called support objects (SO). [sent-120, score-0.243]

45 The data is represented in a metric dissimilarity space, and by the triangle inequality the data can only be inside the prism indicated by the dashed lines. [sent-123, score-0.533]

46 The LPDD boundary is given by the hyperplane, as close to the origin as possible (by minimizing ρ), while still accepting (most) target objects. [sent-124, score-0.146]

47 By the discussion in Section 2, the outliers should be remote from the origin. [sent-125, score-0.125]

48 In (3), ν ∈ (0, 1] is the upper bound on the outlier fraction for the target class, i. [sent-127, score-0.138]

49 the fraction of objects that lie outside the boundary; see also [11, 12]. [sent-129, score-0.265]

50 P is not expected to be parallel to the prism’s bottom hyperplane D(·, xi ). [sent-131, score-0.075]

51 Figure 1: Illustrations of the LPDD in the dissimilarity space (left) and the LPSD in the similarity space (right). [sent-135, score-0.377]

52 The LPDD tries to minimize the max-norm distance from the bounding hyperplane to the origin, while the LPSD tries to attract the hyperplane towards the average of the distribution. [sent-137, score-0.17]

53 If D is unbounded, then some atypical objects of the target class (i. [sent-145, score-0.284]

54 Therefore, we propose a nonlinear, monotonic transformation of the distances to the interval [0, 1] such that locally the distances are scaled linearly while, globally, all large distances become close to 1. [sent-148, score-0.102]

55 A function with such properties is the sigmoid function (the hyperbolic tangent can also be used), i.e. [sent-149, score-0.045]

56 Now, the transformation can be applied in an element-wise way to the dissimilarity representation such that Ds (z, pi ) = Sigm(D(z, pi )). [sent-154, score-0.512]
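
A sketch of this element-wise transformation is given below; the particular sigmoid 2/(1 + exp(-d/s)) - 1 is one function with the stated properties (approximately linear for d << s, saturating near 1 for d >> s), and may differ in constants from the one actually used in the paper.

    import numpy as np

    def sigm(D, s):
        # Element-wise sigmoidal rescaling of dissimilarities to [0, 1):
        # roughly d/(2*s) for small d, close to 1 for d much larger than s,
        # which bounds the influence of objects with large dissimilarities.
        return 2.0 / (1.0 + np.exp(-np.asarray(D, dtype=float) / s)) - 1.0

    # Ds = sigm(D, s) can then replace the raw D when fitting the LPDD.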

57 Recently, Campbell and Bennett have proposed an LP formulation for novelty detection [3]. [sent-157, score-0.127]

58 Here, to be consistent with our LPDD method, we rewrite their soft-margin LP formulation (a hard margin formulation is then obvious) to include a trade-off parameter ν (which lacks, however, the interpretation given in the LPDD), as follows: min …, s.t. … [sent-165, score-0.127]

59 The CLPSD is then defined as: CLPSD (K(z, ·)) = I( Σ_j wj K(z, xj ) + ρ ≥ 0 ). [sent-171, score-0.086]
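
For completeness, the LPSD decision rule quoted above can be written directly; the weights w and offset ρ are assumed to come from Campbell and Bennett's LP, which is not reproduced here.

    import numpy as np

    def lpsd_predict(K_test, w, rho):
        # C_LPSD(K(z, .)) = I( sum_j w_j K(z, x_j) + rho >= 0 )
        return K_test @ np.asarray(w) + rho >= 0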

60 Here, the data is represented in a similarity space, such that all objects lie in a hypercube between 0 and 1. [sent-174, score-0.268]

61 Objects remote from the representation objects will be close to the origin. [sent-175, score-0.267]

62 The hyperplane is optimized to have minimal average output for the whole target set. [sent-176, score-0.125]

63 This does not necessarily mean a good separation from the origin or a small volume of the OCC, possibly resulting in an unnecessarily high outlier acceptance rate. [sent-177, score-0.199]

64 If the computation of (dis)similarities is very costly, one can consider a reduced representation set Rred ⊂ R, consisting of n < N objects. [sent-242, score-0.07]

65 Then, dissimilarity or similarity representations are given as rectangular matrices D(R, Rred ) or K(S, Sred ), respectively. [sent-243, score-0.439]
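
A reduced representation set can be sketched by subsampling the columns of the square matrix D(R, R), which yields the rectangular D(R, Rred); random selection is only one possible choice (the experiments below also pick Rred at random), and the helper name is ours.

    import numpy as np

    def reduced_representation(D_full, n, seed=None):
        # Pick n < N columns of D(R, R) at random, giving D(R, R_red).
        rng = np.random.default_rng(seed)
        idx = rng.choice(D_full.shape[1], size=n, replace=False)
        return D_full[:, idx], idx

    # D_red, idx = reduced_representation(D, 6)   # e.g. 6 prototypes, as in Fig. 5
    # w, rho = fit_lpdd(D_red, nu=0.1)            # the LP sketched above already
    #                                             # accepts a rectangular D(R, R_red)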

66 Another reason to consider reduced representations is the robustness against outliers. [sent-245, score-0.089]

67 The first dataset contains two clusters with objects represented by Euclidean distances. [sent-249, score-0.231]

68 The objects are represented by a slightly non-metric L0. [sent-251, score-0.206]

69 The parameter s used in all plots refers either to the scaling parameter in the sigmoid function for the LPDD (based on D s ) or to the scaling parameter in the RBF kernel. [sent-260, score-0.146]

70 The pictures show similar behavior of both the LPDD and the LPSD; the LPDD tends to be just slightly tighter around the target class. [sent-261, score-0.05]

71 6dm , 3dm , 8dm , where dm is the median distance of all the distances. [sent-315, score-0.07]

72 The objects of the reduced sets Rred and Sred are marked by triangles. [sent-412, score-0.276]

73 Figs. 3 and 4 show the case where three outliers lying outside a single uniformly distributed cluster should be ignored when an OCC with a soft margin is trained. [sent-417, score-0.148]

74 From these figures, we can observe that the LPDD gives a tighter class description, which is more robust against the scaling parameter and more robust against outliers, as well. [sent-418, score-0.11]

75 Fig. 5 presents some results for the reduced representations, in which just 6 objects are randomly chosen for the set Rred . [sent-422, score-0.243]

76 In the left four plots, Rred contains objects from the uniform cluster only, and both methods perform equally well. [sent-423, score-0.206]

77 It can be judged that for a suitable scaling s, no outliers become support objects in the LPDD, which is often the case for the LPSD; see also Fig. [sent-425, score-0.37]

78 Fault detection is an important problem in machine diagnostics: failure to detect faults can lead to machine damage, while false alarms can lead to unnecessary expenses. [sent-433, score-0.067]

79 As an example, we will consider the detection of four types of faults in ball-bearing cages, a dataset [16] considered in [3]. [sent-434, score-0.187]

80 The dataset consists of five categories: normal behavior (NB), corresponding … (Table 1: The errors of the first and second kind (in %) of the LPDD and LPSD on two dissimilarity representations for the ball-bearing data.) [sent-437, score-0.43]

81 (Table 1 residue: L1 dissimilarity representation; columns Method, Optimal s, # of SO, NB, T1 –T4 .) [sent-451, score-0.386]

82 to measurements made from new ball-bearings, and four types of anomalies, say, T1 –T4 , corresponding either to the damaged outer race or cages or a badly worn ball-bearing. [sent-487, score-0.059]

83 Test LPSD methods by the use of the validation set on the Euclidean and L1 dissimilarity representations. [sent-491, score-0.374]

84 It can be concluded that the L1 representation is far more convenient for the fault detection, especially if we look at the fault types T3 and T4 , which were unseen in the validation process. [sent-493, score-0.29]

85 The LPSD performs better on normal instances (yields a smaller error). (Figure 6: Fault detection data.) [sent-494, score-0.044]

86 This is to be expected, since the boundary is less tight, so fewer support objects (SO) are needed. [sent-496, score-0.283]

87 Note also that the reduced representation, based on 180 randomly chosen target objects (≈ 20%), mostly yields significantly better performances in outlier detection for the LPDD, and in target acceptance for the LPSD. [sent-501, score-0.5]

88 Therefore, we can conclude that if a failure in the fault detection has higher costs than the cost of misclassifying target objects, then our approach should be recommended. [sent-502, score-0.235]

89 5 Conclusions. We have proposed the Linear Programming Dissimilarity-data Description (LPDD) classifier, directly built on dissimilarity representations. [sent-503, score-0.373]

90 This method is efficient, which means that only some objects are needed for the computation of dissimilarities in a test phase. [sent-504, score-0.296]

91 The novelty of our approach lies in its reformulation for general dissimilarity measures, which, we think, is more natural for class descriptors. [sent-505, score-0.431]

92 Since dissimilarity measures might be unbounded, we have also proposed to transform dissimilarities by the sigmoid function, which makes the LPDD more robust against objects with large dissimilarities. [sent-506, score-0.72]

93 For all considered datasets, the LPDD yields a more compact target description than the LPSD. [sent-509, score-0.09]

94 The LPDD is more robust against outliers in the training set, in particular, when only some objects are considered for a reduced representation. [sent-510, score-0.366]

95 Moreover, with a proper scaling parameter s of the sigmoid function, the support objects in the LPDD do not contain outliers, while it seems difficult for the LPSD to achieve the same. [sent-511, score-0.318]

96 In the original formulation, the support objects of the LPSD tend to lie on the boundary, while for the LPDD, they are mostly ’inside’ the boundary. [sent-512, score-0.281]

97 In summary, our LPDD method can be recommended when the failure to detect outliers is more expensive than the costs of a false alarm. [sent-514, score-0.12]

98 Such a constant should be related to the dissimilarity values in the neighborhood of the boundary. [sent-516, score-0.353]

99 The authors are solely responsible for information communicated and the European Commission is not responsible for any views or results expressed. [sent-520, score-0.044]

100 Combining support vector and mathematical programming methods for induction. [sent-526, score-0.099]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lpdd', 0.613), ('lpsd', 0.46), ('dissimilarity', 0.353), ('objects', 0.206), ('rred', 0.119), ('fault', 0.118), ('lp', 0.112), ('occ', 0.102), ('prism', 0.102), ('outliers', 0.097), ('dissimilarities', 0.09), ('outlier', 0.088), ('wj', 0.086), ('hyperplane', 0.075), ('proximity', 0.071), ('clpdd', 0.068), ('pi', 0.063), ('programming', 0.062), ('euclidean', 0.056), ('origin', 0.056), ('representations', 0.052), ('delft', 0.051), ('novelty', 0.05), ('target', 0.05), ('classi', 0.048), ('sigmoid', 0.045), ('dis', 0.044), ('detection', 0.044), ('descriptors', 0.041), ('bennett', 0.041), ('boundary', 0.04), ('description', 0.04), ('lie', 0.038), ('reduced', 0.037), ('support', 0.037), ('unbounded', 0.036), ('campbell', 0.036), ('abnormal', 0.034), ('cages', 0.034), ('clpsd', 0.034), ('occs', 0.034), ('sigm', 0.034), ('sred', 0.034), ('vbase', 0.034), ('rectangular', 0.034), ('nb', 0.034), ('distances', 0.034), ('marked', 0.033), ('representation', 0.033), ('formulation', 0.033), ('simplex', 0.032), ('triangle', 0.032), ('wt', 0.032), ('ch', 0.031), ('min', 0.031), ('ers', 0.03), ('scaling', 0.03), ('margin', 0.03), ('volume', 0.03), ('anomalies', 0.03), ('dm', 0.03), ('diagnostics', 0.03), ('class', 0.028), ('remote', 0.028), ('realized', 0.027), ('netherlands', 0.027), ('er', 0.026), ('robust', 0.026), ('metric', 0.026), ('badly', 0.025), ('proximities', 0.025), ('acceptance', 0.025), ('bounded', 0.025), ('dataset', 0.025), ('arti', 0.024), ('cial', 0.024), ('similarity', 0.024), ('reasoning', 0.024), ('lkopf', 0.024), ('failure', 0.023), ('sch', 0.023), ('refers', 0.022), ('responsible', 0.022), ('compactness', 0.022), ('pn', 0.022), ('validation', 0.021), ('pami', 0.021), ('outside', 0.021), ('distance', 0.02), ('built', 0.02), ('inside', 0.02), ('median', 0.02), ('formulations', 0.02), ('smola', 0.019), ('vertices', 0.019), ('kernel', 0.019), ('kernels', 0.019), ('plots', 0.019), ('xi', 0.019), ('refer', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

Author: Elzbieta Pekalska, David Tax, Robert Duin

Abstract: Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. Finally, a comparison to a comparable one-class classifier by Campbell and Bennett is given.

2 0.15539891 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

Author: Sepp Hochreiter, Klaus Obermayer

Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1

3 0.083061844 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

Author: Christopher Williams, Michalis K. Titsias

Abstract: We consider data which are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoiding the combinatorial explosion, and present results showing successful extraction of objects from real images.

4 0.063882619 98 nips-2002-Going Metric: Denoising Pairwise Data

Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods. 1

5 0.060038034 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

Author: Yves Grandvalet, Stéphane Canu

Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.

6 0.057401448 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

7 0.055532318 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

8 0.055046659 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

9 0.054026123 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm

10 0.052326407 119 nips-2002-Kernel Dependency Estimation

11 0.050527368 45 nips-2002-Boosted Dyadic Kernel Discriminants

12 0.049204163 190 nips-2002-Stochastic Neighbor Embedding

13 0.048084334 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

14 0.047547337 156 nips-2002-On the Complexity of Learning the Kernel Matrix

15 0.04714229 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

16 0.047131438 120 nips-2002-Kernel Design Using Boosting

17 0.046954125 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

18 0.04528622 92 nips-2002-FloatBoost Learning for Classification

19 0.044014785 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

20 0.042467918 178 nips-2002-Robust Novelty Detection with Single-Class MPM


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.136), (1, -0.072), (2, 0.038), (3, 0.002), (4, 0.046), (5, -0.017), (6, 0.04), (7, -0.084), (8, -0.025), (9, 0.05), (10, 0.003), (11, 0.023), (12, -0.036), (13, -0.029), (14, 0.041), (15, -0.034), (16, 0.006), (17, -0.013), (18, 0.059), (19, 0.006), (20, -0.085), (21, 0.007), (22, 0.062), (23, 0.012), (24, -0.073), (25, -0.047), (26, 0.082), (27, -0.009), (28, -0.065), (29, 0.045), (30, -0.038), (31, 0.123), (32, -0.202), (33, 0.08), (34, -0.031), (35, -0.102), (36, 0.057), (37, 0.071), (38, 0.054), (39, 0.048), (40, -0.073), (41, -0.055), (42, 0.162), (43, 0.043), (44, 0.1), (45, 0.008), (46, -0.142), (47, 0.075), (48, 0.008), (49, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93073273 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

Author: Elzbieta Pekalska, David Tax, Robert Duin

Abstract: Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. Finally, a comparison to a comparable one-class classifier by Campbell and Bennett is given.

2 0.59653902 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

Author: Sepp Hochreiter, Klaus Obermayer

Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1

3 0.58463538 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

Author: Christopher Williams, Michalis K. Titsias

Abstract: We consider data which are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoiding the combinatorial explosion, and present results showing successful extraction of objects from real images.

4 0.54073584 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm

Author: Sergey Kirshner, Igor V. Cadez, Padhraic Smyth, Chandrika Kamath

Abstract: We describe the application of probabilistic model-based learning to the problem of automatically identifying classes of galaxies, based on both morphological and pixel intensity characteristics. The EM algorithm can be used to learn how to spatially orient a set of galaxies so that they are geometrically aligned. We augment this “ordering-model” with a mixture model on objects, and demonstrate how classes of galaxies can be learned in an unsupervised manner using a two-level EM algorithm. The resulting models provide highly accurate classification of galaxies in cross-validation experiments.

5 0.53822744 190 nips-2002-Stochastic Neighbor Embedding

Author: Geoffrey E. Hinton, Sam T. Roweis

Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts.

6 0.52173495 107 nips-2002-Identity Uncertainty and Citation Matching

7 0.50523937 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

8 0.41175777 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

9 0.38811138 98 nips-2002-Going Metric: Denoising Pairwise Data

10 0.36333585 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems

11 0.36060035 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

12 0.33065355 45 nips-2002-Boosted Dyadic Kernel Discriminants

13 0.29777035 178 nips-2002-Robust Novelty Detection with Single-Class MPM

14 0.29477647 92 nips-2002-FloatBoost Learning for Classification

15 0.28861123 119 nips-2002-Kernel Dependency Estimation

16 0.28438056 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

17 0.28130695 194 nips-2002-The Decision List Machine

18 0.26636803 105 nips-2002-How to Combine Color and Shape Information for 3D Object Recognition: Kernels do the Trick

19 0.26575738 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

20 0.26366091 140 nips-2002-Margin Analysis of the LVQ Algorithm


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.388), (23, 0.017), (42, 0.043), (54, 0.137), (55, 0.04), (68, 0.02), (74, 0.096), (87, 0.011), (92, 0.04), (98, 0.087)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94874775 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch

Author: Lawrence K. Saul, Daniel D. Lee, Charles L. Isbell, Yann L. Cun

Abstract: We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user’s pitch scrolling across the screen as he or she sings into the computer.

2 0.85868967 174 nips-2002-Regularized Greedy Importance Sampling

Author: Finnegan Southey, Dale Schuurmans, Ali Ghodsi

Abstract: Greedy importance sampling is an unbiased estimation technique that reduces the variance of standard importance sampling by explicitly searching for modes in the estimation objective. Previous work has demonstrated the feasibility of implementing this method and proved that the technique is unbiased in both discrete and continuous domains. In this paper we present a reformulation of greedy importance sampling that eliminates the free parameters from the original estimator, and introduces a new regularization strategy that further reduces variance without compromising unbiasedness. The resulting estimator is shown to be effective for difficult estimation problems arising in Markov random field inference. In particular, improvements are achieved over standard MCMC estimators when the distribution has multiple peaked modes.

same-paper 3 0.82686418 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

Author: Elzbieta Pekalska, David Tax, Robert Duin

Abstract: Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. Finally, a comparison to a comparable one-class classifier by Campbell and Bennett is given.
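A minimal sketch of a linear-programming one-class description of the kind discussed here: non-negative weights on dissimilarities to a representation set plus a threshold, fitted by an LP. The exact objective, the ν-style trade-off parameter, the sum-to-one constraint, and the SciPy solver are illustrative assumptions, not necessarily the authors' formulation.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def fit_lp_occ(D, nu=0.1):
    """Fit a one-class LP description from an n x r dissimilarity matrix D
    (n training objects vs. r representation objects).

    Variables: [w_1..w_r, rho, xi_1..xi_n].
    Minimize  rho + (1/(nu*n)) * sum(xi)
    s.t.      D @ w <= rho + xi,  sum(w) = 1,  w >= 0,  xi >= 0.
    """
    n, r = D.shape
    c = np.concatenate([np.zeros(r), [1.0], np.full(n, 1.0 / (nu * n))])
    A_ub = np.hstack([D, -np.ones((n, 1)), -np.eye(n)])   # D w - rho - xi <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(r), [0.0], np.zeros(n)]).reshape(1, -1)
    b_eq = np.array([1.0])                                 # sum(w) = 1
    bounds = [(0, None)] * r + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:r], res.x[r]                             # weights w, threshold rho

# toy usage: Euclidean dissimilarities of a Gaussian blob to itself
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
D = cdist(X, X)                        # representation set = training set here
w, rho = fit_lp_occ(D, nu=0.05)
Z = np.vstack([rng.normal(size=(5, 2)), rng.normal(size=(5, 2)) + 4.0])
scores = cdist(Z, X) @ w               # weighted dissimilarity to the prototypes
print(scores <= rho)                   # expected: near points accepted, shifted points rejected
```

With a small ν, most of the weights w_j typically end up exactly zero, so the description is supported by only a few prototype objects; classifying a new object then requires its dissimilarities to those prototypes alone, which is the source of the reduced representation set mentioned in the abstract.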

4 0.79960012 125 nips-2002-Learning Semantic Similarity

Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor

Abstract: The standard representation of text documents as bags of words suffers from well-known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and vice versa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results.
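One way to realize a diffusion-style semantic kernel of the kind the abstract describes is sketched below; the toy term-document matrix, the exponential-diffusion form, and the lack of normalization are assumptions for illustration, and the paper's exact construction may differ.

```python
import numpy as np
from scipy.linalg import expm

# toy term-document count matrix (terms x documents), not taken from the paper
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])

G = X @ X.T                 # term-term co-occurrence graph (Gram matrix)
lam = 0.05                  # the real-valued parameter of the kernel family
K_terms = expm(lam * G)     # exponential diffusion kernel over terms (PSD)
K_docs = X.T @ K_terms @ X  # induced document-document similarity, still a valid kernel
print(np.round(K_docs, 2))
```

Because K_terms is positive semi-definite, the induced document kernel X.T @ K_terms @ X is positive semi-definite as well, so it can be plugged directly into a support vector machine as the abstract suggests.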

5 0.65515226 147 nips-2002-Monaural Speech Separation

Author: Guoning Hu, Deliang Wang

Abstract: Monaural speech separation has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with speech in the high-frequency range. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. Motivated by this, we propose a model for monaural separation that deals with low-frequency and high-frequency signals differently. For resolved harmonics, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the model generates segments based on amplitude modulation (AM) in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. Underlying the separation process is a pitch contour obtained according to psychoacoustic constraints. Our model is systematically evaluated, and it yields substantially better performance than previous systems, especially in the high-frequency range.
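A generic sketch of estimating an AM repetition rate by least-squares fitting of a single sinusoid over a grid of candidate frequencies, in the spirit of the sinusoidal modeling mentioned above; the frequency range, grid resolution, sampling rate, and toy envelope are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def am_repetition_rate(envelope, fs, f_lo=80.0, f_hi=500.0, n_grid=400):
    """Estimate the dominant repetition rate (Hz) of a response envelope by
    least-squares fitting a single sinusoid at each candidate frequency and
    keeping the frequency with the smallest residual."""
    t = np.arange(len(envelope)) / fs
    x = envelope - envelope.mean()
    best_f, best_err = None, np.inf
    for f in np.linspace(f_lo, f_hi, n_grid):
        # design matrix for a*cos(2*pi*f*t) + b*sin(2*pi*f*t)
        A = np.column_stack([np.cos(2 * np.pi * f * t), np.sin(2 * np.pi * f * t)])
        coef, *_ = np.linalg.lstsq(A, x, rcond=None)
        err = np.sum((x - A @ coef) ** 2)
        if err < best_err:
            best_f, best_err = f, err
    return best_f

# toy check: a 20 ms frame sampled at 16 kHz with a 200 Hz amplitude modulation
fs = 16000
t = np.arange(int(0.02 * fs)) / fs
env = 1.0 + 0.5 * np.cos(2 * np.pi * 200.0 * t)
print(am_repetition_rate(env, fs))   # close to 200
```

Comparing the recovered rate against an estimated pitch then gives a natural criterion for deciding whether a high-frequency channel is dominated by the target source.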

6 0.61979151 41 nips-2002-Bayesian Monte Carlo

7 0.61914384 163 nips-2002-Prediction and Semantic Association

8 0.56582361 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

9 0.56114048 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

10 0.54676771 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

11 0.54574192 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

12 0.54150671 169 nips-2002-Real-Time Particle Filters

13 0.54071641 85 nips-2002-Fast Kernels for String and Tree Matching

14 0.53899789 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis

15 0.53209347 168 nips-2002-Real-Time Monitoring of Complex Industrial Processes with Particle Filters

16 0.5263561 167 nips-2002-Rational Kernels

17 0.52577698 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering

18 0.52066177 113 nips-2002-Information Diffusion Kernels

19 0.51979977 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

20 0.51899552 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs