nips nips2003 nips2003-85 knowledge-graph by maker-knowledge-mining

85 nips-2003-Human and Ideal Observers for Detecting Image Curves


Source: pdf

Author: Fang Fang, Daniel Kersten, Paul R. Schrater, Alan L. Yuille

Abstract: This paper compares the ability of human observers to detect target image curves with that of an ideal observer. The target curves are sampled from a generative model which specifies (probabilistically) the geometry and local intensity properties of the curve. The ideal observer performs Bayesian inference on the generative model using MAP estimation. Varying the probability model for the curve geometry enables us to investigate whether human performance is best for target curves that obey specific shape statistics, in particular those observed on natural shapes. Experiments are performed with data on both rectangular and hexagonal lattices. Our results show that human observers’ performance approaches that of the ideal observer and is, in general, closest to the ideal for conditions where the target curve tends to be straight or similar to natural statistics on curves. This suggests a bias of human observers towards straight curves and natural statistics.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper compares the ability of human observers to detect target image curves with that of an ideal observer. [sent-7, score-1.134]

2 The target curves are sampled from a generative model which specifies (probabilistically) the geometry and local intensity properties of the curve. [sent-8, score-0.563]

3 The ideal observer performs Bayesian inference on the generative model using MAP estimation. [sent-9, score-0.472]

4 Varying the probability model for the curve geometry enables us to investigate whether human performance is best for target curves that obey specific shape statistics, in particular those observed on natural shapes. [sent-10, score-0.927]

5 Experiments are performed with data on both rectangular and hexagonal lattices. [sent-11, score-0.471]

6 Our results show that human observers’ performance approaches that of the ideal observer and is, in general, closest to the ideal for conditions where the target curve tends to be straight or similar to natural statistics on curves. [sent-12, score-1.571]

7 This suggests a bias of human observers towards straight curves and natural statistics. [sent-13, score-0.936]

8 1 Introduction Detecting curves in images is a fundamental visual task which requires combining local intensity cues with prior knowledge about the probable shape of the curve. [sent-14, score-0.52]

9 ) for the shape geometry of the target curve and Pon (. [sent-18, score-0.49]

10 Sampling this model gives us semi-realistic images defined on either rectangular or hexagonal grids. [sent-21, score-0.525]

11 The human ob- Figure 1 (panel labels: Almost impossible, Intermediate, Easy): It is plausible that the human visual system is adapted to the shape statistics of curves and paths in images like these. [sent-22, score-0.773]

12 Left panel illustrates the trade-off between the reliability of intensity measurements and priors on curve geometry. [sent-23, score-0.363]

13 The tent is easy to detect because of the large intensity difference between it and the background, so little prior knowledge about its shape is required. [sent-24, score-0.34]

14 Centre panel illustrates the experimental task of tracing a curve (or road) in clutter. [sent-26, score-0.24]

15 Right panel shows that the first order shape statistics from 49 object images (one datapoint per image) are clustered round P (straight) = 0. [sent-27, score-0.239]

16 server’s task is to detect the target curve and to report it by tracking it with the (computer) mouse. [sent-31, score-0.386]

17 Human performance is compared with that of an ideal observer which computes the target curve using Bayesian inference (implemented by a dynamic programming algorithm). [sent-32, score-0.786]

18 The ideal observer gives a benchmark against which human performance can be measured. [sent-33, score-0.705]

19 Poff we can explore the ability of the human visual system to detect curves under a variety of conditions. [sent-35, score-0.443]

20 In particular, we can investigate how human performance depends on the geometrical distribution PG of the curves. [sent-38, score-0.257]

21 It is plausible that the human visual system has adapted to the statistics of the natural world, see figure (1), and in particular to the geometry of salient curves. [sent-39, score-0.453]

22 Our measurements of natural image curves, see figure (1), and studies by [16], [10], [5] and [2], show distributions for shape statistics similar to those found for image intensity statistics [11, 9, 13]. [sent-40, score-0.393]

23 We therefore investigate whether human performance approaches that of the ideal when the probability distribution PG is similar to that for curves in natural images. [sent-41, score-0.74]

24 This investigation requires specifying performance measures to determine how close human performance is to the ideal (so that we can quantify whether humans do better or worse relative to the ideal for different shape distributions PG ). [sent-42, score-1.072]

25 The second measure computes the value of the posterior distribution for the curves detected by the human and the ideal and takes the logarithm of their ratio. [sent-45, score-0.641]

26 The experiments are performed by human observers who are required to trace the target curve in the image. [sent-47, score-0.737]

27 We simulated the images first on a rectangular grid and then on a hexagonal grid to test the generality of the results. [sent-48, score-0.454]

28 In these experiments we varied the probability distributions of the geometry PG and the distribution Pon of the intensity on the target curve to allow us to explore a range of different conditions (we kept the distribution Poff fixed). [sent-49, score-0.604]

29 Centre panel: a pyramid structure used in the simulations on the rectangular grid. [sent-57, score-0.251]

30 Right panel: Typical distributions of Pon, Poff. Section (3) describes our probabilistic model and specifies the ideal observer. [sent-58, score-0.326]

31 Sections (5,6) describe experimental results on rectangular and hexagonal grids respectively in terms of our two performance measures. [sent-60, score-0.538]

32 2 Previous Work Previous psychophysical studies have shown conditions for which the human visual system is able to effectively group contour fragments when embedded in an array of distracting fragments [3, 8]. [sent-61, score-0.537]

33 For example, it is known that the degree to which a target contour “pops out” depends on the degree of similarity of the orientation of neighboring fragments (typically gabor patches) [3], and that global closure facilitates grouping [8]. [sent-63, score-0.406]

34 Recently, several researchers have shown that psychophysical performance for contour grouping may be understood in terms of the statistical properties of natural contours [12, 5]. [sent-64, score-0.38]

35 For example, Geisler [5] has shown that human contour detection for line segments can be quantitatively predicted from a local grouping rule derived from measurements of local edge statistics. [sent-65, score-0.467]

36 However, apart from studies that manipulate the contrast of gabor patch tokens [4], there has been little work on how intensity and contour geometry information is combined by the visual system under conditions that begin to approximate those of natural contours. [sent-66, score-0.573]

37 In this paper we attempt to fill this gap by using stimuli sampled from a generative model which enables us to quantitatively characterize the shape and intensity information available for detecting curves and compare human performance with that of an ideal detector. [sent-67, score-0.98]

38 Following [6], we formulate target curve detection as tree search, see figure (2), through a Q-nary tree. [sent-69, score-0.333]

39 A target curve hypothesis consists of a set of connected straight-line segments. [sent-71, score-0.27]

40 For example, the simplest case sets Q = 3 with an alphabet a1, a2, a3 corresponding to the decisions: (i) a1 – go straight (0 degrees), (ii) a2 – go left (-5 degrees), or (iii) a3 – go right (+5 degrees). [sent-74, score-0.327]

41 Our analysis of image curve statistics suggests that P (straight) = 0. [sent-84, score-0.225]

42 The quantity λon − λoff is a measure of the local intensity contrast of the target contour and so we informally refer to it as the signal-to-noise ratio (SNR). [sent-95, score-0.492]

43 The Ideal Observer estimates the target curve trajectory by MAP estimation (which we compute using dynamic programming). [sent-96, score-0.27]

44 We implement this model on both rectangular and hexagonal lattices (the hexagonal lattices equate for contrast at borders, and are visually more realistic). [sent-102, score-0.77]

45 For a rectangular lattice, the easiest way to do this involves defining a pyramid where paths start at the apex and the only allowable “moves” are: (i) one step down, (ii) one step down and one step left, and (iii) one step down and one step right. [sent-104, score-0.362]

46 A similar procedure is used on the hexagonal lattice. [sent-106, score-0.258]
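The ideal observer described in the preceding items can be illustrated with a minimal dynamic-programming sketch (a Viterbi-style pass, assuming the simplified setting from the text: a rectangular grid, a path that moves down-left, straight down, or down-right at each row, a per-pixel log-likelihood ratio log Pon(I)/Poff(I), and a geometry prior PG over the three moves; function and variable names are ours, not the authors' implementation):

```python
import numpy as np

def map_curve(log_lik_ratio, log_prior, start_col):
    """MAP curve estimate by dynamic programming on a rectangular grid.

    log_lik_ratio : (H, W) array, log Pon(I(x)) / Poff(I(x)) for every pixel.
    log_prior     : log-probabilities of the three moves, i.e. log PG over the
                    column changes dc = -1 (down-left), 0 (straight down), +1 (down-right).
    start_col     : column of the curve in the top row (the pyramid apex).
    Returns one column index per row for the highest-posterior path.
    """
    H, W = log_lik_ratio.shape
    score = np.full(W, -np.inf)
    score[start_col] = log_lik_ratio[0, start_col]
    back = np.zeros((H, W), dtype=int)            # move dc used to reach each cell

    for r in range(1, H):
        new_score = np.full(W, -np.inf)
        for c in range(W):
            for k, dc in enumerate((-1, 0, 1)):
                p = c - dc                        # predecessor column in the previous row
                if 0 <= p < W and score[p] > -np.inf:
                    s = score[p] + log_prior[k] + log_lik_ratio[r, c]
                    if s > new_score[c]:
                        new_score[c] = s
                        back[r, c] = dc
        score = new_score

    cols = [int(np.argmax(score))]                # best end point, then backtrack
    for r in range(H - 1, 0, -1):
        cols.append(cols[-1] - back[r, cols[-1]])
    return cols[::-1]
```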

47 But for certain geometry probabilities we observed that the sampled curves had “clumping” where the path consists of a large number of zig-zags. [sent-107, score-0.356]

48 To obtain computer simulations of target curves in background clutter we proceed in two stages. [sent-111, score-0.275]

49 In the first stage, we stochastically sample from the distribution PG(t) to produce a target curve in the pyramid (starting at the apex and moving downwards). [sent-112, score-0.342]

50 So if a pixel x is on or off the target curve (which we generated in the first stage) then we sample the intensity I(x) from the distribution Pon(I) or Poff(I) respectively. [sent-114, score-0.418]
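The two-stage sampling procedure can be sketched as follows; only the structure (first sample a path from PG, then draw on/off intensities from Pon and Poff) is taken from the text, while the exponential intensity distributions, the particular P(straight) value, and all parameter names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_stimulus(H, W, p_straight, lam_on, lam_off, start_col=None):
    """Two-stage stimulus sampler.

    Stage 1: sample a path from the geometry prior PG, parameterised here only by
             p_straight (the remaining mass is split equally between left and right turns).
    Stage 2: draw every pixel intensity from Poff, then overwrite the path pixels with
             draws from Pon.  Exponential intensity distributions with means lam_on and
             lam_off are an illustrative assumption, not the distributions used in the paper.
    """
    if start_col is None:
        start_col = W // 2                        # stand-in for the pyramid apex
    p_turn = (1.0 - p_straight) / 2.0
    moves = rng.choice([-1, 0, 1], size=H - 1, p=[p_turn, p_straight, p_turn])
    cols = np.clip(np.concatenate(([start_col], start_col + np.cumsum(moves))), 0, W - 1)

    image = rng.exponential(scale=lam_off, size=(H, W))                 # background: Poff
    image[np.arange(H), cols] = rng.exponential(scale=lam_on, size=H)   # target curve: Pon
    return cols, image

# Example with hypothetical parameter values (P(straight) = 0.6 is illustrative only).
path, img = sample_stimulus(H=32, W=33, p_straight=0.6, lam_on=3.0, lam_off=1.0)
```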

51 4 Order Parameters and Performance Measures Yuille et al. [14],[15] analyzed the Geman and Jedynak model [6] to determine how the ability to detect the target curve depended on the geometry PG and the intensity properties Pon. [sent-115, score-0.636]

52 The analysis showed that the ability to detect the target curve behaves as e^(−KN), where N is the length of the curve and K is an order parameter. [sent-117, score-0.486]

53 If K > 0 then detecting the target curve is possible but if K < 0 then it becomes impossible to find it (informally, it becomes like looking for a needle in a haystack). [sent-120, score-0.333]

54 The order parameter illustrates the trade-off between shape and intensity cues and determines which types of curves are easiest to detect by an ideal observer. [sent-121, score-0.836]

55 The intensity cues are quantified by D(Pon || Poff) and the shape cues by D(PG || U). [sent-122, score-0.352]
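A small sketch of the two cue strengths just mentioned, computed as Kullback-Leibler divergences for discrete distributions; the 4-bin intensity histograms are hypothetical placeholders, since the exact forms of Pon and Poff are not reproduced here:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions
    (assumes q > 0 wherever p > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def shape_cue(p_straight):
    """D(PG || U): divergence of the geometry prior from the uniform prior over Q = 3 moves."""
    p_turn = (1.0 - p_straight) / 2.0
    return kl([p_turn, p_straight, p_turn], [1.0 / 3.0] * 3)

# Intensity cue D(Pon || Poff) on hypothetical 4-bin intensity histograms;
# the actual parametric forms of Pon and Poff are not reproduced here.
pon  = [0.10, 0.20, 0.30, 0.40]
poff = [0.40, 0.30, 0.20, 0.10]
print(shape_cue(0.6), kl(pon, poff))
```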

56 The easiest curves to detect are those which are straight lines (i. [sent-123, score-0.491]

57 The hardest curves to detect are those for which the geometry is most random. [sent-126, score-0.361]

58 So when comparing human performance to ideal observers we have to take into account that some types of curves are inherently easier to detect (i. [sent-130, score-1.012]

59 Human observers are good at detecting straight line curves but so are ideal observers. [sent-133, score-0.989]

60 We need performance measures to quantify the relative effectiveness of human and ideal observers. [sent-134, score-0.581]

61 Otherwise, we will not be able to conclude that human observers are biased towards particular curve shapes (such as those occurring in natural images). [sent-135, score-0.672]

62 We now define two performance measures to quantify the relative effectiveness of human and ideal observers. [sent-136, score-0.581]

63 Our first measure is based on the hypothesis that human observers have an “effective order parameter”. [sent-137, score-0.496]

64 In other words, their performance on the target curve tracking task behaves like e^(−N KH), where KH is an effective order parameter whose difference from the true order parameter K might reflect a human bias towards straight lines or ecological shape priors. [sent-138, score-1.009]

65 We estimate the effective order parameters by fixing PG, Poff and adjusting Pon until the observers achieve a fixed performance level of at most 5 errors on a path of length 32. [sent-139, score-0.427]

66 This gives distributions Pon^I, Pon^H for the ideal and human observers respectively. [sent-140, score-0.793]

67 Then we set KH = K − D(Pon^H || Poff) + D(Pon^I || Poff), where Pon^H, Pon^I are the distributions used by the human and the ideal (respectively) to achieve similar performance. [sent-141, score-0.515]
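Reading the garbled superscripts according to the "human and the ideal (respectively)" phrasing above, the gap between the true and effective order parameters reduces to a difference of two divergences, ∆K = K − KH = D(Pon^H || Poff) − D(Pon^I || Poff). A minimal, self-contained sketch under that assumption:

```python
import numpy as np

def kl(p, q):
    """D(p || q) for discrete distributions (same helper as in the earlier sketch)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def delta_K(pon_human, pon_ideal, poff):
    """Delta K = K - KH = D(Pon^H || Poff) - D(Pon^I || Poff).

    pon_human, pon_ideal : intensity histograms at the human / ideal observer's
                           performance threshold (hypothetical discrete bins).
    poff                 : the fixed background intensity histogram.
    A value near zero indicates the human uses the intensity information
    almost as efficiently as the ideal observer.
    """
    return kl(pon_human, poff) - kl(pon_ideal, poff)
```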

68 The experimental criterion that the target path be found with 5 or less errors, see section (5), was not included in the theoretical analysis [14],[15]. [sent-144, score-0.228]

69 Also some small corrections need to be made to the order parameters due to the nature of the rectangular grid, see [15] for computer calculations of the size of these corrections. [sent-145, score-0.213]

70 This motivates a second performance measure where we calculate the value of the posterior probability (proportional to the exponential of r in equation (1)) for the curve detected by the human and the ideal observer (for identical distributions PG , Pon , Poff ). [sent-147, score-0.918]
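A sketch of this second measure: the reward r of a traced path is its log-posterior up to path-independent constants, and the measure is the difference ∆r = r_ideal − r_human on the same stimulus; the function signature mirrors the map_curve sketch above and is ours, not the authors':

```python
import numpy as np

def log_posterior_reward(cols, log_lik_ratio, log_prior):
    """Reward r of a traced path: the log-posterior up to path-independent constants,
    i.e. the intensity log-likelihood ratios summed along the path plus the log
    geometry prior of its moves (the posterior is proportional to exp(r)).

    cols          : column of the path in each row (e.g. as returned by map_curve above,
                    or as traced by a human observer).
    log_lik_ratio : (H, W) array of log Pon(I)/Poff(I).
    log_prior     : log PG over the moves dc = -1, 0, +1.
    """
    cols = np.asarray(cols, dtype=int)
    rows = np.arange(len(cols))
    r = float(np.sum(log_lik_ratio[rows, cols]))
    r += float(np.sum([log_prior[dc + 1] for dc in np.diff(cols)]))
    return r

# Second performance measure on one stimulus (Delta r = r_ideal - r_human):
# delta_r = log_posterior_reward(ideal_path, llr, logPG) - log_posterior_reward(human_path, llr, logPG)
```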

71 5 Experimental Results on Rectangular Grid To assess human performance on the road tracking task, we first had a set of 7 observers find the target curve in a tree defined by a rectangular grid, figure (3)A. [sent-150, score-1.155]

72 The observer tracked the ... (Figure 3 panel labels: Stimulus, Rectangular grid, Human, Ideal) [sent-161, score-0.217]

73 contour by starting at the far left corner and making a series of 32 key presses that moved the observer’s tracked contour either left, right, or straight at each key press. [sent-161, score-0.568]

74 Each contour estimate was scored by counting the number of positions the observer’s contour was off the true path. [sent-162, score-0.318]

75 Each observer had a training period in which the observer was shown examples of contours produced from the four different geometry distributions and practiced tracing in noise. [sent-163, score-0.628]

76 During an experimental session, the geometry distribution was fixed at one of the four possible values and observers were told which geometry distribution was being used to generate the contours. [sent-164, score-0.581]

77 The parameter λon of Pon was varied using an adaptive procedure until the human observer managed to repeatedly detect the target curve with at most five misclassified pixels. [sent-165, score-0.729]
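The paper does not spell out the adaptive rule, so the following is only a generic one-up/one-down staircase sketch of how λon might be adjusted until the observer reliably traces the curve with at most five misclassified pixels; run_trial is a hypothetical callback:

```python
def adaptive_lambda_on(run_trial, lam_start=3.0, step=0.25, n_trials=40, max_errors=5):
    """Generic one-up/one-down staircase for the intensity parameter lambda_on.

    run_trial(lam_on) is a hypothetical callback: it runs one tracing trial at the
    given contrast and returns the number of misclassified pixels.  lambda_on is
    lowered (task made harder) after a success and raised after a failure; the
    threshold is estimated from the last few reversal points.
    """
    lam, prev_success, reversals = lam_start, None, []
    for _ in range(n_trials):
        success = run_trial(lam) <= max_errors
        if prev_success is not None and success != prev_success:
            reversals.append(lam)
        lam = max(0.0, lam - step) if success else lam + step
        prev_success = success
    tail = reversals[-6:] if reversals else [lam]
    return sum(tail) / len(tail)
```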

78 The thresholds for 7 observers and the ideal observer are shown in figure 4. [sent-169, score-0.75]

79 These thresholds can be used to calculate our first performance measure (∆K) and determine how effectively observers are using the available image information at each P (straight). [sent-170, score-0.385]

80 The results are illustrated in figure (4)B where the human data was averaged over seven subjects. [sent-171, score-0.218]

81 We next compute our second performance measure (for which Pon, Poff, PG are the same for the ideal and the human observer). [sent-176, score-0.542]

82 The average difference of this performance measure for each geometry distribution is an alternative way of assessing how well observers are using the intensity information as a function of geometry, with a zero difference indicating optimal use of the information. [sent-177, score-0.639]

83 We conclude that our results are consistent either with a bias to ecological statistics or to straight lines. [sent-182, score-0.38]

84 The difference between human and ideal K order parameters. [sent-192, score-0.469]

85 The average reward difference between ideal and human observers. [sent-194, score-0.469]

86 We performed experiments on the hexagonal lattice under four different probabilities for the geometry. [sent-197, score-0.329]

87 We show examples of the stimuli, the ideal results (indicated by dotted path), and the human results (indicated by dotted path) for the Clumping and No-Clumping cases in figure (4B & C), respectively. [sent-207, score-0.469]

88 The average reward difference, ∆r = r_ideal − r_human, results for Clumping and No Clumping are summarized in figure (4F & I). [sent-210, score-0.28]

89 Both performance measures give consistent results for the Clumping data suggesting that humans are best when detecting the straightest lines (P (straight) = 0. [sent-211, score-0.224]

90 But the situation is more complicated for the No Clumping case where human observers show preferences for P (straight) = 0. [sent-213, score-0.467]

91 7 Summary and Conclusions The results of our experiments suggest that humans are most effective at detecting curves which are straight or which obey ecological statistics. [sent-216, score-0.595]

92 Our two performance measures were not always consistent, particularly for the rectangular grid (we are analyzing this discrepancy theoretically). [sent-218, score-0.37]

93 The first measure suggested a bias towards ecological statistics on the rectangular grid and for No Clumping stimuli on the hexagonal grid. [sent-219, score-0.788]

94 The second measure showed a bias towards curves with P (straight) = 0. [sent-220, score-0.235]

95 To our knowledge, this is the first experiment which tests the performance of human observers for detecting target curves by comparison with that of an ideal observer with ambiguous intensity data. [sent-222, score-1.492]

96 Our novel experimental design and stimuli may cause artifacts due to the rectangular and hexagonal grids. [sent-223, score-0.527]

97 Further experiments performed on a larger number of subjects may be able to isolate more precisely the strategy that human observers employ. [sent-225, score-0.467]

98 Contour integration by the human visual system: evidence for a local “association field”. [sent-249, score-0.222]

99 The roles of polarity and symmetry in the perceptual grouping of contour fragments. [sent-257, score-0.236]

100 Edge co-occurrence in natural images predicts contour grouping performance. [sent-269, score-0.305]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pon', 0.372), ('ideal', 0.28), ('observers', 0.278), ('clumping', 0.274), ('hexagonal', 0.258), ('straight', 0.225), ('rectangular', 0.213), ('observer', 0.192), ('pof', 0.189), ('human', 0.189), ('pg', 0.177), ('contour', 0.159), ('intensity', 0.148), ('curves', 0.143), ('geometry', 0.14), ('curve', 0.138), ('target', 0.132), ('ti', 0.109), ('shape', 0.08), ('detect', 0.078), ('yuille', 0.074), ('path', 0.073), ('lattice', 0.071), ('grid', 0.071), ('jedynak', 0.069), ('ecological', 0.068), ('detecting', 0.063), ('cues', 0.062), ('geman', 0.06), ('po', 0.057), ('gure', 0.055), ('grouping', 0.054), ('psychophysical', 0.054), ('images', 0.054), ('statistics', 0.053), ('panel', 0.052), ('hess', 0.051), ('minneapolis', 0.051), ('distributions', 0.046), ('easiest', 0.045), ('gestalt', 0.045), ('kh', 0.045), ('minnesota', 0.045), ('performance', 0.044), ('measures', 0.042), ('lattices', 0.041), ('humans', 0.041), ('detection', 0.04), ('pyramid', 0.038), ('tracking', 0.038), ('natural', 0.038), ('fragments', 0.036), ('image', 0.034), ('apex', 0.034), ('fang', 0.034), ('hayes', 0.034), ('kersten', 0.034), ('schrater', 0.034), ('straightest', 0.034), ('tent', 0.034), ('sci', 0.034), ('bias', 0.034), ('visual', 0.033), ('stimuli', 0.033), ('psychology', 0.033), ('mn', 0.032), ('paths', 0.032), ('effective', 0.032), ('contours', 0.031), ('xi', 0.03), ('coughlan', 0.03), ('angeles', 0.03), ('studies', 0.03), ('seven', 0.029), ('road', 0.029), ('towards', 0.029), ('measure', 0.029), ('field', 0.027), ('tracing', 0.027), ('geisler', 0.027), ('alphabet', 0.027), ('quantify', 0.026), ('stimulus', 0.026), ('go', 0.025), ('gabor', 0.025), ('tracked', 0.025), ('measurements', 0.025), ('snr', 0.024), ('geometrical', 0.024), ('laws', 0.024), ('informally', 0.024), ('moves', 0.024), ('tree', 0.023), ('experimental', 0.023), ('perceptual', 0.023), ('obey', 0.023), ('acad', 0.023), ('natl', 0.023), ('ambiguous', 0.023), ('degrees', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 85 nips-2003-Human and Ideal Observers for Detecting Image Curves

Author: Fang Fang, Daniel Kersten, Paul R. Schrater, Alan L. Yuille

Abstract: This paper compares the ability of human observers to detect target image curves with that of an ideal observer. The target curves are sampled from a generative model which specifies (probabilistically) the geometry and local intensity properties of the curve. The ideal observer performs Bayesian inference on the generative model using MAP estimation. Varying the probability model for the curve geometry enables us to investigate whether human performance is best for target curves that obey specific shape statistics, in particular those observed on natural shapes. Experiments are performed with data on both rectangular and hexagonal lattices. Our results show that human observers’ performance approaches that of the ideal observer and is, in general, closest to the ideal for conditions where the target curve tends to be straight or similar to natural statistics on curves. This suggests a bias of human observers towards straight curves and natural statistics.

2 0.12197485 81 nips-2003-Geometric Analysis of Constrained Curves

Author: Anuj Srivastava, Washington Mio, Xiuwen Liu, Eric Klassen

Abstract: We present a geometric approach to statistical shape analysis of closed curves in images. The basic idea is to specify a space of closed curves satisfying given constraints, and exploit the differential geometry of this space to solve optimization and inference problems. We demonstrate this approach by: (i) defining and computing statistics of observed shapes, (ii) defining and learning a parametric probability model on shape space, and (iii) designing a binary hypothesis test on this space. 1

3 0.1016236 5 nips-2003-A Classification-based Cocktail-party Processor

Author: Nicoleta Roman, Deliang Wang, Guy J. Brown

Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners. 1 In t ro d u c t i o n The perceptual ability to detect, discriminate and recognize one utterance in a background of acoustic interference has been studied extensively under both monaural and binaural conditions [1, 2, 3]. The human auditory system is able to segregate a speech signal from an acoustic mixture using various cues, including fundamental frequency (F0), onset time and location, in a process that is known as auditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems that operate upon monaural input – however, systems that employ only this cue are limited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural listening compared to the monaural case has prompted research in designing cocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be applied to, among other things, enhancing speech recognition in noisy environments and improving binaural hearing aid design. In this study, we propose a sound segregation model using binaural cues extracted from the responses of a KEMAR dummy head that realistically simulates the filtering process of the head, torso and external ear. A typical approach for signal reconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively in order to enhance the target signal. Here, we employ an ideal binary mask [6], which selects the T-F units where the signal energy is greater than the noise energy. The ideal mask notion is motivated by the human auditory masking phenomenon, in which a stronger signal masks a weaker one in the same critical band. In addition, from a theoretical ASA perspective, an ideal binary mask gives a performance ceiling for all binary masks. Moreover, such masks have been recently shown to provide a highly effective front-end for robust speech recognition [10]. We show for mixtures of multiple sound sources that there exists a strong correlation between the relative strength of target and interference and estimated ITD/IID, resulting in a characteristic clustering across frequency bands. 
Consequently, we employ a nonparametric classification method to determine decision regions in the joint ITDIID feature space that correspond to an optimal estimate for an ideal mask. Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results by Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition obey to a large extent disjoint orthogonality in time and frequency. That is, at most one source has a nonzero energy at a specific time and frequency. Such models, however, assume input directly from microphone recordings and head-related filtering is not considered. Simulation of human binaural hearing introduces different constraints as well as clues to the problem. First, both ITD and IID should be utilized since IID is more reliable at higher frequencies than ITD. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band. The rest of the paper is organized as follows. The next section contains the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes our paper. 2 M od el a rch i t ect u re a n d a zi mu t h locali zat i o n Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask. The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberations are modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. HRTFs introduce a natural combination of ITD and IID into the signals that is extracted in the subsequent stages of the model. To simulate the auditory periphery we use a bank of 128 gammatone filters in the range of 80 Hz to 5 kHz as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal. Current models of azimuth localization almost invariably start with Jeffress’s crosscorrelation mechanism. For all frequency channels, we use the normalized crosscorrelation computed at lags equally distributed in the plausible range from –1 ms to 1 ms using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis resulting in a cross-correlogram structure. In addition, a ‘skeleton’ cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition. 
Assuming fixed sources, multiple locations are determined as peaks after summating the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed to the next stage. 3 B i n a ry ma s k est i mat i on The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of extracted ITD and IID features. Our theoretical analysis for two-source interactions in the case of pure tones shows relatively smooth changes for ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly in a narrow band the derived mean values of ITD/IID estimates vary monotonically with respect to R. To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus in two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each. For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with 10 ms overlap between consecutive time frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute ITDi as the peak location of the cross-correlation in the range 2π / ω i centered at the target ITD, where ω i indicates the center frequency of the ith channel. On the other hand, IID and R are computed as follows: ∑ t s i2 (t )     Ri = ∑ ∑ t li2 (t ) , t s i2 (t ) + ∑ ∑ t ri2 (t ) t ni2 (t )     IIDi = 20 log10 where l i and ri refer to the left and right peripheral output of the ith channel, respectively, s i refers to the output for the target signal, and ni that for the acoustic interference. In computing IIDi , we use 20 instead of 10 in order to compensate for the square root operation in the peripheral model. Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibits a systematic shift for both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations for the binaural cues. Consider a worst case: the target is silent and two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event at half of the distance between the two interference locations; for Fig. 2, it is 0° - the target location. However, the data in Fig. 2 has a low probability for this case and shows instead a clustering phenomenon, suggesting that in most cases only one source dominates a T-F unit. B 1 1 R R A theoretical empirical 0 -1 theoretical empirical 0 -15 1 ITD (ms) 15 IID (dB) Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. 
The solid curve shows the theoretical mean and the dash curve shows the data mean. A: The scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz. A B 1 C 10 1 IID s) 0.5 0 -10 IID (d B) 10 ) (dB R R 0 -0.5 m ITD ( -10 -0.5 m ITD ( s) 0.5 Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at -30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space. By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, which is clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies, where ITD is most salient at low frequencies and IID at high frequencies [2]. But a fixed cutoff frequency that separates the effective use of ITD and IID does not exist for different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes the system performance across different configurations. Differential training seems necessary for different channels given that there exist variations of ITD and, especially, IID values for different center frequencies. Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the 2-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the ith channel, we test the following two hypotheses. The first one is H 1 : target is dominant or Ri > 0.5 , and the second one is H 2 : interference is dominant or Ri < 0.5 . Based on the estimates of the bivariate densities p( x | H 1 ) and p( x | H 2 ) the classification is done by the maximum a posteriori decision rule: p( H 1 ) p( x | H 1 ) > p( H 2 ) p( x | H 2 ) . There exist a plethora of techniques for probability density estimation ranging from parametric techniques (e.g. mixture of Gaussians) to nonparametric ones (e.g. kernel density estimators). In order to completely characterize the distribution of the data we use the kernel density estimation method independently for each frequency channel. One approach for finding smoothing parameters is the least-squares crossvalidation method, which is utilized in our estimation. One cue not employed in our model is the interaural time difference between signal envelopes (IED). Auditory models generally employ IED in the high-frequency range where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues: ITD, IID and IED and have found no benefit for using IED in our system after incorporating ITD and IID [14]. 4 Pe rfo rmanc e an d c omp arison The performance of a segregation system can be assessed in different ways, depending on intended applications. To extensively evaluate our model, we use the following three criteria: 1) a signal-to-noise (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests. To conduct the SNR evaluation a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5]. 
To quantitatively assess system performance, we measure the SNR using the original target speech as signal: ∑ t 2 s o (t ) ∑ SNR = 10 log 10 (s o (t ) − s e (t ))2 t where s o (t ) represents the resynthesized original speech and s e (t ) the reconstructed speech from an estimated mask. One can measure the initial SNR by replacing the denominator with s N (t ) , the resynthesized original interference. Fig. 3 shows the systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus has 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across different noise types. Excellent results are obtained when the target is close to the median plane for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when target is at 80° (Fig. 3B). When spatial separation increases the performance improves even for side targets, to an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2]. Fig. 4 illustrates the performance in a three-source scenario with target in the median plane and two interfering sources at –30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set. The second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB. However, the average SNR gain obtained is approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources differs from blind source separation with independent component analysis. In order to draw a quantitative comparison, we have implemented Bodden’s cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition and it adapts to HRTFs. The separation stage of the model is based on estimation of the weights for a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible by incorporating aspects of the precedence effect into the localization stage, the estimation of Wiener filter weights is less robust than our binary estimation of ideal masks. Shown in Fig. 5, our model shows a considerable improvement over the Bodden system, producing a 3.5 dB average improvement. A B C 20 20 10 10 10 0 0 0 -10 SNR (dB) 20 -10 -10 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 3. Systematic results for two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: ‘cocktail party’; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°. 20 0 SNR (dB) SNR (dB) 5 -5 -10 -15 -20 10 0 -10 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 4. 
Evaluation for a three-source configuration: target at 0° and two interfering sources at –30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model. N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture. For the ASR evaluation, we use the missing-data technique as described in [10]. In this approach, a continuous density hidden Markov model recognizer is modified such that only acoustic features indicated as reliable in a binary mask are used during decoding. Hence, it works seamlessly with the output from our speech segregation system. We have implemented the missing data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained using the Hilbert envelope at the output of the gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope using an 8-ms first-order filter, sampling at a frame-rate of 10 ms and finally log-compressing. We use the bounded marginalization method for classification [10]. The task domain is recognition of connected digits, and both training and testing are performed on acoustic features from the left ear signal using the male speaker dataset in the TIDigits database. A 100 B 100 Correctness (%) Correctness (%) Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal masks for four SNR levels: 5 dB, 0 dB, -5 dB and –10 dB. Similarly, Fig. 6B shows the results for the three-source case with an added female speaker at -30°. The ideal mask exhibits only slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition. 80 60 40 20 5 dB Baseline Ideal Mask Estimated Mask 0 dB −5 dB 80 60 40 20 5 dB −10 dB Baseline Ideal Mask Estimated Mask 0 dB −5 dB −10 dB Figure 6. Recognition performance at different SNR values for original mixture (dotted line), ideal binary mask (dashed line) and estimated mask (solid line). A. Correctness score for a two-source case. B. Correctness score for a three-source case. Finally we evaluate our model on speech intelligibility with listeners with normal hearing. We use the Bamford-Kowal-Bench sentence database that contains short semantically predictable sentences [15]. The score is evaluated as the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects we randomly swap the locations for target and interference for different trials. In the unprocessed condition, binaural signals are produced by convolving original signals with the corresponding HRTFs and the signals are presented to a listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear and results are presented diotically. 80 80 Keyword score (%) B100 Keyword score (%) A 100 60 40 20 0 0 dB −5 dB −10 dB 60 40 20 0 Figure 7. 
Keyword intelligibility score for twelve native English speakers (median values and interquartile ranges) before (white bars) and after processing (black bars). A. Two-source condition (0° and 5°). B. Three-source condition (0°, 30° and -30°). Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, -5 dB and –10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score for the tested conditions and the improvement becomes larger as the SNR decreases (61% at –10 dB). Our informal observations suggest, as expected, that the intelligibility score improves for unprocessed mixtures when two sources are more widely separated than 5°. Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of –10 dB at the better ear. 5 C onclu si on We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, and configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask that cancels acoustic energy in T-F units where interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements for two and three source conditions. A c k n ow l e d g me n t s This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented in 2002 ICASSP. References [1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT press, 1990. [2] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization, Cambridge, MA: MIT press, 1997. [3] A. Bronkhorst, “The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions,” Acustica, vol. 86, pp. 117-128, 2000. [4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993. [5] G. J. Brown and M. P. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297-336, 1994. [6] G. Hu and D. L. Wang, “Monaural speech separation,” Proc. NIPS, 2002. [7] M. Bodden, “Modeling human sound-source localization and the cocktail-party-effect,” Acta Acoustica, vol. 1, pp. 43-55, 1993. [8] C. Liu et al., “A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers,” J. Acoust. Soc. Am., vol. 110, pp. 32183230, 2001. [9] T. Whittkop and V. Hohmann, “Strategy-selective noise reduction for binaural digital hearing aids,” Speech Comm., vol. 39, pp. 111-138, 2003. [10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., vol. 34, pp. 267285, 2001. [11] H. Glotin, F. Berthommier and E. Tessier, “A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition,” Proc. EUROSPEECH, pp. 2351-2354, 1999. [12] A. 
Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures,” Proc. ICASSP, 2000. [13] W. G. Gardner and K. D. Martin, “HRTF measurements of a KEMAR dummy-head microphone,” MIT Media Lab Technical Report #280, 1994. [14] N. Roman, D. L. Wang and G. J. Brown, “Speech segregation based on sound localization,” J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003. [15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of HearingImpaired Children, London: Academic press, 1979.

4 0.084379703 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

Author: Darya Chudova, Christopher Hart, Eric Mjolsness, Padhraic Smyth

Abstract: We propose a functional mixture model for simultaneous clustering and alignment of sets of curves measured on a discrete time grid. The model is specifically tailored to gene expression time course data. Each functional cluster center is a nonlinear combination of solutions of a simple linear differential equation that describes the change of individual mRNA levels when the synthesis and decay rates are constant. The mixture of continuous time parametric functional forms allows one to (a) account for the heterogeneity in the observed profiles, (b) align the profiles in time by estimating real-valued time shifts, (c) capture the synthesis and decay of mRNA in the course of an experiment, and (d) regularize noisy profiles by enforcing smoothness in the mean curves. We derive an EM algorithm for estimating the parameters of the model, and apply the proposed approach to the set of cycling genes in yeast. The experiments show consistent improvement in predictive power and within cluster variance compared to regular Gaussian mixtures. 1

5 0.083468862 168 nips-2003-Salient Boundary Detection using Ratio Contour

Author: Song Wang, Toshiro Kubota, Jeffrey M. Siskind

Abstract: This paper presents a novel graph-theoretic approach, named ratio contour, to extract perceptually salient boundaries from a set of noisy boundary fragments detected in real images. The boundary saliency is defined using the Gestalt laws of closure, proximity, and continuity. This paper first constructs an undirected graph with two different sets of edges: solid edges and dashed edges. The weights of solid and dashed edges measure the local saliency in and between boundary fragments, respectively. Then the most salient boundary is detected by searching for an optimal cycle in this graph with minimum average weight. The proposed approach guarantees the global optimality without introducing any biases related to region area or boundary length. We collect a variety of images for testing the proposed approach with encouraging results. 1

6 0.07318522 7 nips-2003-A Functional Architecture for Motion Pattern Processing in MSTd

7 0.06642016 67 nips-2003-Eye Micro-movements Improve Stimulus Detection Beyond the Nyquist Limit in the Peripheral Retina

8 0.063296765 119 nips-2003-Local Phase Coherence and the Perception of Blur

9 0.062821247 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification

10 0.062544882 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution

11 0.060664825 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

12 0.057766784 100 nips-2003-Laplace Propagation

13 0.054912493 12 nips-2003-A Model for Learning the Semantics of Pictures

14 0.054202855 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

15 0.054029096 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

16 0.052448172 133 nips-2003-Mutual Boosting for Contextual Inference

17 0.051260132 51 nips-2003-Design of Experiments via Information Theory

18 0.05005433 30 nips-2003-Approximability of Probability Distributions

19 0.048139852 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression

20 0.046889238 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.157), (1, -0.01), (2, 0.071), (3, -0.031), (4, -0.125), (5, -0.063), (6, 0.077), (7, -0.01), (8, -0.045), (9, 0.006), (10, -0.001), (11, 0.108), (12, 0.024), (13, 0.052), (14, 0.009), (15, 0.001), (16, 0.036), (17, -0.093), (18, 0.035), (19, 0.047), (20, -0.013), (21, -0.062), (22, -0.049), (23, -0.022), (24, -0.034), (25, 0.065), (26, 0.154), (27, -0.089), (28, -0.008), (29, 0.197), (30, -0.202), (31, -0.077), (32, 0.064), (33, 0.023), (34, -0.039), (35, 0.014), (36, -0.03), (37, -0.082), (38, 0.018), (39, -0.074), (40, -0.017), (41, -0.058), (42, 0.051), (43, -0.055), (44, -0.174), (45, 0.045), (46, -0.025), (47, 0.091), (48, -0.102), (49, 0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97355795 85 nips-2003-Human and Ideal Observers for Detecting Image Curves

Author: Fang Fang, Daniel Kersten, Paul R. Schrater, Alan L. Yuille

Abstract: This paper compares the ability of human observers to detect target image curves with that of an ideal observer. The target curves are sampled from a generative model which specifies (probabilistically) the geometry and local intensity properties of the curve. The ideal observer performs Bayesian inference on the generative model using MAP estimation. Varying the probability model for the curve geometry enables us to investigate whether human performance is best for target curves that obey specific shape statistics, in particular those observed on natural shapes. Experiments are performed with data on both rectangular and hexagonal lattices. Our results show that human observers’ performance approaches that of the ideal observer and is, in general, closest to the ideal for conditions where the target curve tends to be straight or similar to natural statistics on curves. This suggests a bias of human observers towards straight curves and natural statistics.

2 0.58046687 81 nips-2003-Geometric Analysis of Constrained Curves

Author: Anuj Srivastava, Washington Mio, Xiuwen Liu, Eric Klassen

Abstract: We present a geometric approach to statistical shape analysis of closed curves in images. The basic idea is to specify a space of closed curves satisfying given constraints, and exploit the differential geometry of this space to solve optimization and inference problems. We demonstrate this approach by: (i) defining and computing statistics of observed shapes, (ii) defining and learning a parametric probability model on shape space, and (iii) designing a binary hypothesis test on this space. 1

3 0.56272 168 nips-2003-Salient Boundary Detection using Ratio Contour

Author: Song Wang, Toshiro Kubota, Jeffrey M. Siskind

Abstract: This paper presents a novel graph-theoretic approach, named ratio contour, to extract perceptually salient boundaries from a set of noisy boundary fragments detected in real images. The boundary saliency is defined using the Gestalt laws of closure, proximity, and continuity. This paper first constructs an undirected graph with two different sets of edges: solid edges and dashed edges. The weights of solid and dashed edges measure the local saliency in and between boundary fragments, respectively. Then the most salient boundary is detected by searching for an optimal cycle in this graph with minimum average weight. The proposed approach guarantees the global optimality without introducing any biases related to region area or boundary length. We collect a variety of images for testing the proposed approach with encouraging results. 1

4 0.45179561 67 nips-2003-Eye Micro-movements Improve Stimulus Detection Beyond the Nyquist Limit in the Peripheral Retina

Author: Matthias H. Hennig, Florentin Wörgötter

Abstract: Even under perfect fixation the human eye is under steady motion (tremor, microsaccades, slow drift). The “dynamic” theory of vision [1, 2] states that eye-movements can improve hyperacuity. According to this theory, eye movements are thought to create variable spatial excitation patterns on the photoreceptor grid, which will allow for better spatiotemporal summation at later stages. We reexamine this theory using a realistic model of the vertebrate retina by comparing responses of a resting and a moving eye. The performance of simulated ganglion cells in a hyperacuity task is evaluated by ideal observer analysis. We find that in the central retina eye-micromovements have no effect on the performance. Here optical blurring limits vernier acuity. In the retinal periphery however, eye-micromovements clearly improve performance. Based on ROC analysis, our predictions are quantitatively testable in electrophysiological and psychophysical experiments. 1

5 0.42709082 5 nips-2003-A Classification-based Cocktail-party Processor

Author: Nicoleta Roman, Deliang Wang, Guy J. Brown

Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners. 1 In t ro d u c t i o n The perceptual ability to detect, discriminate and recognize one utterance in a background of acoustic interference has been studied extensively under both monaural and binaural conditions [1, 2, 3]. The human auditory system is able to segregate a speech signal from an acoustic mixture using various cues, including fundamental frequency (F0), onset time and location, in a process that is known as auditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems that operate upon monaural input – however, systems that employ only this cue are limited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural listening compared to the monaural case has prompted research in designing cocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be applied to, among other things, enhancing speech recognition in noisy environments and improving binaural hearing aid design. In this study, we propose a sound segregation model using binaural cues extracted from the responses of a KEMAR dummy head that realistically simulates the filtering process of the head, torso and external ear. A typical approach for signal reconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively in order to enhance the target signal. Here, we employ an ideal binary mask [6], which selects the T-F units where the signal energy is greater than the noise energy. The ideal mask notion is motivated by the human auditory masking phenomenon, in which a stronger signal masks a weaker one in the same critical band. In addition, from a theoretical ASA perspective, an ideal binary mask gives a performance ceiling for all binary masks. Moreover, such masks have been recently shown to provide a highly effective front-end for robust speech recognition [10]. We show for mixtures of multiple sound sources that there exists a strong correlation between the relative strength of target and interference and estimated ITD/IID, resulting in a characteristic clustering across frequency bands. 
Consequently, we employ a nonparametric classification method to determine decision regions in the joint ITD-IID feature space that correspond to an optimal estimate for an ideal mask. Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results by Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition are, to a large extent, disjointly orthogonal in time and frequency; that is, at most one source has nonzero energy at a specific time and frequency. Such models, however, assume input taken directly from microphone recordings, and head-related filtering is not considered. Simulating human binaural hearing introduces different constraints as well as clues to the problem. First, both ITD and IID should be utilized, since IID is more reliable at higher frequencies than ITD. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band. The rest of the paper is organized as follows. The next section presents the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes the paper.

2 Model architecture and azimuth localization

Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask. The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberation is modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. HRTFs introduce a natural combination of ITD and IID into the signals, which is extracted in the subsequent stages of the model. To simulate the auditory periphery we use a bank of 128 gammatone filters in the range of 80 Hz to 5 kHz, as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal. Current models of azimuth localization almost invariably start with Jeffress's cross-correlation mechanism. For all frequency channels, we use the normalized cross-correlation computed at lags equally distributed in the plausible range from –1 ms to 1 ms, using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis, resulting in a cross-correlogram structure. In addition, a ‘skeleton’ cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition.
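As a rough illustration of this localization front-end, the sketch below computes the normalized cross-correlation over the ±1 ms lag range with a 20 ms integration window for a single frequency channel; the lag-to-azimuth mapping and skeleton sharpening are omitted, and the function names and the validity constraint on `start` are assumptions introduced here, not code from the paper.

```python
import numpy as np

FS = 44100                     # sampling rate used in the paper
MAX_LAG = int(0.001 * FS)      # +/- 1 ms plausible ITD range
WIN = int(0.020 * FS)          # 20 ms integration window


def normalized_cross_correlation(left, right, start):
    """Normalized cross-correlation for one window of a single gammatone channel.

    `left` and `right` are the peripheral outputs for the two ears; `start` is
    assumed to satisfy MAX_LAG <= start <= len(right) - WIN - MAX_LAG.
    """
    seg_l = left[start: start + WIN]
    corr = np.empty(2 * MAX_LAG + 1)
    for i, lag in enumerate(range(-MAX_LAG, MAX_LAG + 1)):
        seg_r = right[start + lag: start + lag + WIN]
        denom = np.sqrt(np.sum(seg_l ** 2) * np.sum(seg_r ** 2)) + 1e-12
        corr[i] = np.dot(seg_l, seg_r) / denom
    return corr


def itd_estimate_ms(left, right, start):
    """ITD (in ms) taken as the lag of the cross-correlation peak for this window."""
    corr = normalized_cross_correlation(left, right, start)
    return (np.argmax(corr) - MAX_LAG) / FS * 1000.0
```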
Assuming fixed sources, multiple locations are determined as peaks after summing the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed into the next stage.

3 Binary mask estimation

The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of extracted ITD and IID features. Our theoretical analysis of two-source interactions in the case of pure tones shows relatively smooth changes of ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly in a narrow band, the derived mean values of the ITD/IID estimates vary monotonically with respect to R. To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus into two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each. For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with 10 ms overlap between consecutive frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute $\mathrm{ITD}_i$ as the peak location of the cross-correlation in the range $2\pi/\omega_i$ centered at the target ITD, where $\omega_i$ denotes the center frequency of the $i$th channel. IID and R, on the other hand, are computed as

$R_i = \frac{\sum_t s_i^2(t)}{\sum_t s_i^2(t) + \sum_t n_i^2(t)}, \qquad \mathrm{IID}_i = 20 \log_{10} \frac{\sum_t l_i^2(t)}{\sum_t r_i^2(t)},$

where $l_i$ and $r_i$ refer to the left and right peripheral output of the $i$th channel, respectively, $s_i$ refers to the output for the target signal, and $n_i$ to that for the acoustic interference. In computing $\mathrm{IID}_i$, we use 20 instead of 10 in order to compensate for the square-root operation in the peripheral model. Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibit a systematic shift of both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations in the binaural cues. Consider a worst case: the target is silent and the two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event midway between the two interference locations; for Fig. 2, this is 0°, the target location. However, the data in Fig. 2 assign a low probability to this case and instead show a clustering phenomenon, suggesting that in most cases only one source dominates a T-F unit.
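The relative strength and IID definitions above translate directly into code. The following is a minimal sketch, assuming the premixed per-channel peripheral responses are available during training; the function name and the small numerical guards are introduced here for illustration only.

```python
import numpy as np


def relative_strength_and_iid(s_i, n_i, l_i, r_i):
    """Per-channel relative strength R_i and IID_i for one time frame.

    s_i, n_i: peripheral (gammatone, half-wave rectified, sqrt-compressed)
    responses of the target and the interference alone; l_i, r_i: left and
    right responses to the mixture.  All are 1-D arrays covering one frame.
    """
    r_strength = np.sum(s_i ** 2) / (np.sum(s_i ** 2) + np.sum(n_i ** 2) + 1e-12)
    # The factor 20 (rather than 10) compensates for the square-root
    # compression applied in the peripheral model, as noted in the text.
    iid = 20.0 * np.log10((np.sum(l_i ** 2) + 1e-12) / (np.sum(r_i ** 2) + 1e-12))
    return r_strength, iid
```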
Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. The solid curve shows the theoretical mean and the dashed curve shows the data mean. A: Scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz.

Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at -30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space.

By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, which is clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies, where ITD is most salient at low frequencies and IID at high frequencies [2]. However, a fixed cutoff frequency that separates the effective use of ITD and IID does not exist across different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes the system performance across different configurations. Differential training seems necessary for different channels, given that there are variations of ITD and, especially, IID values for different center frequencies. Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the 2-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the $i$th channel, we test the following two hypotheses: $H_1$, the target is dominant ($R_i > 0.5$), and $H_2$, the interference is dominant ($R_i < 0.5$). Based on estimates of the bivariate densities $p(x \mid H_1)$ and $p(x \mid H_2)$, classification is done by the maximum a posteriori decision rule: choose $H_1$ if $p(H_1)\,p(x \mid H_1) > p(H_2)\,p(x \mid H_2)$. There is a plethora of techniques for probability density estimation, ranging from parametric techniques (e.g. mixtures of Gaussians) to nonparametric ones (e.g. kernel density estimators). In order to characterize the distribution of the data completely, we use kernel density estimation independently for each frequency channel. One approach for finding the smoothing parameters is the least-squares cross-validation method, which is what our estimation uses. One cue not employed in our model is the interaural time difference between signal envelopes (IED). Auditory models generally employ IED in the high-frequency range, where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues, ITD, IID and IED, and have found no benefit from using IED in our system once ITD and IID are incorporated [14].
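A minimal sketch of this per-channel density estimation and MAP classification might look as follows. It is not the authors' code: it substitutes scipy's default `gaussian_kde` bandwidth rule for the least-squares cross-validation bandwidths used in the paper, and the class and attribute names are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde


class ChannelMaskClassifier:
    """MAP classifier over the joint (ITD, IID) feature space for one channel."""

    def fit(self, features, r):
        """features: (n_frames, 2) array of [ITD, IID]; r: relative strength per frame."""
        target = features[r > 0.5].T         # H1: target dominant
        interf = features[r <= 0.5].T        # H2: interference dominant
        self.p_h1 = float(np.mean(r > 0.5))  # prior estimated from the training data
        self.kde_h1 = gaussian_kde(target)   # paper uses least-squares CV bandwidths;
        self.kde_h2 = gaussian_kde(interf)   # scipy's default rule is used here for brevity
        return self

    def predict(self, features):
        """Return the estimated binary mask (1 = target dominant) per frame."""
        x = features.T
        post_h1 = self.p_h1 * self.kde_h1(x)
        post_h2 = (1.0 - self.p_h1) * self.kde_h2(x)
        return (post_h1 > post_h2).astype(int)
```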
4 Performance and comparison

The performance of a segregation system can be assessed in different ways, depending on the intended application. To evaluate our model extensively, we use the following three criteria: 1) a signal-to-noise ratio (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests. To conduct the SNR evaluation, a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5]. To quantitatively assess system performance, we measure the SNR using the original target speech as signal:

$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t s_o^2(t)}{\sum_t \big(s_o(t) - s_e(t)\big)^2},$

where $s_o(t)$ represents the resynthesized original speech and $s_e(t)$ the speech reconstructed from an estimated mask. One can measure the initial SNR by replacing the denominator with $\sum_t s_N^2(t)$, where $s_N(t)$ is the resynthesized original interference. Fig. 3 shows the systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus has 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across the different noise types. Excellent results are obtained when the target is close to the median plane, for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when the target is at 80° (Fig. 3B). When the spatial separation increases, the performance improves even for side targets, to an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2]. Fig. 4 illustrates the performance in a three-source scenario with the target in the median plane and two interfering sources at –30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set. The second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB. However, the average SNR gain obtained is approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources differs from blind source separation with independent component analysis. In order to draw a quantitative comparison, we have implemented Bodden's cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition, and it adapts to HRTFs. The separation stage of the model is based on estimating the weights of a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible in that it incorporates aspects of the precedence effect into the localization stage, the estimation of Wiener filter weights is less robust than our binary estimation of ideal masks. As shown in Fig. 5, our model yields a considerable improvement over the Bodden system, with a 3.5 dB average gain.

Figure 3. Systematic results for the two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using the ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: ‘cocktail party’; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°.
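The SNR criterion above is simple to compute once the resynthesized signals are in hand. The sketch below assumes the resynthesized target and the reconstructed output are already time-aligned arrays; the function names and the small guard constants are assumptions added for illustration.

```python
import numpy as np


def snr_db(resynth_target, reconstructed):
    """SNR (dB) of a reconstructed signal against the resynthesized original target."""
    error = resynth_target - reconstructed
    return 10.0 * np.log10(np.sum(resynth_target ** 2) / (np.sum(error ** 2) + 1e-12))


def initial_snr_db(resynth_target, resynth_interference):
    """Initial mixture SNR: the denominator is the resynthesized interference energy."""
    return 10.0 * np.log10(
        np.sum(resynth_target ** 2) / (np.sum(resynth_interference ** 2) + 1e-12)
    )
```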
Figure 4. Evaluation for a three-source configuration: target at 0° and two interfering sources at –30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model.

Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture.

For the ASR evaluation, we use the missing-data technique as described in [10]. In this approach, a continuous-density hidden Markov model recognizer is modified such that only acoustic features indicated as reliable in a binary mask are used during decoding. Hence, it works seamlessly with the output of our speech segregation system. We have implemented the missing-data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained from the Hilbert envelope at the output of each gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope with an 8-ms first-order filter, sampling at a frame rate of 10 ms and finally log-compressing. We use the bounded marginalization method for classification [10]. The task domain is recognition of connected digits, and both training and testing are performed on acoustic features from the left-ear signal using the male speaker dataset in the TIDigits database. Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal masks at four SNR levels: 5 dB, 0 dB, -5 dB and –10 dB. Similarly, Fig. 6B shows the results for the three-source case with an added female speaker at -30°. The ideal mask exhibits only slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition.

Figure 6. Recognition performance at different SNR values for the original mixture (dotted line), ideal binary mask (dashed line) and estimated mask (solid line). A: Correctness score for the two-source case. B: Correctness score for the three-source case.

Finally, we evaluate our model on speech intelligibility with normal-hearing listeners. We use the Bamford-Kowal-Bench sentence database, which contains short, semantically predictable sentences [15]. The score is evaluated as the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects, we randomly swap the locations of target and interference across trials. In the unprocessed condition, binaural signals are produced by convolving the original signals with the corresponding HRTFs and the signals are presented to a listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear and the results are presented diotically.
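As a companion to the feature extraction described for the ASR evaluation above, the sketch below derives the envelope features for one gammatone channel. Interpreting the "8-ms first-order filter" as a first-order low-pass with an 8 ms time constant is an assumption, as are the function name and the flooring constant; this is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np
from scipy.signal import hilbert, lfilter

FS = 44100  # sampling rate assumed throughout


def envelope_features(channel_output, frame_hop_ms=10, tau_ms=8):
    """Per-frame log-envelope features for one gammatone channel.

    Hilbert envelope -> first-order low-pass smoothing (8 ms time constant)
    -> sampling at a 10 ms frame rate -> log compression.
    """
    envelope = np.abs(hilbert(channel_output))
    pole = np.exp(-1.0 / (tau_ms * 1e-3 * FS))
    smoothed = lfilter([1.0 - pole], [1.0, -pole], envelope)
    hop = int(frame_hop_ms * 1e-3 * FS)
    return np.log(smoothed[::hop] + 1e-12)
```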
Figure 7. Keyword intelligibility score for twelve native English speakers (median values and interquartile ranges) before processing (white bars) and after processing (black bars). A: Two-source condition (0° and 5°). B: Three-source condition (0°, 30° and -30°).

Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, -5 dB and –10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score for the tested conditions, and the improvement becomes larger as the SNR decreases (61% at –10 dB). Our informal observations suggest, as expected, that the intelligibility score for unprocessed mixtures improves when the two sources are separated by more than 5°. Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of –10 dB at the better ear.

5 Conclusion

We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, and configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask that cancels acoustic energy in T-F units where the interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements for two- and three-source conditions.

Acknowledgments

This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented at ICASSP 2002.

References

[1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.
[2] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization, Cambridge, MA: MIT Press, 1997.
[3] A. Bronkhorst, “The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions,” Acustica, vol. 86, pp. 117-128, 2000.
[4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993.
[5] G. J. Brown and M. P. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297-336, 1994.
[6] G. Hu and D. L. Wang, “Monaural speech separation,” Proc. NIPS, 2002.
[7] M. Bodden, “Modeling human sound-source localization and the cocktail-party-effect,” Acta Acoustica, vol. 1, pp. 43-55, 1993.
[8] C. Liu et al., “A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers,” J. Acoust. Soc. Am., vol. 110, pp. 3218-3230, 2001.
[9] T. Whittkop and V. Hohmann, “Strategy-selective noise reduction for binaural digital hearing aids,” Speech Comm., vol. 39, pp. 111-138, 2003.
[10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., vol. 34, pp. 267-285, 2001.
[11] H. Glotin, F. Berthommier and E. Tessier, “A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition,” Proc. EUROSPEECH, pp. 2351-2354, 1999.
[12] A. Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures,” Proc. ICASSP, 2000.
[13] W. G. Gardner and K. D. Martin, “HRTF measurements of a KEMAR dummy-head microphone,” MIT Media Lab Technical Report #280, 1994.
[14] N. Roman, D. L. Wang and G. J. Brown, “Speech segregation based on sound localization,” J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003.
[15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of Hearing-Impaired Children, London: Academic Press, 1979.

6 0.42293841 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

7 0.41448319 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters

8 0.4073509 140 nips-2003-Nonlinear Processing in LGN Neurons

9 0.39878944 30 nips-2003-Approximability of Probability Distributions

10 0.39069027 53 nips-2003-Discriminating Deformable Shape Classes

11 0.37895501 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science

12 0.35803941 51 nips-2003-Design of Experiments via Information Theory

13 0.35145843 75 nips-2003-From Algorithmic to Subjective Randomness

14 0.32903689 196 nips-2003-Wormholes Improve Contrastive Divergence

15 0.31470919 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

16 0.30688867 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons

17 0.29895139 12 nips-2003-A Model for Learning the Semantics of Pictures

18 0.28821933 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion

19 0.28776664 11 nips-2003-A Mixed-Signal VLSI for Real-Time Generation of Edge-Based Image Vectors

20 0.27899566 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.031), (11, 0.035), (20, 0.011), (29, 0.012), (30, 0.011), (35, 0.052), (51, 0.32), (53, 0.101), (68, 0.011), (69, 0.02), (71, 0.087), (76, 0.038), (85, 0.079), (91, 0.09), (99, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8295927 85 nips-2003-Human and Ideal Observers for Detecting Image Curves

Author: Fang Fang, Daniel Kersten, Paul R. Schrater, Alan L. Yuille

Abstract: This paper compares the ability of human observers to detect target image curves with that of an ideal observer. The target curves are sampled from a generative model which specifies (probabilistically) the geometry and local intensity properties of the curve. The ideal observer performs Bayesian inference on the generative model using MAP estimation. Varying the probability model for the curve geometry enables us investigate whether human performance is best for target curves that obey specific shape statistics, in particular those observed on natural shapes. Experiments are performed with data on both rectangular and hexagonal lattices. Our results show that human observers’ performance approaches that of the ideal observer and are, in general, closest to the ideal for conditions where the target curve tends to be straight or similar to natural statistics on curves. This suggests a bias of human observers towards straight curves and natural statistics.

2 0.67725801 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis

Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. Experimental results illustrate the viability of the approach as a non-linear extension of CCA. 1

3 0.5098744 126 nips-2003-Measure Based Regularization

Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein

Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1

4 0.50526243 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall

Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.

5 0.50464332 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

Author: Sanjiv Kumar, Martial Hebert

Abstract: In this paper we present Discriminative Random Fields (DRF), a discriminative framework for the classification of natural image regions by incorporating neighborhood spatial dependencies in the labels as well as the observed data. The proposed model exploits local discriminative models and allows to relax the assumption of conditional independence of the observed data given the labels, commonly used in the Markov Random Field (MRF) framework. The parameters of the DRF model are learned using penalized maximum pseudo-likelihood method. Furthermore, the form of the DRF model allows the MAP inference for binary classification problems using the graph min-cut algorithms. The performance of the model was verified on the synthetic as well as the real-world images. The DRF model outperforms the MRF model in the experiments. 1

6 0.50424081 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

7 0.50332296 113 nips-2003-Learning with Local and Global Consistency

8 0.50230032 143 nips-2003-On the Dynamics of Boosting

9 0.50161707 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

10 0.50083625 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

11 0.50075442 3 nips-2003-AUC Optimization vs. Error Rate Minimization

12 0.4993605 41 nips-2003-Boosting versus Covering

13 0.4991779 145 nips-2003-Online Classification on a Budget

14 0.49896923 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

15 0.49879524 30 nips-2003-Approximability of Probability Distributions

16 0.49813589 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

17 0.49810487 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

18 0.4978143 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

19 0.49779314 189 nips-2003-Tree-structured Approximations by Expectation Propagation

20 0.49761227 78 nips-2003-Gaussian Processes in Reinforcement Learning