nips nips2002 nips2002-174 knowledge-graph by maker-knowledge-mining

174 nips-2002-Regularized Greedy Importance Sampling


Source: pdf

Author: Finnegan Southey, Dale Schuurmans, Ali Ghodsi

Abstract: Greedy importance sampling is an unbiased estimation technique that reduces the variance of standard importance sampling by explicitly searching for modes in the estimation objective. Previous work has demonstrated the feasibility of implementing this method and proved that the technique is unbiased in both discrete and continuous domains. In this paper we present a reformulation of greedy importance sampling that eliminates the free parameters from the original estimator, and introduces a new regularization strategy that further reduces variance without compromising unbiasedness. The resulting estimator is shown to be effective for difficult estimation problems arising in Markov random field inference. In particular, improvements are achieved over standard MCMC estimators when the distribution has multiple peaked modes.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Greedy importance sampling is an unbiased estimation technique that reduces the variance of standard importance sampling by explicitly searching for modes in the estimation objective. [sent-3, score-1.36]

2 Previous work has demonstrated the feasibility of implementing this method and proved that the technique is unbiased in both discrete and continuous domains. [sent-4, score-0.191]

3 In this paper we present a reformulation of greedy importance sampling that eliminates the free parameters from the original estimator, and introduces a new regularization strategy that further reduces variance without compromising unbiasedness. [sent-5, score-0.683]

4 The resulting estimator is shown to be effective for difficult estimation problems arising in Markov random field inference. [sent-6, score-0.123]

5 In particular, improvements are achieved over standard MCMC estimators when the distribution has multiple peaked modes. [sent-7, score-0.144]

6 1 Introduction Many inference problems in graphical models can be cast as determining the expected value of a random variable of interest, given observations drawn according to a target distribution. [sent-8, score-0.151]

7 The family of stochastic inference methods can be grouped into the independent Monte Carlo methods (importance sampling and rejection sampling [7, 4]) and the dependent Markov Chain Monte Carlo (MCMC) methods (Gibbs sampling, Metropolis sampling, and Hybrid Monte Carlo) [7, 5, 8, 14]. [sent-16, score-0.469]

8 The goal of all these methods is to simulate drawing a random sample from a target distribution defined by a graphical model that is hard to sample from directly. [sent-17, score-0.196]

9 In this paper we improve the greedy importance sampling (GIS) technique introduced in [12, 11]. [sent-20, score-0.603]

10 GIS attempts to improve the variance of importance sampling by explicitly searching for important regions in the target distribution. [sent-21, score-0.626]

11 Previous work has shown that search can be incorporated in an importance sampler while maintaining unbiasedness, leading to improved estimation in simple problems. [sent-22, score-0.367]

12 However, the drawbacks of the previous GIS method are that it has free parameters whose settings affect estimation performance, and its importance weights are directed at achieving unbiasedness without necessarily being directed at reducing variance. [sent-23, score-0.522]

13 In this paper, we introduce a new, parameterless form of greedy importance sampling that performs comparably to the previous method given its best parameter settings. [sent-24, score-0.573]

14 We then introduce a new weight calculation scheme that preserves unbiasedness, but provides further variance reduction by “regularizing” the contributions each search path gives to the estimator. [sent-25, score-0.361]

15 Below we first review the generalized importance sampling procedure that forms the core of our estimators before describing the innovations that lead to improved estimators. [sent-27, score-0.625]

16 2 Generalized importance sampling Importance sampling is a useful technique for estimating expectations under a target distribution that cannot be sampled from directly. [sent-28, score-0.701]

17 The basic idea is to draw independent points according to a simple proposal distribution but then weight the points according to the ratio of the target density to the proposal density. [sent-29, score-0.299]
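
As a concrete illustration, here is a minimal sketch of this basic importance sampling estimator in its self-normalized ("indirect") form. The target p, proposal q, and integrand f below are illustrative placeholders, not the models used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                      # unnormalized, bimodal target density (placeholder)
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

def q_pdf(x):                  # wide Gaussian proposal density
    return np.exp(-0.5 * (x / 4.0) ** 2) / (4.0 * np.sqrt(2.0 * np.pi))

def f(x):                      # random variable whose expectation we want
    return x ** 2

xs = rng.normal(0.0, 4.0, size=10_000)   # draw independent points from the proposal
w = p(xs) / q_pdf(xs)                    # weight each point by target / proposal
print(np.sum(w * f(xs)) / np.sum(w))     # self-normalized estimate of the expectation
```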

18 The unbiasedness of this procedure is easy to establish, since for any random variable the expected weighted value under the proposal equals its expectation under the target. [sent-31, score-0.311]

19 The main difficulty with importance sampling is that even though it is an effective estimation technique when the proposal approximates the target over most of the domain, it performs poorly when the proposal does not have reasonable mass in high probability regions of the target. [sent-33, score-0.507]

20 A mismatch of this type results in a high variance estimator since the sample will almost always contain unrepresentative points but will intermittently be dominated by a few high weight points. [sent-34, score-0.323]

21 The idea behind greedy importance sampling (GIS) [11, 12] is to avoid generating under-weight samples by explicitly searching for significant regions in the target distribution. [sent-35, score-0.672]

22 Here, to each domain point we associate a fixed block of points of finite length. [sent-47, score-0.307]

23 When a point is drawn from the proposal distribution we recover its block and add the block points to the sample. [sent-48, score-0.362]

24 Ensuring unbiasedness then reduces to weighting the sampled points appropriately. [sent-49, score-0.367]
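
The block-construction step can be sketched as follows for a toy discrete 1-D target with a uniform proposal; the domain, the target, and the hill-climbing search are all illustrative assumptions, and the weighting step (the part that actually guarantees unbiasedness) is discussed separately below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete 1-D domain with an unnormalized target (illustrative only).
domain = np.arange(50)
target = np.exp(-0.05 * (domain - 35) ** 2) + 0.5 * np.exp(-0.1 * (domain - 10) ** 2)

def greedy_block(start):
    """Deterministic hill-climbing search from `start`; the visited points
    form the block associated with that initiating point."""
    block, x = [start], start
    while True:
        nbrs = [n for n in (x - 1, x + 1) if 0 <= n < len(domain)]
        best = max(nbrs, key=lambda n: target[n])
        if target[best] <= target[x]:
            return block           # reached a local maximum
        block.append(best)
        x = best

# Draw initiating points from a uniform proposal, recover each block, and pool
# the block points; these pooled points must then be weighted appropriately.
starts = rng.integers(0, len(domain), size=200)
sample = [x for s in starts for x in greedy_block(int(s))]
```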

25 To this end, [12] introduces an auxiliary weighting scheme that can be used to obtain unbiased estimates: to each pair of points such that one lies in the block of the other, one associates a weight, which intuitively is the weight that the initiating point assigns to that sample point in its block. [sent-50, score-0.725]

26 However, it is still possible to apply the “indirect” importance sampling procedure shown in Figure 1 by assigning the weights and renormalizing. [sent-55, score-0.483]

27 The drawback of the indirect procedure is that it is no longer unbiased at small sample sizes, but instead only becomes unbiased in the large sample limit [4]. [sent-56, score-0.607]

28 To keep the presentation simple we will focus on the “direct” form of importance sampling described in Figure 1 and establish unbiasedness for that case—keeping in mind that every extended form of importance sampling we discuss below can be converted to an “indirect” form. [sent-57, score-1.137]

29 There is no restriction on the blocks other than that they be finite—blocks can overlap and need not even contain their initiating point—however their union has to cover the sample space, and the proposal cannot put zero probability on initiating points in a way that leaves sample points uncovered. [sent-58, score-0.453]

30 Figure 1 (“Generalized” importance sampling box): draw points independently from the proposal, weight each point, and form the estimate. [sent-67, score-0.476]

31 Figure 1 (“Direct” importance sampling box): draw points independently and create a large sample out of the blocks. [sent-70, score-0.582]

32 Figure 1: Basic importance sampling procedures, with weights chosen to satisfy (1). [sent-75, score-0.519]

33 In fact, it is quite easy to prove that this yields unbiased estimates [12], since the expected weighted value obtained when sampling initiating points under the proposal equals the target expectation. [sent-80, score-0.403]

34 Crucially, this argument does not depend on how the block decomposition is chosen or how the weights are set, so long as they satisfy (1). [sent-85, score-0.156]

35 That is, one could fix any block decomposition and weighting scheme, even one that depends on the target distribution and random variable, without affecting the unbiasedness of the procedure. [sent-86, score-0.455]

36 Intuitively, this works because the block structure and weighting scheme are fixed a priori, and unbiasedness is achieved by sampling blocks and assigning fair weights to the points. [sent-87, score-0.788]

37 The generality of this outcome allows one to consider using a wide range of alternative importance sampling schemes, while employing appropriate weights to cancel any bias. [sent-88, score-0.444]

38 In particular, we will determine blocks on-line by following deterministic greedy search paths. [sent-89, score-0.334]

39 3 Parameter-free greedy importance sampling Our first contribution in this paper is to derive an efficient greedy importance sampling (GIS) procedure that involves no free parameters, unlike the proposal in [12]. [sent-90, score-1.256]

40 One key motivating principle behind GIS is to realize that the optimal proposal distribution for standard importance sampling is proportional to the absolute value of the integrand times the target density, which minimizes the resulting variance [10]. [sent-91, score-0.621]

41 GIS attempts to overcome a poor proposal distribution by explicitly searching for points that maximally increase the objective (Figure 2). [sent-92, score-0.23]

42 If this can be achieved, the resulting GIS procedure will be unbiased via the arguments of the previous section. [sent-94, score-0.176]

43 Figure 2 (“Greedy” importance sampling box): draw points independently from the proposal. [sent-118, score-0.444]

44 If the search is deterministic (which we assume) then the set of search paths entering any given point will form a tree. [sent-124, score-0.245]

45 In principle, the tree will have unbounded depth since the greedy search procedure does not stop until it has reached a local maximum. [sent-126, score-0.301]

46 Therefore, we distribute weight down the tree according to a convergent series, from the root level to deeper levels, fixing in advance the total weight allocated at each level. [sent-127, score-0.199]

47 Given the entire search tree this would be trivial, but the greedy search paths will typically provide only a single branch of the tree. [sent-133, score-0.387]

48 Then, following the path to a desired point, we successively divide the remaining weight at each point by the observed branching factors. [sent-136, score-0.249]

49 This scheme is efficient to compute because we require only the branching factors along a given search path to correctly allocate the weight. [sent-139, score-0.277]

50 This yields the following weighting scheme, which runs in linear time and exactly satisfies constraint (1): given a start point and a search path from that point, we assign each point on the path a weight determined by the level masses and the branching factors observed along the path. [sent-140, score-0.380]
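
A small sketch of how such a weight could be allocated along a single observed search path: the mass given to each tree level follows a convergent series (2^-j here, purely as an illustrative choice), and the share reaching a given point is successively divided by the observed in-branching factors. The exact formula in the paper is not recoverable from this extraction, so treat this as an interpretation.

```python
def path_weights(branching_factors, level_mass=lambda j: 0.5 ** j):
    """Allocate weight to the points visited along one greedy search path.

    branching_factors[j] is the observed in-branching factor at depth j + 1;
    level_mass(j) is the total weight assigned to tree level j (a convergent
    series, here 2**-j as an illustrative choice).
    """
    weights = [level_mass(0)]          # root of the search tree
    divisor = 1.0
    for j, b in enumerate(branching_factors, start=1):
        divisor *= b                   # successively divide by the branching factors
        weights.append(level_mass(j) / divisor)
    return weights

# Example: a path of depth 3 whose points have in-branching factors 2, 1, 3.
print(path_weights([2, 1, 3]))         # [1.0, 0.25, 0.125, 0.0208...]
```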

51 Therefore, the new weighting scheme provides an efficient unbiased method for implementing GIS that does not use any free parameters. [sent-143, score-0.226]

52 4 Variance reduction While GIS reduces variance by searching, the weight correction scheme outlined above is designed only to correct bias and does not specifically address variance issues. [sent-144, score-0.292]

53 In fact, one can exploit this additional flexibility to determine minimum variance unbiased estimators in simple cases. [sent-147, score-0.336]

54 Assume the search is constrained to move between adjacent points so that from every initial point the greedy search will move to the right until it hits the end point. [sent-149, score-0.434]

55 Any weighting scheme for this domain can be expressed as a matrix, shown in Figure 2, where each row corresponds to the search block retrieved by starting at the corresponding point. [sent-150, score-0.347]

56 However, it is the rows of this matrix that correspond to search blocks sampled during estimation. [sent-152, score-0.179]

57 If we assume a uniform proposal distribution then multiplying this matrix by the vector of point contributions gives the column vector of block estimates corresponding to each start point. [sent-153, score-0.190]

58 The variance of the overall estimator then becomes equal to the variance of this column vector. [sent-154, score-0.233]

59 Thus, the unbiasedness constraints behave orthogonally to the zero variance constraints: unbiasedness imposes a constraint on the columns of the matrix whereas zero variance imposes a constraint on its rows. [sent-157, score-0.776]

60 Since there are fewer constraints in total than variables, one can apparently solve for a zero variance unbiased estimator for all but the smallest domains. [sent-159, score-0.287]

61 Nevertheless, one can obtain an optimal GIS estimator by solving a quadratic program for the weights which minimizes variance subject to satisfying the linear unbiasedness constraints. [sent-161, score-0.399]
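
A toy sketch of this quadratic program for the 1-D chain example, where the block of start point i is {i, ..., n-1}: it minimizes the variance of the block totals subject to an assumed column-sum form of the unbiasedness constraint (1). The target values, the constraint form, and the choice of SLSQP are all illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from scipy.optimize import minimize

n = 5
target = np.array([0.05, 0.10, 0.15, 0.30, 0.40])   # toy target probabilities
f = np.arange(1.0, n + 1)                           # toy random variable
c = f * target                                      # per-point contributions
support = np.triu(np.ones((n, n)))                  # weight[i, j] allowed only if j is in block(i)

def unpack(v):
    W = np.zeros((n, n))
    W[support == 1] = v
    return W

def block_variance(v):                  # variance of the block totals (one per start point)
    return np.var(unpack(v) @ c)

def unbiasedness(v):                    # assumed form of constraint (1): each column sums to 1
    return unpack(v).sum(axis=0) - 1.0

W0 = support / support.sum(axis=0)      # start from a uniform split of each point's weight
res = minimize(block_variance, W0[support == 1], method="SLSQP",
               bounds=[(0.0, None)] * int(support.sum()),
               constraints={"type": "eq", "fun": unbiasedness})
print(unpack(res.x))                    # minimum-variance unbiased weight matrix (toy)
```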

62 (Note that the above discussion applies to any finite domain—all one needs to do is encode the search topology in the weight matrix.) [sent-163, score-0.161]

63 Rather, the point is to show that a significant amount of flexibility remains in setting the weights—even after the unbiasedness constraints have been satisfied—and that this additional flexibility can be exploited to reduce variance. [sent-164, score-0.281]

64 We can now extend these ideas to a more realistic, general situation: to reduce the variance of the GIS estimator developed in Section 3, our idea is to equalize the block totals among different search paths. [sent-165, score-0.487]

65 The main challenge is to adjust the weights in a way that equalizes block totals without introducing bias, and without requiring excessive computational overhead. [sent-166, score-0.190]

66 First note that when traversing a search path, the blocks sampled by GIS produce estimates of the target expectation. [sent-168, score-0.137]

67 This point will have been arrived at via some predecessor, but we could have arrived via any one of its possible predecessors. [sent-170, score-0.175]

68 We would like to equalize the block totals that would have been obtained by arriving via any one of these predecessor points. [sent-171, score-0.330]

69 The key to maintaining unbiasedness is to ensure that any weight calculation performed at a point in a search tree is consistent, regardless of the path taken to reach that point. [sent-172, score-0.533]

70 Since we cannot anticipate the initial points, it is only convenient to equalize the subtotals accumulated from the predecessors along the path up to the root. [sent-173, score-0.146]

71 We equalize the different predecessor totals by determining correction factors which satisfy the required constraints. [sent-177, score-0.223]

72 The equalization and unbiasedness constraints form a linear system whose solution we rescale to obtain positive factors. [sent-185, score-0.249]
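
The exact constraint system is not recoverable from this extraction, but the flavor of the idea can be sketched: given the subtotals produced by the different predecessors, solve a small linear system for correction factors that make those subtotals equal, with one normalization row standing in for the unbiasedness constraint (an assumption made here for illustration).

```python
import numpy as np

def equalizing_factors(subtotals):
    """Solve for factors a_i with a_i * t_i equal across predecessors,
    normalized so the factors sum to k (an assumed stand-in for the
    unbiasedness constraint in the paper)."""
    t = np.asarray(subtotals, dtype=float)
    k = len(t)
    A = np.zeros((k, k))
    b = np.zeros(k)
    for i in range(k - 1):              # equalization rows: a_i * t_i - a_{i+1} * t_{i+1} = 0
        A[i, i], A[i, i + 1] = t[i], -t[i + 1]
    A[-1, :] = 1.0                      # normalization row: the factors sum to k
    b[-1] = k
    return np.linalg.solve(A, b)

print(equalizing_factors([3.0, 1.5, 6.0]))   # approximately [0.857, 1.714, 0.429]
```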

73 Importantly, at a given search point, any of its predecessors will calculate the same correction scheme locally, regardless of which predecessor is actually sampled. [sent-188, score-0.277]

74 It is easy to prove that any fixed correction scheme that satisfies these constraints and is applied to an unbiased weighting will satisfy (1). [sent-190, score-0.239]

75 The benefit of this scheme is that it reduces variance while preserving unbiasedness. [sent-191, score-0.175]

76 5 Empirical results: Markov random field estimation To investigate the utility of the GIS estimators we conducted experiments on inference problems in Markov random fields. [sent-193, score-0.256]

77 Typically, standard MCMC methods such as Gibbs sampling and Metropolis sampling are applied to such problems, but their success is limited owing to the fact that these estimators tend to get trapped in local modes [7]. [sent-196, score-0.662]

78 Standard importance sampling is also a poor estimation strategy for these models because a simple proposal distribution (like uniform) has almost no chance of sampling in relevant regions of the target distribution [7]. [sent-198, score-0.811]

79 Explicitly searching for modes would seem to provide an effective estimation strategy for these problems. [sent-199, score-0.138]

80 We conducted experiments by fixing a model and temperature and running the estimators for a fixed amount of CPU time. [sent-218, score-0.233]

81 Each estimator was re-run 1000 times to estimate its root mean squared error (RMSE) on small models where exact answers could be calculated, or its standard deviation (STD) on large models where no such exact answer is feasible. [sent-219, score-0.160]

82 We compared estimators by controlling their run time (given a reasonable C implementation) not just their sample size, because the different estimators use different computational overheads, and run time is the only convenient way to draw a fair comparison. [sent-220, score-0.368]

83 For example, GIS methods require a substantial amount of additional computation to find the greedy search paths. (This variance reduction scheme applies naturally to unbiased direct estimators.) [sent-221, score-0.508]

84 Therefore, for indirect GIS we employ an alternative weighting scheme that attempts to maximize total block weight. [sent-223, score-0.278]

85 Interesting recent progress has been made on developing exact and approximate sampling methods for the special case of Ising models [9, 15, 13]. [sent-224, score-0.232]

86 Finding these search paths and calculating inward branching factors means the GIS methods must use substantially smaller sample sizes than their counterparts to ensure a fair comparison. [sent-319, score-0.217]

87 However, the GIS estimators still seem to obtain reasonable results despite their sample size disadvantage. [sent-320, score-0.169]

88 For the GIS procedures we implemented a simple search that only ascends in the target density (not the full estimation objective), and we only used a uniform proposal distribution in all our experiments. [sent-321, score-0.139]

89 We also only report results for the indirect versions of all importance samplers (cf. [sent-322, score-0.37]

90 Table 3 shows results for estimating expected energy in a generalized Ising model when the temperature is dropped from 1. [sent-326, score-0.167]
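
For reference, the kind of MCMC baseline being compared against here can be sketched as a single-site Gibbs sampler estimating the expected energy of a small Ising grid; the grid size, coupling, sweep counts, and temperature values are illustrative and do not reproduce the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def ising_energy(s, J=1.0):
    """Energy of a spin grid with nearest-neighbour coupling J (no wraparound)."""
    return -J * (np.sum(s[:-1, :] * s[1:, :]) + np.sum(s[:, :-1] * s[:, 1:]))

def gibbs_expected_energy(n=8, T=1.0, sweeps=1000, burn_in=200, J=1.0):
    """Estimate the expected energy of an n-by-n Ising model at temperature T
    with single-site Gibbs updates; at low T the chain tends to stay in one mode."""
    s = rng.choice([-1, 1], size=(n, n))
    energies = []
    for sweep in range(sweeps):
        for i in range(n):
            for j in range(n):
                nbr = sum(s[x, y] for x, y in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                          if 0 <= x < n and 0 <= y < n)
                p_up = 1.0 / (1.0 + np.exp(-2.0 * J * nbr / T))
                s[i, j] = 1 if rng.random() < p_up else -1
        if sweep >= burn_in:
            energies.append(ising_energy(s, J))
    return np.mean(energies)

print(gibbs_expected_energy(T=1.0), gibbs_expected_energy(T=0.2))
```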

91 Standard importance sampling (IS) is a poor estimator in this domain, even when it is able to use 4. [sent-330, score-0.544]

92 Next, comparing the importance sampling approaches to the MCMC methods, we see the dramatic effect of temperature reduction. [sent-335, score-0.561]

93 Owing to their simplicity (and an efficient implementation), the MCMC samplers were able to gather about 20 to 30 times as many data points as the GIS estimators in the same amount of time. [sent-336, score-0.205]

94 However, as the temperature is lowered, a well-known effect takes hold as the low energy configurations begin to dominate the distribution. [sent-338, score-0.167]

95 At low temperatures the modes around the low energy configurations become increasingly peaked and standard MCMC estimators become trapped in modes from which they are unable to escape [8, 7]. [sent-339, score-0.395]

96 Figures 3 and 4 show the RMSE curves of Gibbs sampling and GISreg, side by side, as the temperature is decreased in different models. [sent-341, score-0.321]

97 In contrast to the MCMC procedures, the GIS procedures exhibit almost no accuracy loss as the temperature is lowered, and in fact sometimes improve their performance. [sent-342, score-0.155]

98 It is important to note that greedy importance sampling is not equivalent to adaptive importance sampling. [sent-346, score-0.813]

99 Sample blocks are completely independent in GIS, but sample points are not independent in AIS. [sent-347, score-0.191]

100 Exact sampling with coupled Markov chains and applications to statistical mechanics. [sent-398, score-0.204]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gis', 0.621), ('unbiasedness', 0.249), ('importance', 0.24), ('gisreg', 0.213), ('sampling', 0.204), ('hrp', 0.16), ('unbiased', 0.137), ('greedy', 0.129), ('rmse', 0.124), ('gibbs', 0.122), ('block', 0.119), ('temperature', 0.117), ('estimators', 0.116), ('mcmc', 0.107), ('indirect', 0.094), ('search', 0.094), ('carlo', 0.086), ('blocks', 0.085), ('variance', 0.083), ('rp', 0.081), ('monte', 0.08), ('qp', 0.079), ('boltzmann', 0.073), ('proposal', 0.071), ('totals', 0.071), ('weight', 0.067), ('estimator', 0.067), ('branching', 0.066), ('scheme', 0.065), ('qy', 0.062), ('ising', 0.062), ('initiating', 0.062), ('predecessor', 0.062), ('inference', 0.061), ('predecessors', 0.056), ('modes', 0.056), ('ce', 0.056), ('draw', 0.055), ('qrp', 0.053), ('temperatures', 0.053), ('sample', 0.053), ('points', 0.053), ('equalize', 0.053), ('path', 0.052), ('energy', 0.05), ('searching', 0.049), ('owing', 0.046), ('xs', 0.045), ('graphical', 0.041), ('inward', 0.039), ('tree', 0.039), ('procedure', 0.039), ('procedures', 0.038), ('weighting', 0.038), ('root', 0.037), ('domain', 0.037), ('satisfy', 0.037), ('dagum', 0.036), ('gisnew', 0.036), ('gisold', 0.036), ('metro', 0.036), ('ppi', 0.036), ('samplers', 0.036), ('trapped', 0.036), ('ydr', 0.036), ('hq', 0.035), ('correction', 0.034), ('estimation', 0.033), ('poor', 0.033), ('constraint', 0.032), ('point', 0.032), ('paths', 0.031), ('wsu', 0.031), ('avg', 0.031), ('reg', 0.031), ('markov', 0.03), ('technique', 0.03), ('guration', 0.029), ('peaked', 0.028), ('fair', 0.028), ('lowered', 0.028), ('gurations', 0.028), ('exact', 0.028), ('reduces', 0.027), ('allocated', 0.026), ('bu', 0.026), ('metropolis', 0.026), ('ef', 0.026), ('target', 0.026), ('generalized', 0.026), ('deterministic', 0.026), ('hybrid', 0.025), ('arrived', 0.025), ('exibility', 0.024), ('explicitly', 0.024), ('implementing', 0.024), ('schuurmans', 0.024), ('imposes', 0.024), ('random', 0.023), ('estimating', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 174 nips-2002-Regularized Greedy Importance Sampling

Author: Finnegan Southey, Dale Schuurmans, Ali Ghodsi

Abstract: Greedy importance sampling is an unbiased estimation technique that reduces the variance of standard importance sampling by explicitly searching for modes in the estimation objective. Previous work has demonstrated the feasibility of implementing this method and proved that the technique is unbiased in both discrete and continuous domains. In this paper we present a reformulation of greedy importance sampling that eliminates the free parameters from the original estimator, and introduces a new regularization strategy that further reduces variance without compromising unbiasedness. The resulting estimator is shown to be effective for difficult estimation problems arising in Markov random field inference. In particular, improvements are achieved over standard MCMC estimators when the distribution has multiple peaked modes.

2 0.1804474 41 nips-2002-Bayesian Monte Carlo

Author: Zoubin Ghahramani, Carl E. Rasmussen

Abstract: We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.

3 0.11414781 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior

Author: Patrik O. Hoyer, Aapo Hyvärinen

Abstract: The responses of cortical sensory neurons are notoriously variable, with the number of spikes evoked by identical stimuli varying significantly from trial to trial. This variability is most often interpreted as ‘noise’, purely detrimental to the sensory system. In this paper, we propose an alternative view in which the variability is related to the uncertainty, about world parameters, which is inherent in the sensory stimulus. Specifically, the responses of a population of neurons are interpreted as stochastic samples from the posterior distribution in a latent variable model. In addition to giving theoretical arguments supporting such a representational scheme, we provide simulations suggesting how some aspects of response variability might be understood in this framework.

4 0.11093513 181 nips-2002-Self Supervised Boosting

Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton

Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.

5 0.077767901 168 nips-2002-Real-Time Monitoring of Complex Industrial Processes with Particle Filters

Author: Rubén Morales-menéndez, Nando D. Freitas, David Poole

Abstract: This paper discusses the application of particle filtering algorithms to fault diagnosis in complex industrial processes. We consider two ubiquitous processes: an industrial dryer and a level tank. For these applications, we compared three particle filtering variants: standard particle filtering, Rao-Blackwellised particle filtering and a version of RaoBlackwellised particle filtering that does one-step look-ahead to select good sampling regions. We show that the overhead of the extra processing per particle of the more sophisticated methods is more than compensated by the decrease in error and variance.

6 0.071290374 124 nips-2002-Learning Graphical Models with Mercer Kernels

7 0.063315891 169 nips-2002-Real-Time Particle Filters

8 0.062114611 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

9 0.059318807 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

10 0.05613514 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

11 0.056048289 94 nips-2002-Fractional Belief Propagation

12 0.055650584 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

13 0.05296297 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

14 0.05203525 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

15 0.051147088 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

16 0.050514709 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

17 0.050476689 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

18 0.048199899 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

19 0.045979153 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy

20 0.045405082 32 nips-2002-Approximate Inference and Protein-Folding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.16), (1, -0.018), (2, -0.053), (3, 0.036), (4, -0.021), (5, 0.049), (6, -0.119), (7, 0.087), (8, -0.04), (9, 0.006), (10, 0.027), (11, -0.035), (12, 0.114), (13, 0.043), (14, -0.006), (15, -0.003), (16, -0.075), (17, -0.04), (18, -0.026), (19, -0.104), (20, 0.015), (21, 0.09), (22, -0.058), (23, 0.107), (24, -0.137), (25, -0.198), (26, 0.064), (27, 0.07), (28, -0.052), (29, 0.038), (30, 0.152), (31, 0.04), (32, -0.062), (33, -0.047), (34, 0.123), (35, 0.006), (36, -0.037), (37, -0.13), (38, 0.106), (39, 0.115), (40, -0.206), (41, -0.05), (42, 0.027), (43, -0.022), (44, 0.108), (45, -0.038), (46, -0.112), (47, -0.123), (48, -0.029), (49, -0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96247053 174 nips-2002-Regularized Greedy Importance Sampling

Author: Finnegan Southey, Dale Schuurmans, Ali Ghodsi

Abstract: Greedy importance sampling is an unbiased estimation technique that reduces the variance of standard importance sampling by explicitly searching for modes in the estimation objective. Previous work has demonstrated the feasibility of implementing this method and proved that the technique is unbiased in both discrete and continuous domains. In this paper we present a reformulation of greedy importance sampling that eliminates the free parameters from the original estimator, and introduces a new regularization strategy that further reduces variance without compromising unbiasedness. The resulting estimator is shown to be effective for difficult estimation problems arising in Markov random field inference. In particular, improvements are achieved over standard MCMC estimators when the distribution has multiple peaked modes.

2 0.81375718 41 nips-2002-Bayesian Monte Carlo

Author: Zoubin Ghahramani, Carl E. Rasmussen

Abstract: We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.

3 0.60749924 168 nips-2002-Real-Time Monitoring of Complex Industrial Processes with Particle Filters

Author: Rubén Morales-menéndez, Nando D. Freitas, David Poole

Abstract: This paper discusses the application of particle filtering algorithms to fault diagnosis in complex industrial processes. We consider two ubiquitous processes: an industrial dryer and a level tank. For these applications, we compared three particle filtering variants: standard particle filtering, Rao-Blackwellised particle filtering and a version of RaoBlackwellised particle filtering that does one-step look-ahead to select good sampling regions. We show that the overhead of the extra processing per particle of the more sophisticated methods is more than compensated by the decrease in error and variance.

4 0.51035029 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior

Author: Patrik O. Hoyer, Aapo Hyvärinen

Abstract: The responses of cortical sensory neurons are notoriously variable, with the number of spikes evoked by identical stimuli varying significantly from trial to trial. This variability is most often interpreted as ‘noise’, purely detrimental to the sensory system. In this paper, we propose an alternative view in which the variability is related to the uncertainty, about world parameters, which is inherent in the sensory stimulus. Specifically, the responses of a population of neurons are interpreted as stochastic samples from the posterior distribution in a latent variable model. In addition to giving theoretical arguments supporting such a representational scheme, we provide simulations suggesting how some aspects of response variability might be understood in this framework.

5 0.43934351 107 nips-2002-Identity Uncertainty and Citation Matching

Author: Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser

Abstract: Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching—the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers. 1 INTRODUCTION Citation matching is the problem currently handled by systems such as Citeseer [1]. 1 Such systems process a large number of scientific publications to extract their citation lists. By grouping together all co-referring citations (and, if possible, linking to the actual cited paper), the system constructs a database of “paper” entities linked by the “cites(p 1 , p2 )” relation. This is an example of the general problem of determining the existence of a set of objects, and their properties and relations, given a collection of “raw” perceptual data; this problem is faced by intelligence analysts and intelligent agents as well as by citation systems. A key aspect of this problem is determining when two observations describe the same object; only then can evidence be combined to develop a more complete description of the object. Objects seldom carry unique identifiers around with them, so identity uncertainty is ubiquitous. For example, Figure 1 shows two citations that probably refer to the same paper, despite many superficial differences. Citations appear in many formats and are rife with errors of all kinds. As a result, Citeseer—which is specifically designed to overcome such problems—currently lists more than 100 distinct AI textbooks published by Russell 1 See citeseer.nj.nec.com. Citeseer is now known as ResearchIndex. [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994. Figure 1: Two citations that probably refer to the same paper. and Norvig on or around 1995, from roughly 1000 citations. Identity uncertainty has been studied independently in several fields. Record linkage [2] is a method for matching up the records in two files, as might be required when merging two databases. For each pair of records, a comparison vector is computed that encodes the ways in which the records do and do not match up. 
EM is used to learn a naive-Bayes distribution over this vector for both matched and unmatched record pairs, so that the pairwise match probability can then be calculated using Bayes’ rule. Linkage decisions are typically made in a greedy fashion based on closest match and/or a probability threshold, so the overall process is order-dependent and may be inconsistent. The model does not provide for a principled way to combine matched records. A richer probability model is developed by Cohen et al [3], who model the database as a combination of some “original” records that are correct and some number of erroneous versions. They give an efficient greedy algorithm for finding a single locally optimal assignment of records into groups. Data association [4] is the problem of assigning new observations to existing trajectories when multiple objects are being tracked; it also arises in robot mapping when deciding if an observed landmark is the same as one previously mapped. While early data association systems used greedy methods similar to record linkage, recent systems have tried to find high-probability global solutions [5] or to approximate the true posterior over assignments [6]. The latter method has also been applied to the problem of stereo correspondence, in which a computer vision system must determine how to match up features observed in two or more cameras [7]. Data association systems usually have simple observation models (e.g., Gaussian noise) and assume that observations at each time step are all distinct. More general patterns of identity occur in natural language text, where the problem of anaphora resolution involves determining whether phrases (especially pronouns) co-refer; some recent work [8] has used an early form of relational probability model, although with a somewhat counterintuitive semantics. Citeseer is the best-known example of work on citation matching [1]. The system groups citations using a form of greedy agglomerative clustering based on a text similarity metric (see Section 6). McCallum et al [9] use a similar technique, but also develop clustering algorithms designed to work well with large numbers of small clusters (see Section 5). With the exception of [8], all of the preceding systems have used domain-specific algorithms and data structures; the probabilistic approaches are based on a fixed probability model. In previous work [10], we have suggested a declarative approach to identity uncertainty using a formal language—an extension of relational probability models [11]. Here, we describe the first substantial application of the approach. Section 2 explains how to specify a generative probability model of the domain. The key technical point (Section 3) is that the possible worlds include not only objects and relations but also mappings from terms in the language to objects in the domain, and the probability model must include a prior over such mappings. Once the extended model has been defined, Section 4 details the probability distributions used. A general-purpose inference method is applied to the model. We have found Markov chain Monte Carlo (MCMC) to be effective for this and other applications (see Section 5); here, we include a method for generating effective proposals based on ideas from [9]. The system also incorporates an EM algorithm for learning the local probability models, such as the model of how author names are abbreviated, reordered, and misspelt in citations. 
Section 6 evaluates the performance of four datasets originally used to test the Citeseer algorithms [1]. As well as providing significantly better performance, our system is able to reason simultaneously about papers, authors, titles, and publication types, and does a good job of extracting this information from the grouped citations. For example, an author’s name can be identified more accurately by combining information from multiple citations of several different papers. The errors made by our system point to some interesting unmodeled aspects of the citation process. 2 RPMs Reasoning about identity requires reasoning about objects, which requires at least some of the expressive power of a first-order logical language. Our approach builds on relational probability models (RPMs) [11], which let us specify probability models over possible worlds defined by objects, properties, classes, and relations. 2.1 Basic RPMs At its most basic, an RPM, as defined by Koller et al [12], consists of • A set C of classes denoting sets of objects, related by subclass/superclass relations. • A set I of named instances denoting objects, each an instance of one class. • A set A of complex attributes denoting functional relations. Each complex attribute A has a domain type Dom[A] ∈ C and a range type Range[A] ∈ C. • A set B of simple attributes denoting functions. Each simple attribute B has a domain type Dom[B] ∈ C and a range V al[B]. • A set of conditional probability models P (B|P a[B]) for the simple attributes. P a[B] is the set of B’s parents, each of which is a nonempty chain of (appropriately typed) attributes σ = A1 . · · · .An .B , where B is a simple attribute. Probability models may be attached to instances or inherited from classes. The parent links should be such that no cyclic dependencies are formed. • A set of instance statements, which set the value of a complex attribute to an instance of the appropriate class. We also use a slight variant of an additional concept from [11]: number uncertainty, which allows for multi-valued complex attributes of uncertain cardinality. We define each such attribute A as a relation rather than a function, and we associate with it a simple attribute #[A] (i.e., the number of values of A) with a domain type Dom[A] and a range {0, 1, . . . , max #[A]}. 2.2 RPMs for citations Figure 2 outlines an RPM for the example citations of Figure 1. There are four classes, the self-explanatory Author, Paper, and Citation, as well as AuthorAsCited, which represents not actual authors, but author names as they appear when cited. Each citation we wish to match leads to the creation of a Citation instance; instances of the remaining three classes are then added as needed to fill all the complex attributes. E.g., for the first citation of Figure 1, we would create a Citation instance C1 , set its text attribute to the string “Metral M. ...August 1994.”, and set its paper attribute to a newly created Paper instance, which we will call P1 . We would then introduce max(#[author]) (here only 3, for simplicity) AuthorAsCited instances (D11 , D12 , and D13 ) to fill the P1 .obsAuthors (i.e., observed authors) attribute, and an equal number of Author instances (A 11 , A12 , and A13 ) to fill both the P1 .authors[i] and the D1i .author attributes. (The complex attributes would be set using instance statements, which would then also constrain the cited authors to be equal to the authors of the actual paper. 
2 ) Assuming (for now) that the value of C1 .parse 2 Thus, uncertainty over whether the authors are ordered correctly can be modeled using probabilistic instance statements. A11 Author A12 surname #(fnames) fnames A13 A21 D11 AuthorAsCited surname #(fnames) fnames author A22 A23 D12 D13 D21 D22 Paper D23 Citation #(authors) authors title publication type P1 P2 #(obsAuthors) obsAuthors obsTitle parse C1 C2 text paper Figure 2: An RPM for our Citeseer example. The large rectangles represent classes: the dark arrows indicate the ranges of their complex attributes, and the light arrows lay out all the probabilistic dependencies of their basic attributes. The small rectangles represent instances, linked to their classes with thick grey arrows. We omit the instance statements which set many of the complex attributes. is observed, we can set the values of all the basic attributes of the Citation and AuthorAsCited instances. (E.g., given the correct parse, D11 .surname would be set to Lashkari, and D12 .fnames would be set to (Max)). The remaining basic attributes — those of the Paper and Author instances — represent the “true” attributes of those objects, and their values are unobserved. The standard semantics of RPMs includes the unique names assumption, which precludes identity uncertainty. Under this assumption, any two papers are assumed to be different unless we know for a fact that they are the same. In other words, although there are many ways in which the terms of the language can map to the objects in a possible world, only one of these identity mappings is legal: the one with the fewest co-referring terms. It is then possible to express the RPM as an equivalent Bayesian network: each of the basic attributes of each of the objects becomes a node, with the appropriate parents and probability model. RPM inference usually involves the construction of such a network. The Bayesian network equivalent to our RPM is shown in Figure 3. 3 IDENTITY UNCERTAINTY In our application, any two citations may or may not refer to the same paper. Thus, for citations C1 and C2 , there is uncertainty as to whether the corresponding papers P 1 and P2 are in fact the same object. If they are the same, they will share one set of basic attributes; A11. surname D12. #(fnames) D12. surname A11. fnames D11. #(fnames) D12. fnames A21. #(fnames) A13. surname A12. fnames A21. fnames A13. fnames A13. #(fnames) D13. surname D11. fnames D11. surname D13. #(fnames) C1. #(authors) P1. title C1. text P1. pubtype C1. obsTitle A21. surname A23. surname A22. fnames D22. #(fnames) D12. surname D21. #(fnames) D22. fnames A23. fnames A23. #(fnames) D23. surname D21. fnames D13. fnames C1. parse A22. #(fnames) A22. surname A12. #(fnames) A12. surname A11. #(fnames) D23. fnames D21. surname D23. #(fnames) C2. #(authors) P2. title C2. parse C2. text C2. obsTitle P2. pubtype Figure 3: The Bayesian network equivalent to our RPM, assuming C 1 = C2 . if they are distinct, there will be two sets. Thus, the possible worlds of our probability model may differ in the number of random variables, and there will be no single equivalent Bayesian network. The approach we have taken to this problem [10] is to extend the representation of a possible world so that it includes not only the basic attributes of a set of objects, but also the number of objects n and an identity clustering ι, that is, a mapping from terms in the language (such as P1 ) to objects in the world. 
We are interested only in whether terms co-refer or not, so ι can be represented by a set of equivalence classes of terms. For example, if P1 and P2 are the only terms, and they co-refer, then ι is {{P1 , P2 }}; if they do not co-refer, then ι is {{P1 }, {P2 }}. We define a probability model for the space of extended possible worlds by specifying the prior P (n) and the conditional distribution P (ι|n). As in standard RPMs, we assume that the class of every instance is known. Hence, we can simplify these distributions further by factoring them by class, so that, e.g., P (ι) = C∈C P (ιC ). We then distinguish two cases: • For some classes (such as the citations themselves), the unique names assumptions remains appropriate. Thus, we define P (ιCitation ) to assign a probability of 1.0 to the one assignment where each citation object is unique. • For classes such as Paper and Author, whose elements are subject to identity uncertainty, we specify P (n) using a high-variance log-normal distribution. 3 Then we make appropriate uniformity assumptions to construct P (ιC ). Specifically, we assume that each paper is a priori equally likely to be cited, and that each author is a priori equally likely to write a paper. Here, “a priori” means prior to obtaining any information about the object in question, so the uniformity assumption is entirely reasonable. With these assumptions, the probability of an assignment ι C,k,m that maps k named instances to m distinct objects, when C contains n objects, is given by 1 n! P (ιC,k,m ) = (n − m)! nk When n > m, the world contains objects unreferenced by any of the terms. However, these filler objects are obviously irrelevant (if they affected the attributes of some named term, they would have been named as functions of that term.) Therefore, we never have to create them, or worry about their attribute values. Our model assumes that the cardinalities and identity clusterings of the classes are independent of each other, as well as of the attribute values. We could remove these assumptions. For one, it would be straightforward to specify a class-wise dependency model for n or ι using standard Bayesian network semantics, where the network nodes correspond to the cardinality attributes of the classes. E.g., it would be reasonable to let the total number of papers depend on the total number of authors. Similarly, we could allow ι to depend on the attribute values—e.g., the frequency of citations to a given paper might depend on the fame of the authors—provided we did not introduce cyclic dependencies. 4 The Probability Model We will now fill in the details of the conditional probability models. Our priors over the “true” attributes are constructed off-line, using the following resources: the 1990 Census data on US names, a large A.I. BibTeX bibliography, and a hand-parsed collection of 500 citations. We learn several bigram models (actually, linear combinations of a bigram model and a unigram model): letter-based models of first names, surnames, and title words, as well as higher-level models of various parts of the citation string. More specifically, the values of Author.fnames and Author.surname are modeled as having a a 0.9 chance of being 3 Other models are possible; for example, in situations where objects appear and disappear, P (ι) can be modeled implicitly by specifying the arrival, transition, and departure rates [6]. drawn from the relevant US census file, and a 0.1 chance of being generated using a bigram model learned from that file. 
The prior over Paper.titles is defined using a two-tier bigram model constructed using the bibliography, while the distributions over Author.#(fnames), Paper.#(authors), and Paper.pubType 4 are derived from our hand-parsed file. The conditional distributions of the “observed” variables given their true values (i.e., the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames) are modeled as noisy channels where each letter, or word, has a small probability of being deleted, or, alternatively, changed, and there is also a small probability of insertion. AuthorAsCited.fnames may also be abbreviated as an initial. The parameters of the corruption models are learnt online, using stochastic EM. Let us now return to Citation.parse, which cannot be an observed variable, since citation parsing, or even citation subfield extraction, is an unsolved problem. It is therefore fortunate that our approach lets us handle uncertainty over parses so naturally. The state space of Citation.parse has two different components. First of all, it keeps track of the citation style, defined as the ordering of the author and title subfields, as well as the format in which the author names are written. The prior over styles is learned using our hand-segmented file. Secondly, it keeps track of the segmentation of Citation.text, which is divided into an author segment, a title segment, and three filler segments (one before, one after, and one in between.) We assume a uniform distribution over segmentations. Citation.parse greatly constrains Citation.text: the title segment of Citation.text must match the value of Citation.obsTitle, while its author segment must match the combined values of the simple attributes of Citation.obsAuthors. The distributions over the remaining three segments of Citation.text are defined using bigram models, with the model used for the final segment chosen depending on the publication type. These models were, once more, learned using our pre-segmented file. 5 INFERENCE With the introduction of identity uncertainty, our model grows from a single Bayesian network to a collection of networks, one for each possible value of ι. This collection can be rather large, since the number of ways in which a set can be partitioned grows very quickly with the size of the set. 5 Exact inference is, therefore, impractical. We use an approximate method based on Markov chain Monte Carlo. 5.1 MARKOV CHAIN MONTE CARLO MCMC [13] is a well-known method for approximating an expectation over some distribution π(x), commonly used when the state space of x is too large to sum over. The weighted sum over the values of x is replaced by a sum over samples from π(x), which are generated using a Markov chain constructed to have π(x) as a stationary distribution. There are several ways of building up an appropriate Markov chain. In the Metropolis– Hastings method (M-H), transitions in the chain are constructed in two steps. First, a candidate next state x is generated from the current state x, using the (more or less arbitrary) proposal distribution q(x |x). The probability that the move to x is actually made is the acceptance probability, defined as α(x |x) = min 1, π(x )q(x|x ) . π(x)q(x |x) Such a Markov chain will have the right stationary distribution π(x) as long as q is defined in such a way that the chain is ergodic. It is even possible to factor q into separate proposals for various subsets of variables. 
In those situations, the variables that are not changed by the transition cancel in the ratio π(x )/π(x), so the required calculation can be quite simple. 4 Publication types range over {article, conference paper, book, thesis, and tech report} This sequence is described by the Bell numbers, whose asymptotic behaviour is more than exponential. 5 5.2 THE CITATION-MATCHING ALGORITHM The state space of our MCMC algorithm is the space of all the possible worlds, where each possible world contains an identity clustering ι, a set of class cardinalities n, and the values of all the basic attributes of all the objects. Since the ι is given in each world, the distribution over the attributes can be represented using a Bayesian network as described in Section 3. Therefore, the probability of a state is simply the product pf P (n), P (ι), and the probability of the hidden attributes of the network. Our algorithm uses a factored q function. One of our proposals attempts to change n using a simple random walk. The other suggests, first, a change to ι, and then, values for all the hidden attributes of all the objects (or clusters in ι) affected by that change. The algorithm for proposing a change in ιC works as follows: Select two clusters a1 , a2 ∈ ιC 6 Create two empty clusters b1 and b2 place each instance i ∈ a1 ∪ a2 u.a.r. into b1 or b2 Propose ιC = ιC − {a1, a2} ∪ {b1, b2} Given a proposed ιC , suggesting values for the hidden attributes boils down to recovering their true values from (possibly) corrupt observations, e.g., guessing the true surname of the author currently known both as “Simth” and “Smith”. Since our title and name noise models are symmetric, our basic strategy is to apply these noise models to one of the observed values. In the case of surnames, we have the additional resource of a dictionary of common names, so, some of the time, we instead pick one of the set of dictionary entries that are within a few corruptions of our observed names. (One must, of course, careful to account for this hybrid approach in our acceptance probability calculations.) Parses are handled differently: we preprocess each citation, organizing its plausible segmentations into a list ordered in terms of descending probability. At runtime, we simply sample from these discrete distributions. Since we assume that boundaries occur only at punctuation marks, and discard segmentations of probability < 10−6 , the lists are usually quite short. 7 The publication type variables, meanwhile, are not sampled at all. Since their range is so small, we sum them out. 5.3 SCALING UP One of the acknowledged flaws of the MCMC algorithm is that it often fails to scale. In this application, as the number of papers increases, the simplest approach — one where the two clusters a1 and a2 are picked u.a.r — is likely to lead to many rejected proposals, as most pairs of clusters will have little in common. The resulting Markov chain will mix slowly. Clearly, we would prefer to focus our proposals on those pairs of clusters which are actually likely to exchange their instances. We have implemented an approach based on the efficient clustering algorithm of McCallum et al [9], where a cheap distance metric is used to preprocess a large dataset and fragment it into many canopies, or smaller, overlapping sets of elements that have a non-zero probability of matching. We do the same, using word-matching as our metric, and setting the thresholds to 0.5 and 0.2. 
Then, at runtime, our q(x′|x) function proposes first a canopy c, and then a pair of clusters u.a.r. from c. (q(x|x′) is calculated by summing over all the canopies which contain any of the elements of the two clusters; see the sketch after the references below.)

6 EXPERIMENTAL RESULTS

We have applied the MCMC-based algorithm to the hand-matched datasets used in [1]. (Each of these datasets contains several hundred citations of machine learning papers, about half of them in clusters ranging in size from two to twenty-one citations.) We have also implemented their phrase matching algorithm, a greedy agglomerative clustering method based on a metric that measures the degree to which the words and phrases of any two citations overlap. (They obtain their “phrases” by segmenting each citation at all punctuation marks, and then taking all the bigrams of all the segments longer than two words.) The results of our comparison are displayed in Table 1, in terms of the Citeseer error metric. Clearly, the algorithm we have developed easily beats our implementation of phrase matching.

Table 1: Results on four Citeseer data sets, for the text matching and MCMC algorithms. The metric used is the percentage of actual citation clusters recovered perfectly; for the MCMC-based algorithm, this is an average over all the MCMC-generated samples.

                     Face             Reinforcement    Reasoning        Constraint
  Data set size      349 citations,   406 citations,   514 citations,   295 citations,
                     242 papers       148 papers       296 papers       199 papers
  Phrase matching    94%              79%              86%              89%
  RPM + MCMC         97%              94%              96%              93%

We have also applied our algorithm to a large set of citations referring to the textbook Artificial Intelligence: A Modern Approach. It clusters most of them correctly, but there are a couple of notable exceptions. Whenever several citations share the same set of unlikely errors, they are placed together in a separate cluster. This occurs because we do not currently model the fact that erroneous citations are often copied from reference list to reference list; this could be handled by extending the model to include a copiedFrom attribute. Another possible extension would be the addition of a topic attribute to both papers and authors: tracking the authors' research topics might enable the system to distinguish between similarly-named authors working in different fields. Generally speaking, we expect that relational probabilistic languages with identity uncertainty will be a useful tool for creating knowledge from raw data.

References
[1] S. Lawrence, K. Bollacker, and C. Lee Giles. Autonomous citation matching. In Agents, 1999.
[2] I. Fellegi and A. Sunter. A theory for record linkage. JASA, 1969.
[3] W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In KDD, 2000.
[4] Y. Bar-Shalom and T. E. Fortman. Tracking and Data Association. Academic Press, 1988.
[5] I. J. Cox and S. Hingorani. An efficient implementation and evaluation of Reid's multiple hypothesis tracking algorithm for visual tracking. In IAPR-94, 1994.
[6] H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many sensors. In IJCAI-99, 1999.
[7] F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun. Feature correspondence: A Markov chain Monte Carlo approach. In NIPS-00, 2000.
[8] E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. In AAAI, 1993.
[9] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD-00, 2000.
[10] H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In IJCAI-01, 2001.
[11] A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000.
[12] A. Pfeffer and D. Koller. Semantics and inference for recursive probability models. In AAAI/IAAI, 2000.
[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, 1996.
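As referenced above, the following is a minimal sketch combining the splitting move from Section 5.2 with the canopy-focused pair selection from Section 5.3. It assumes the build_canopies helper sketched earlier, represents clusters as frozensets of citation ids, and uses illustrative names throughout; it is not the authors' implementation, and the attribute resampling and acceptance-ratio bookkeeping are only indicated in comments.

```python
import random

def propose_pair_via_canopies(clustering, canopies, rng=random):
    """Pick a canopy u.a.r., then pick a pair of clusters (u.a.r.) that
    overlap it.  `clustering` is a set of frozensets of citation ids;
    `canopies` is the output of build_canopies."""
    canopy = rng.choice(canopies)
    candidates = [cl for cl in clustering if cl & canopy]
    return rng.choice(candidates), rng.choice(candidates)

def propose_split(clustering, a1, a2, rng=random):
    """Redistribute the members of a1 and a2 into two fresh clusters,
    following the four-step move of Section 5.2.  If a1 and a2 are the
    same cluster, the move will probably split it."""
    b1, b2 = set(), set()
    for i in a1 | a2:
        (b1 if rng.random() < 0.5 else b2).add(i)
    new_clusters = {frozenset(b) for b in (b1, b2) if b}
    return (clustering - {a1, a2}) | new_clusters

def mcmc_step(clustering, canopies, rng=random):
    """One schematic proposal.  Resampling the hidden attributes of the
    affected clusters and the Metropolis-Hastings acceptance test (where
    q(x|x') sums over every canopy containing elements of the two chosen
    clusters) are omitted from this sketch."""
    a1, a2 = propose_pair_via_canopies(clustering, canopies, rng)
    return propose_split(clustering, a1, a2, rng)
```

In a full sampler, a move like this would be accepted or rejected with the usual Metropolis-Hastings ratio, in which the attributes of objects untouched by the move cancel, as noted at the start of this entry.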

6 0.40644151 169 nips-2002-Real-Time Particle Filters

7 0.33701503 95 nips-2002-Gaussian Process Priors with Uncertain Inputs Application to Multiple-Step Ahead Time Series Forecasting

8 0.33364379 181 nips-2002-Self Supervised Boosting

9 0.33002803 179 nips-2002-Scaling of Probability-Based Optimization Algorithms

10 0.32968348 163 nips-2002-Prediction and Semantic Association

11 0.32011801 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

12 0.31355345 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques

13 0.29890403 42 nips-2002-Bias-Optimal Incremental Problem Solving

14 0.2920523 124 nips-2002-Learning Graphical Models with Mercer Kernels

15 0.29099074 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

16 0.27699357 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

17 0.27648649 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

18 0.27268082 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

19 0.26991603 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex

20 0.26829892 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.393), (23, 0.016), (42, 0.075), (54, 0.135), (55, 0.027), (57, 0.016), (67, 0.016), (68, 0.027), (74, 0.069), (92, 0.029), (98, 0.111)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95403671 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch

Author: Lawrence K. Saul, Daniel D. Lee, Charles L. Isbell, Yann LeCun

Abstract: We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user’s pitch scrolling across the screen as he or she sings into the computer.

same-paper 2 0.87182099 174 nips-2002-Regularized Greedy Importance Sampling

Author: Finnegan Southey, Dale Schuurmans, Ali Ghodsi

Abstract: Greedy importance sampling is an unbiased estimation technique that reduces the variance of standard importance sampling by explicitly searching for modes in the estimation objective. Previous work has demonstrated the feasibility of implementing this method and proved that the technique is unbiased in both discrete and continuous domains. In this paper we present a reformulation of greedy importance sampling that eliminates the free parameters from the original estimator, and introduces a new regularization strategy that further reduces variance without compromising unbiasedness. The resulting estimator is shown to be effective for difficult estimation problems arising in Markov random field inference. In particular, improvements are achieved over standard MCMC estimators when the distribution has multiple peaked modes.

3 0.81900084 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

Author: Elzbieta Pekalska, David Tax, Robert Duin

Abstract: Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. Finally, a comparison to a comparable one-class classifier by Campbell and Bennett is given.

4 0.80239671 125 nips-2002-Learning Semantic Similarity

Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor

Abstract: The standard representation of text documents as bags of words suffers from well-known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and vice versa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results.

5 0.65302372 147 nips-2002-Monaural Speech Separation

Author: Guoning Hu, Deliang Wang

Abstract: Monaural speech separation has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with speech in the highfrequency range. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. Motivated by this, we propose a model for monaural separation that deals with low-frequency and highfrequency signals differently. For resolved harmonics, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the model generates segments based on amplitude modulation (AM) in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. Underlying the separation process is a pitch contour obtained according to psychoacoustic constraints. Our model is systematically evaluated, and it yields substantially better performance than previous systems, especially in the high-frequency range. 1 In t rod u ct i on In a natural environment, speech usually occurs simultaneously with acoustic interference. An effective system for attenuating acoustic interference would greatly facilitate many applications, including automatic speech recognition (ASR) and speaker identification. Blind source separation using independent component analysis [10] or sensor arrays for spatial filtering require multiple sensors. In many situations, such as telecommunication and audio retrieval, a monaural (one microphone) solution is required, in which intrinsic properties of speech or interference must be considered. Various algorithms have been proposed for monaural speech enhancement [14]. These methods assume certain properties of interference and have difficulty in dealing with general acoustic interference. Monaural separation has also been studied using phasebased decomposition [3] and statistical learning [17], but with only limited evaluation. While speech enhancement remains a challenge, the auditory system shows a remarkable capacity for monaural speech separation. According to Bregman [1], the auditory system separates the acoustic signal into streams, corresponding to different sources, based on auditory scene analysis (ASA) principles. Research in ASA has inspired considerable work to build computational auditory scene analysis (CASA) systems for sound separation [19] [4] [7] [18]. Such systems generally approach speech separation in two main stages: segmentation (analysis) and grouping (synthesis). In segmentation, the acoustic input is decomposed into sensory segments, each of which is likely to originate from a single source. In grouping, those segments that likely come from the same source are grouped together, based mostly on periodicity. In a recent CASA model by Wang and Brown [18], segments are formed on the basis of similarity between adjacent filter responses (cross-channel correlation) and temporal continuity, while grouping among segments is performed according to the global pitch extracted within each time frame. In most situations, the model is able to remove intrusions and recover low-frequency (below 1 kHz) energy of target speech. However, this model cannot handle high-frequency (above 1 kHz) signals well, and it loses much of target speech in the high-frequency range. In fact, the inability to deal with speech in the high-frequency range is a common problem for CASA systems. 
We study monaural speech separation with particular emphasis on the high-frequency problem in CASA. For voiced speech, we note that the auditory system can resolve the first few harmonics in the low-frequency range [16]. It has been suggested that different perceptual mechanisms are used to handle resolved and unresolved harmonics [2]. Consequently, our model employs different methods to segregate resolved and unresolved harmonics of target speech. More specifically, our model generates segments for resolved harmonics based on temporal continuity and cross-channel correlation, and these segments are grouped according to common periodicity. For unresolved harmonics, it is well known that the corresponding filter responses are strongly amplitude-modulated and the response envelopes fluctuate at the fundamental frequency (F0) of target speech [8]. Therefore, our model generates segments for unresolved harmonics based on common AM in addition to temporal continuity. The segments are grouped according to AM repetition rates. We calculate AM repetition rates via sinusoidal modeling, which is guided by target pitch estimated according to characteristics of natural speech. Section 2 describes the overall system. In section 3, systematic results and a comparison with the Wang-Brown system are given. Section 4 concludes the paper. 2 M od el d escri p t i on Our model is a multistage system, as shown in Fig. 1. Description for each stage is given below. 2.1 I n i t i a l p r oc e s s i n g First, an acoustic input is analyzed by a standard cochlear filtering model with a bank of 128 gammatone filters [15] and subsequent hair cell transduction [12]. This peripheral processing is done in time frames of 20 ms long with 10 ms overlap between consecutive frames. As a result, the input signal is decomposed into a group of timefrequency (T-F) units. Each T-F unit contains the response from a certain channel at a certain frame. The envelope of the response is obtained by a lowpass filter with Segregated Speech Mixture Peripheral and Initial Pitch mid-level segregation tracking processing Unit Final Resynthesis labeling segregation Figure 1. Schematic diagram of the proposed multistage system. passband [0, 1 kHz] and a Kaiser window of 18.25 ms. Mid-level processing is performed by computing a correlogram (autocorrelation function) of the individual responses and their envelopes. These autocorrelation functions reveal response periodicities as well as AM repetition rates. The global pitch is obtained from the summary correlogram. For clean speech, the autocorrelations generally have peaks consistent with the pitch and their summation shows a dominant peak corresponding to the pitch period. With acoustic interference, a global pitch may not be an accurate description of the target pitch, but it is reasonably close. Because a harmonic extends for a period of time and its frequency changes smoothly, target speech likely activates contiguous T-F units. This is an instance of the temporal continuity principle. In addition, since the passbands of adjacent channels overlap, a resolved harmonic usually activates adjacent channels, which leads to high crosschannel correlations. Hence, in initial segregation, the model first forms segments by merging T-F units based on temporal continuity and cross-channel correlation. Then the segments are grouped into a foreground stream and a background stream by comparing the periodicities of unit responses with global pitch. A similar process is described in [18]. Fig. 
2(a) and Fig. 2(b) illustrate the segments and the foreground stream. The input is a mixture of a voiced utterance and a cocktail party noise (see Sect. 3). Since the intrusion is not strongly structured, most segments correspond to target speech. In addition, most segments are in the low-frequency range. The initial foreground stream successfully groups most of the major segments. 2.2 P i t c h tr a c k i n g In the presence of acoustic interference, the global pitch estimated in mid-level processing is generally not an accurate description of target pitch. To obtain accurate pitch information, target pitch is first estimated from the foreground stream. At each frame, the autocorrelation functions of T-F units in the foreground stream are summated. The pitch period is the lag corresponding to the maximum of the summation in the plausible pitch range: [2 ms, 12.5 ms]. Then we employ the following two constraints to check its reliability. First, an accurate pitch period at a frame should be consistent with the periodicity of the T-F units at this frame in the foreground stream. At frame j, let τ ( j) represent the estimated pitch period, and A(i, j,τ ) the autocorrelation function of uij, the unit in channel i. uij agrees with τ ( j) if A(i , j , τ ( j )) / A(i, j ,τ m ) > θ d (1) (a) (b) Frequency (Hz) 5000 5000 2335 2335 1028 1028 387 387 80 0 0.5 1 Time (Sec) 1.5 80 0 0.5 1 Time (Sec) 1.5 Figure 2. Results of initial segregation for a speech and cocktail-party mixture. (a) Segments formed. Each segment corresponds to a contiguous black region. (b) Foreground stream. Here, θd = 0.95, the same threshold used in [18], and τ m is the lag corresponding to the maximum of A(i, j,τ ) within [2 ms, 12.5 ms]. τ ( j) is considered reliable if more than half of the units in the foreground stream at frame j agree with it. Second, pitch periods in natural speech vary smoothly in time [11]. We stipulate the difference between reliable pitch periods at consecutive frames be smaller than 20% of the pitch period, justified from pitch statistics. Unreliable pitch periods are replaced by new values extrapolated from reliable pitch points using temporal continuity. As an example, suppose at two consecutive frames j and j+1 that τ ( j) is reliable while τ ( j+1) is not. All the channels corresponding to the T-F units agreeing with τ ( j) are selected. τ ( j+1) is then obtained from the summation of the autocorrelations for the units at frame j+1 in those selected channels. Then the re-estimated pitch is further verified with the second constraint. For more details, see [9]. Fig. 3 illustrates the estimated pitch periods from the speech and cocktail-party mixture, which match the pitch periods obtained from clean speech very well. 2.3 U n i t l a be l i n g With estimated pitch periods, (1) provides a criterion to label T-F units according to whether target speech dominates the unit responses or not. This criterion compares an estimated pitch period with the periodicity of the unit response. It is referred as the periodicity criterion. It works well for resolved harmonics, and is used to label the units of the segments generated in initial segregation. However, the periodicity criterion is not suitable for units responding to multiple harmonics because unit responses are amplitude-modulated. As shown in Fig. 4, for a filter response that is strongly amplitude-modulated (Fig. 
4(a)), the target pitch corresponds to a local maximum, indicated by the vertical line, in the autocorrelation instead of the global maximum (Fig. 4(b)). Observe that for a filter responding to multiple harmonics of a harmonic source, the response envelope fluctuates at the rate of F0 [8]. Hence, we propose a new criterion for labeling the T-F units corresponding to unresolved harmonics by comparing AM repetition rates with estimated pitch. This criterion is referred as the AM criterion. To obtain an AM repetition rate, the entire response of a gammatone filter is half-wave rectified and then band-pass filtered to remove the DC component and other possible 14 Pitch Period (ms) 12 (a) 10 180 185 190 195 200 Time (ms) 2 4 6 8 Lag (ms) 205 210 8 6 4 0 (b) 0.5 1 Time (Sec) Figure 3. Estimated target pitch for the speech and cocktail-party mixture, marked by “x”. The solid line indicates the pitch contour obtained from clean speech. 0 10 12 Figure 4. AM effects. (a) Response of a filter with center frequency 2.6 kHz. (b) Corresponding autocorrelation. The vertical line marks the position corresponding to the pitch period of target speech. harmonics except for the F0 component. The rectified and filtered signal is then normalized by its envelope to remove the intensity fluctuations of the original signal, where the envelope is obtained via the Hilbert Transform. Because the pitch of natural speech does not change noticeably within a single frame, we model the corresponding normalized signal within a T-F unit by a single sinusoid to obtain the AM repetition rate. Specifically, f ,φ   f ij , φ ij = arg min M ˆ [r (i, jT − k ) − sin(2π k f / f S + φ )]2 , for f ∈[80 Hz, 500 Hz], (2) k =1 ˆ where a square error measure is used. r (i , t ) is the normalized filter response, fS is the sampling frequency, M spans a frame, and T= 10 ms is the progressing period from one frame to the next. In the above equation, fij gives the AM repetition rate for unit uij. Note that in the discrete case, a single sinusoid with a sufficiently high frequency can always match these samples perfectly. However, we are interested in finding a frequency within the plausible pitch range. Hence, the solution does not reduce to a degenerate case. With appropriately chosen initial values, this optimization problem can be solved effectively using iterative gradient descent (see [9]). The AM criterion is used to label T-F units that do not belong to any segments generated in initial segregation; such segments, as discussed earlier, tend to miss unresolved harmonics. Specifically, unit uij is labeled as target speech if the final square error is less than half of the total energy of the corresponding signal and the AM repetition rate is close to the estimated target pitch: | f ijτ ( j ) − 1 | < θ f . (3) Psychoacoustic evidence suggests that to separate sounds with overlapping spectra requires 6-12% difference in F0 [6]. Accordingly, we choose θf to be 0.12. 2.4 F i n a l s e gr e g a t i on a n d r e s y n t he s i s For adjacent channels responding to unresolved harmonics, although their responses may be quite different, they exhibit similar AM patterns and their response envelopes are highly correlated. Therefore, for T-F units labeled as target speech, segments are generated based on cross-channel envelope correlation in addition to temporal continuity. 
The spectra of target speech and intrusion often overlap and, as a result, some segments generated in initial segregation contain both units where target speech dominates and those where intrusion dominates. Given unit labels generated in the last stage, we further divide the segments in the foreground stream, SF, so that all the units in a segment have the same label. Then the streams are adjusted as follows. First, since segments for speech usually are at least 50 ms long, segments with the target label are retained in SF only if they are no shorter than 50 ms. Second, segments with the intrusion label are added to the background stream, SB, if they are no shorter than 50 ms. The remaining segments are removed from SF, becoming undecided. Finally, other units are grouped into the two streams by temporal and spectral continuity. First, SB expands iteratively to include undecided segments in its neighborhood. Then, all the remaining undecided segments are added back to SF. For individual units that do not belong to either stream, they are grouped into SF iteratively if the units are labeled as target speech as well as in the neighborhood of SF. The resulting SF is the final segregated stream of target speech. Fig. 5(a) shows the new segments generated in this process for the speech and cocktailparty mixture. Fig. 5(b) illustrates the segregated stream from the same mixture. Fig. 5(c) shows all the units where target speech is stronger than intrusion. The foreground stream generated by our algorithm contains most of the units where target speech is stronger. In addition, only a small number of units where intrusion is stronger are incorrectly grouped into it. A speech waveform is resynthesized from the final foreground stream. Here, the foreground stream works as a binary mask. It is used to retain the acoustic energy from the mixture that corresponds to 1’s and reject the mixture energy corresponding to 0’s. For more details, see [19]. 3 Evalu at i on an d comp ari son Our model is evaluated with a corpus of 100 mixtures composed of 10 voiced utterances mixed with 10 intrusions collected by Cooke [4]. The intrusions have a considerable variety. Specifically, they are: N0 - 1 kHz pure tone, N1 - white noise, N2 - noise bursts, N3 - “cocktail party” noise, N4 - rock music, N5 - siren, N6 - trill telephone, N7 - female speech, N8 - male speech, and N9 - female speech. Given our decomposition of an input signal into T-F units, we suggest the use of an ideal binary mask as the ground truth for target speech. The ideal binary mask is constructed as follows: a T-F unit is assigned one if the target energy in the corresponding unit is greater than the intrusion energy and zero otherwise. Theoretically speaking, an ideal binary mask gives a performance ceiling for all binary masks. Figure 5(c) illustrates the ideal mask for the speech and cocktail-party mixture. Ideal masks also suit well the situations where more than one target need to be segregated or the target changes dynamically. The use of ideal masks is supported by the auditory masking phenomenon: within a critical band, a weaker signal is masked by a stronger one [13]. In addition, an ideal mask gives excellent resynthesis for a variety of sounds and is similar to a prior mask used in a recent ASR study that yields excellent recognition performance [5]. The speech waveform resynthesized from the final foreground stream is used for evaluation, and it is denoted by S(t). 
The speech waveform resynthesized from the ideal binary mask is denoted by I(t). Furthermore, let e1(t) denote the signal present in I(t) but missing from S(t), and e2(t) the signal present in S(t) but missing from I(t). Then, the relative energy loss, REL, and the relative noise residue, RNR, are calculated as follows:     R EL = e12 (t ) t I 2 (t ) , S 2 (t ) . (4b) ¡ ¡ R NR = (4a) t 2 e 2 (t ) t t (a) (b) (c) Frequency (Hz) 5000 2355 1054 387 80 0 0.5 1 Time (Sec) 0 0.5 1 Time (Sec) 0 0.5 1 Time (Sec) Figure 5. Results of final segregation for the speech and cocktail-party mixture. (a) New segments formed in the final segregation. (b) Final foreground stream. (c) Units where target speech is stronger than the intrusion. Table 1: REL and RNR Proposed model Wang-Brown model REL (%) RNR (%) N0 2.12 0.02 N1 4.66 3.55 N2 1.38 1.30 N3 3.83 2.72 N4 4.00 2.27 N5 2.83 0.10 N6 1.61 0.30 N7 3.21 2.18 N8 1.82 1.48 N9 8.57 19.33 3.32 Average 3.40 REL (%) RNR (%) 6.99 0 28.96 1.61 5.77 0.71 21.92 1.92 10.22 1.41 7.47 0 5.99 0.48 8.61 4.23 7.27 0.48 15.81 33.03 11.91 4.39 15 SNR (dB) Intrusion 20 10 5 0 −5 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Intrusion Type Figure 6. SNR results for segregated speech. White bars show the results from the proposed model, gray bars those from the Wang-Brown system, and black bars those of the mixtures. The results from our model are shown in Table 1. Each value represents the average of one intrusion with 10 voiced utterances. A further average across all intrusions is also shown in the table. On average, our system retains 96.60% of target speech energy, and the relative residual noise is kept at 3.32%. As a comparison, Table 1 also shows the results from the Wang-Brown model [18], whose performance is representative of current CASA systems. As shown in the table, our model reduces REL significantly. In addition, REL and RNR are balanced in our system. Finally, to compare waveforms directly we measure a form of signal-to-noise ratio (SNR) in decibels using the resynthesized signal from the ideal binary mask as ground truth: ( I (t ) − S (t )) 2 ] . I 2 (t )     SNR = 10 log10 [ t (5) t The SNR for each intrusion averaged across 10 target utterances is shown in Fig. 6, together with the results from the Wang-Brown system and the SNR of the original mixtures. Our model achieves an average SNR gain of around 12 dB and 5 dB improvement over the Wang-Brown model. 4 Di scu ssi on The main feature of our model lies in using different mechanisms to deal with resolved and unresolved harmonics. As a result, our model is able to recover target speech and reduce noise interference in the high-frequency range where harmonics of target speech are unresolved. The proposed system considers the pitch contour of the target source only. However, it is possible to track the pitch contour of the intrusion if it has a harmonic structure. With two pitch contours, one could label a T-F unit more accurately by comparing whether its periodicity is more consistent with one or the other. Such a method is expected to lead to better performance for the two-speaker situation, e.g. N7 through N9. As indicated in Fig. 6, the performance gain of our system for such intrusions is relatively limited. Our model is limited to separation of voiced speech. In our view, unvoiced speech poses the biggest challenge for monaural speech separation. Other grouping cues, such as onset, offset, and timbre, have been demonstrated to be effective for human ASA [1], and may play a role in grouping unvoiced speech. 
In addition, one should consider the acoustic and phonetic characteristics of individual unvoiced consonants. We plan to investigate these issues in future study. A c k n ow l e d g me n t s We thank G. J. Brown and M. Wu for helpful comments. Preliminary versions of this work were presented in 2001 IEEE WASPAA and 2002 IEEE ICASSP. This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F4962001-1-0027). References [1] A. S. Bregman, Auditory scene analysis, Cambridge MA: MIT Press, 1990. [2] R. P. Carlyon and T. M. Shackleton, “Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms?” J. Acoust. Soc. Am., Vol. 95, pp. 3541-3554, 1994. [3] G. Cauwenberghs, “Monaural separation of independent acoustical components,” In Proc. of IEEE Symp. Circuit & Systems, 1999. [4] M. Cooke, Modeling auditory processing and organization, Cambridge U.K.: Cambridge University Press, 1993. [5] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., Vol. 34, pp. 267-285, 2001. [6] C. J. Darwin and R. P. Carlyon, “Auditory grouping,” in Hearing, B. C. J. Moore, Ed., San Diego CA: Academic Press, 1995. [7] D. P. W. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science, 1996. [8] H. Helmholtz, On the sensations of tone, Braunschweig: Vieweg & Son, 1863. (A. J. Ellis, English Trans., Dover, 1954.) [9] G. Hu and D. L. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” Technical Report TR6, Ohio State University Department of Computer and Information Science, 2002. (available at www.cis.ohio-state.edu/~hu) [10] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis, New York: Wiley, 2001. [11] W. J. M. Levelt, Speaking: From intention to articulation, Cambridge MA: MIT Press, 1989. [12] R. Meddis, “Simulation of auditory-neural transduction: further studies,” J. Acoust. Soc. Am., Vol. 83, pp. 1056-1063, 1988. [13] B. C. J. Moore, An Introduction to the psychology of hearing, 4th Ed., San Diego CA: Academic Press, 1997. [14] D. O’Shaughnessy, Speech communications: human and machine, 2nd Ed., New York: IEEE Press, 2000. [15] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, “An efficient auditory filterbank based on the gammatone function,” APU Report 2341, MRC, Applied Psychology Unit, Cambridge U.K., 1988. [16] R. Plomp and A. M. Mimpen, “The ear as a frequency analyzer II,” J. Acoust. Soc. Am., Vol. 43, pp. 764-767, 1968. [17] S. Roweis, “One microphone source separation,” In Advances in Neural Information Processing Systems 13 (NIPS’00), 2001. [18] D. L. Wang and G. J. Brown, “Separation of speech from interfering sounds based on oscillatory correlation,” IEEE Trans. Neural Networks, Vol. 10, pp. 684-697, 1999. [19] M. Weintraub, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering, 1985.
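For concreteness, the evaluation quantities defined in this entry (the ideal binary mask used as ground truth, the relative energy loss REL, the relative noise residue RNR, and the SNR computed against the ideal-mask resynthesis) can be written as a short NumPy sketch. Only the three formulas themselves are taken from the text; the array shapes are assumed, and the error signals e1 and e2 are taken as given since their construction from the masks is not spelled out here.

```python
import numpy as np

def ideal_binary_mask(target_energy, intrusion_energy):
    """A T-F unit gets 1 if the target energy exceeds the intrusion energy
    in that unit, and 0 otherwise (the ground-truth mask described above).
    Both inputs are assumed to be (channels x frames) energy arrays."""
    return (target_energy > intrusion_energy).astype(float)

def rel_rnr_snr(I, S, e1, e2):
    """Evaluation metrics as defined in the entry above.
    I : waveform resynthesized from the ideal binary mask (ground truth)
    S : waveform resynthesized from the segregated foreground stream
    e1: signal present in I but missing from S
    e2: signal present in S but missing from I
    Returns (REL, RNR, SNR in dB)."""
    rel = np.sum(e1 ** 2) / np.sum(I ** 2)   # relative energy loss
    rnr = np.sum(e2 ** 2) / np.sum(S ** 2)   # relative noise residue
    snr = 10.0 * np.log10(np.sum(I ** 2) / np.sum((I - S) ** 2))
    return rel, rnr, snr
```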

6 0.64719838 41 nips-2002-Bayesian Monte Carlo

7 0.62519121 163 nips-2002-Prediction and Semantic Association

8 0.57886988 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

9 0.56886774 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

10 0.5576694 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

11 0.55419385 169 nips-2002-Real-Time Particle Filters

12 0.54548931 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis

13 0.54370558 167 nips-2002-Rational Kernels

14 0.54125559 85 nips-2002-Fast Kernels for String and Tree Matching

15 0.5411678 168 nips-2002-Real-Time Monitoring of Complex Industrial Processes with Particle Filters

16 0.54097772 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

17 0.53579938 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering

18 0.53262162 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

19 0.52679193 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

20 0.52395409 45 nips-2002-Boosted Dyadic Kernel Discriminants