nips nips2003 nips2003-117 knowledge-graph by maker-knowledge-mining

117 nips-2003-Linear Response for Approximate Inference


Source: pdf

Author: Max Welling, Yee W. Teh

Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. [sent-5, score-0.431]

2 In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. [sent-6, score-0.187]

3 The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. [sent-7, score-0.321]

4 1 Introduction: Belief propagation (BP) has become an important tool for approximate inference on graphs with cycles. [sent-10, score-0.215]

5 In particular, recent developments show that the stable fixed points are local minima of the Bethe free energy [10, 1], which paved the way for more accurate “generalized belief propagation” algorithms and convergent alternatives to BP [11, 6]. [sent-13, score-0.339]

6 Despite its success, BP does not provide a prescription to compute joint probabilities over pairs of non-neighboring nodes in the graph. [sent-14, score-0.159]

7 However, when cycles exist, it is not clear what approximate procedure is appropriate. [sent-16, score-0.064]

8 We show that the required estimates can be obtained by computing the “sensitivity” of the node marginals to small changes in the node potentials. [sent-18, score-0.262]

9 Based on this idea, we present two algorithms to estimate the joint probabilities of arbitrary pairs of nodes. [sent-19, score-0.066]

10 These results are interesting in the inference domain but may also have future applications to learning graphical models from data. [sent-20, score-0.06]

11 2 Belief Propagation on Factor Graphs: Let $V$ index a collection of random variables $\{X_i\}_{i \in V}$ and let $x_i$ denote values of $X_i$. [sent-22, score-0.417]

12 For a subset of nodes α ⊂ V let Xα = {Xi }i∈α be the variable associated with that subset, and xα be values of Xα . [sent-23, score-0.093]

13 The joint distribution over $X = X_V$ is assumed to have the following form, $P_X(X = x) = \frac{1}{Z} \prod_{\alpha \in A} \psi_\alpha(x_\alpha) \prod_{i \in V} \psi_i(x_i)$ (1), where $Z$ is the normalization constant (the partition function) and $\psi_\alpha$, $\psi_i$ are positive potential functions defined on subsets and single nodes respectively. [sent-26, score-0.093]
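
A minimal sketch of the factorization in (1), assuming three binary variables and two pairwise factors; the names `psi_node` and `psi_fac` are illustrative, and the brute-force partition function is feasible only for tiny graphs:

```python
import itertools
import numpy as np

V = [0, 1, 2]                                    # variable nodes
A = [(0, 1), (1, 2)]                             # factor subsets alpha
psi_node = {i: np.array([1.0, 2.0]) for i in V}  # psi_i(x_i) > 0
psi_fac = {a: np.array([[2.0, 1.0],
                        [1.0, 2.0]]) for a in A} # psi_alpha(x_alpha) > 0

def unnormalized_p(x):
    """Product of all factor and node potentials for a full assignment x."""
    p = 1.0
    for (i, j) in A:
        p *= psi_fac[(i, j)][x[i], x[j]]
    for i in V:
        p *= psi_node[i][x[i]]
    return p

# Partition function Z by enumeration over all joint states.
Z = sum(unnormalized_p(x) for x in itertools.product([0, 1], repeat=len(V)))
```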

14 The decomposition of (1) is consistent with a factor graph with function nodes over Xα and variable nodes Xi . [sent-29, score-0.211]

15 For each $i \in V$ denote its neighbors by $N_i = \{\alpha \in A : \alpha \ni i\}$, and for each subset $\alpha$ its neighbors are simply $N_\alpha = \{i \in \alpha\}$. [sent-30, score-0.068]

16 Factor graphs are a convenient representation for structured probabilistic models and subsume undirected graphical models and acyclic directed graphical models [3]. [sent-31, score-0.122]

17 Marginal distributions over factor nodes and variable nodes are expressed in terms of these messages as follows, $b_\alpha(x_\alpha) = \frac{1}{\gamma_\alpha} \psi_\alpha(x_\alpha) \prod_{i \in N_\alpha} n_{i\alpha}(x_i)$ and $b_i(x_i) = \frac{1}{\gamma_i} \psi_i(x_i) \prod_{\alpha \in N_i} m_{\alpha i}(x_i)$ (3), where $\gamma_i$, $\gamma_\alpha$ are normalization constants. [sent-33, score-0.514]
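
A sketch of (3), continuing the toy model above: beliefs are assembled from hypothetical, already-converged message tables `m[(alpha, i)]` (factor-to-node) and `n[(i, alpha)]` (node-to-factor), each a vector over the states of node i:

```python
import numpy as np

def node_belief(i, m):
    b = psi_node[i].copy()
    for a in A:
        if i in a:                      # all alpha in N_i
            b = b * m[(a, i)]
    return b / b.sum()                  # 1/gamma_i normalizes

def factor_belief(a, n):
    i, j = a                            # pairwise factor, for simplicity
    b = psi_fac[a] * np.outer(n[(i, a)], n[(j, a)])
    return b / b.sum()                  # 1/gamma_alpha normalizes
```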

18 3 Linear Response: In the following we will be interested in computing estimates of joint probability distributions for arbitrary pairs of nodes. [sent-35, score-0.094]

19 We propose a method based on the linear response theorem. [sent-36, score-0.095]

20 The idea is to study changes in the system when we perturb single node potentials, $\log \psi_i(x_i) = \log \psi_i^0(x_i) + \theta_i(x_i)$ (6). The superscript 0 indicates unperturbed quantities in (6) and the following. [sent-37, score-0.136]

21 Expressions for higher order cumulants can be derived by taking further derivatives of −F (θ). [sent-39, score-0.086]

22 Notice from (9) that the covariance estimates are obtained by studying the perturbations in pj (xj ) as we vary θi (xi ). [sent-40, score-0.234]

23 This is not practical in general since calculating pj (xj ) itself is intractable. [sent-41, score-0.056]

24 Instead, we consider perturbations of approximate marginal distributions {bj }. [sent-42, score-0.188]

25 In the following we will assume that bj (xj ; θ) (with the dependence on θ made explicit) are the beliefs at a local minimum of the BP-Gibbs free energy (subject to constraints). [sent-43, score-0.364]

26 In analogy to (9), let $C_{ij}(x_i, x_j) = \left.\frac{\partial b_j(x_j; \theta)}{\partial \theta_i(x_i)}\right|_{\theta=0}$ be the linear response estimated covariance, and define the linear response estimated joint pairwise marginal as $b^{LR}_{ij}(x_i, x_j) = b^0_i(x_i)\, b^0_j(x_j) + C_{ij}(x_i, x_j)$ (10). [sent-44, score-1.348]
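
A numerical sketch of (9)-(10), assuming a hypothetical routine `run_bp(theta)` that re-runs belief propagation with the perturbed potentials $\log \psi_i^0 + \theta_i$ and returns node beliefs as arrays; a finite-difference step stands in for the analytic derivative:

```python
import numpy as np

def linear_response_joint(run_bp, i, j, n_states, eps=1e-5):
    b0 = run_bp(theta={})                      # unperturbed beliefs b^0
    C = np.zeros((n_states, n_states))
    for xi in range(n_states):
        theta = {i: np.zeros(n_states)}
        theta[i][xi] = eps                     # perturb theta_i(x_i)
        C[xi, :] = (run_bp(theta=theta)[j] - b0[j]) / eps  # ~ C_ij(x_i, .)
    return np.outer(b0[i], b0[j]) + C          # b^LR_ij, eq. (10)
```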

27 We will show that $b^{LR}_{ij}$ and $C_{ij}$ satisfy a number of important properties which make them suitable as approximations of joint marginals and covariances. [sent-46, score-0.241]

28 First we show that $C_{ij}(x_i, x_j)$ can be interpreted as the Hessian of a well-behaved convex function. [sent-47, score-0.36]

29 Let C be the set of beliefs that satisfy the constraints (5). [sent-48, score-0.129]

30 The approximate marginals $\{b^0_i\}$ along with the joint marginals $\{b^0_\alpha\}$ form a local minimum of the Bethe-Gibbs free energy. [sent-49, score-0.349]

31 This minimum is subject to the constraint $b^0 = \{b^0_i, b^0_\alpha\} \in C$. [sent-50, score-0.096]

32 Assume that $b^0$ is a strict local minimum of $G_{BP}$ (the strict local minimality is in fact attained if we use loopy belief propagation [1]). [sent-51, score-0.289]

33 Now we can define $G^*(\theta) = \inf_{b \in D \cap C} \left[ G_{BP}(b) - \sum_{i, x_i} b_i(x_i) \theta_i(x_i) \right]$ (11). $G^*(\theta)$ is a concave function since it is the infimum of a set of linear functions in $\theta$. [sent-53, score-0.264]

34 Since $b^0$ is a strict local minimum when $\theta = 0$, small perturbations in $\theta$ will result in small perturbations in $b^0$, so that $G^*$ is well-behaved on an open neighborhood around $\theta = 0$. [sent-55, score-0.287]

35 Differentiating $G^*(\theta)$, we get $\frac{\partial G^*(\theta)}{\partial \theta_j(x_j)} = -b_j(x_j; \theta)$, so we now have $C_{ij}(x_i, x_j) = \left.\frac{\partial b_j(x_j; \theta)}{\partial \theta_i(x_i)}\right|_{\theta=0} = \left.-\frac{\partial^2 G^*(\theta)}{\partial \theta_i(x_i) \partial \theta_j(x_j)}\right|_{\theta=0}$ (12). In essence, we can interpret $G^*(\theta)$ as a local convex dual of $G_{BP}(b)$ (by restricting attention to $D$). [sent-56, score-0.411]

36 Since GBP is an approximation to the exact Gibbs free energy [8], which is in turn dual to F (θ) [4], G∗ (θ) can be seen as an approximation to F (θ) for small values of θ. [sent-57, score-0.116]

37 For that reason we can take its second derivatives Cij (xi , xj ) as approximations to the exact covariances (which are second derivatives of −F (θ)). [sent-58, score-0.588]

38 Theorem 1 The approximate covariance satisfies the following symmetry: $C_{ij}(x_i, x_j) = C_{ji}(x_j, x_i)$ (13). Proof: The covariances are second derivatives of $-G^*(\theta)$ at $\theta = 0$, so we can interchange the order of the derivatives since $G^*(\theta)$ is well-behaved on a neighborhood around $\theta = 0$. [sent-59, score-1.084]

39 Since $-F(\theta)$ is convex, its Hessian matrix with entries given in (9) is positive semi-definite. [sent-61, score-0.08]

40 Similarly, since the approximate covariances $C_{ij}(x_i, x_j)$ are second derivatives of the convex function $-G^*(\theta)$, we have: Theorem 3 The matrix formed from the approximate covariances $C_{ij}(x_i, x_j)$, with $(i, x_i)$ varying over the rows and $(j, x_j)$ over the columns, is positive semi-definite. [sent-62, score-1.907]
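
A sketch of Theorems 1 and 3, assuming `C_blocks[i][j]` holds the D x D array of entries $C_{ij}(x_i, x_j)$; the blocks are flattened with $(i, x_i)$ over rows and $(j, x_j)$ over columns, and the claimed symmetry and positive semi-definiteness are checked numerically:

```python
import numpy as np

def check_covariance_matrix(C_blocks, tol=1e-8):
    n = len(C_blocks)
    M = np.block([[C_blocks[i][j] for j in range(n)] for i in range(n)])
    is_symmetric = np.allclose(M, M.T, atol=tol)              # Theorem 1
    is_psd = np.linalg.eigvalsh((M + M.T) / 2).min() >= -tol  # Theorem 3
    return is_symmetric, is_psd
```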

41 4 Propagating Perturbations for Linear Response Recall from (10) that we need the first derivative of bi (xi ; θ) with respect to θj (xj ) at θ = 0. [sent-64, score-0.243]

42 This does not automatically imply that we need an analytic expression for bi (xi ; θ) in terms of θ. [sent-65, score-0.243]

43 In this section we show how we may compute these first derivatives by expanding all quantities and equations up to first order in θ and keeping track of first order dependencies. [sent-66, score-0.145]

44 First we assume that belief propagation has converged to a stable fixed point. [sent-67, score-0.253]

45 The unconventional form of this expansion will make subsequent derivations more transparent. [sent-69, score-0.05]

46 The “response matrices” Rij , Niα,j , Mαi,j measure the sensitivities of the corresponding logarithms of beliefs and messages to changes in the log potentials log ψj (yj ) at node j. [sent-70, score-0.282]

47 Just as for belief propagation, where messages are normalized to avoid numerical over- or underflow, after each update the super-messages are “normalized” as follows: $M_{\alpha i,k}(x_i, y_k) \leftarrow M_{\alpha i,k}(x_i, y_k) - \sum_{x_i} M_{\alpha i,k}(x_i, y_k)$ (22), and similarly for $N_{i\alpha,k}$. [sent-72, score-1.089]
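
A sketch of the normalization (22), assuming each super-message is stored as a D x D array indexed $[x_i, y_k]$ and that the subtracted quantity is the sum over $x_i$, as the layout of (22) suggests:

```python
import numpy as np

def normalize_super_message(M_aik):
    # Subtract the sum over x_i from each column y_k, fixing the additive
    # degree of freedom left by message normalization in ordinary BP.
    return M_aik - M_aik.sum(axis=0, keepdims=True)
```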

48 Moreover, there exists a scheduling of the super-messages such that the algorithm converges after just one iteration. [sent-74, score-0.036]

49 Sketch of Proof: Both results follow from the fact that belief propagation on tree structured factor graphs computes the exact single node marginals for arbitrary θ. [sent-77, score-0.416]

50 Since the supermessages are the first order terms of the BP updates with arbitrary θ, we can invoke the exact linear response theorem given by (8) and (9) to claim that the algorithm converges to the exact joint pairwise marginal distributions. [sent-78, score-0.269]

51 For graphs with cycles, BP is not guaranteed to converge. [sent-79, score-0.045]

52 Theorem 5 If the messages $\{m^0_{\alpha i}(x_i), n^0_{i\alpha}(x_i)\}$ have converged to a stable fixed point, then the update equations for the super-messages (20, 21, 22) will also converge to a unique stable fixed point, using any scheduling of the super-messages. [sent-81, score-0.28]

53 Sketch of Proof: We first note that the updates (20, 21, 22) form a linear system of equations which can only have one stable fixed point. [sent-82, score-0.123]

54 (Footnote 3: For a more detailed proof of the above two theorems we refer to [9].) [sent-83, score-0.022]

55 The existence and stability of this fixed point is proven by observing that the first order term is identical to the one obtained from a linear expansion of the BP equations (2) around its stable fixed point. [sent-84, score-0.123]

56 Finally, the Stein-Rosenberg theorem guarantees that any scheduling will converge to the same fixed point. [sent-85, score-0.073]
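
A sketch of the structure behind this argument: the super-message updates form a linear iteration $v \leftarrow Av + c$, which converges from any starting point iff the spectral radius of $A$ is below one; here $A$ and $c$ are assumed to have been assembled from the linearization of the BP updates at the fixed point:

```python
import numpy as np

def spectral_radius(A):
    return np.max(np.abs(np.linalg.eigvals(A)))

def iterate_super_messages(A, c, v0, n_iters=200):
    v = v0
    for _ in range(n_iters):
        v = A @ v + c      # one sweep of super-message updates
    return v
```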

57 5 Inverting Matrices for Linear Response: In this section we describe an alternative method to compute $\frac{\partial b_i(x_i)}{\partial \theta_k(x_k)}$ by first computing $\frac{\partial \theta_i(x_i)}{\partial b_k(x_k)}$ and then inverting the matrix formed by flattening $\{i, x_i\}$ into a row index and $\{k, x_k\}$ into a column index. [sent-86, score-0.689]

58 The intuition is that while perturbations in a single $\theta_i(x_i)$ affect the whole system, perturbations in a single $b_i(x_i)$ (while keeping the others fixed) affect each subsystem $\alpha \in A$ independently (see [8]). [sent-88, score-0.536]

59 First we propose minimal representations for $b_i$, $\theta_i$ and the messages. [sent-90, score-0.243]

60 We assume that for each node i there is a distinguished value xi = 0. [sent-91, score-0.476]

61 Set $\theta_i(0) = 0$ and functionally define $b_i(0) = 1 - \sum_{x_i \neq 0} b_i(x_i)$. [sent-92, score-0.903]

62 Now the matrix formed by $\frac{\partial \theta_i(x_i)}{\partial b_k(x_k)}$ for each $i, k$ and $x_i, x_k \neq 0$ is invertible, and its inverse gives us the desired covariances for $x_i, x_k \neq 0$. [sent-93, score-1.452]

63 Values for xi = 0 or xk = 0 can then be computed using (14). [sent-94, score-0.627]
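
A sketch of the inversion step in this minimal representation: with the distinguished state 0 dropped, $(i, x_i)$ with $x_i \in \{1, \ldots, D-1\}$ is flattened to a single index, and the matrix of $\partial\theta/\partial b$ (assumed given) is inverted to recover the covariances:

```python
import numpy as np

def flat_index(i, xi, D):
    return i * (D - 1) + (xi - 1)      # valid for x_i != 0

def covariances_from_dtheta(dtheta_db):
    # Rows/columns are the flattened (i, x_i) and (k, x_k), x != 0.
    return np.linalg.inv(dtheta_db)
```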

64 This can be achieved by defining new quantities $\lambda_{i\alpha}(x_i) = \log \frac{n_{i\alpha}(x_i)}{n_{i\alpha}(0)}$ for all $i$ and $x_i \neq 0$. [sent-96, score-0.467]
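
A one-line sketch of this reparameterization, assuming a message table `n_ia` of length D for node i and factor alpha, with state 0 distinguished:

```python
import numpy as np

def lambda_from_message(n_ia):
    return np.log(n_ia[1:] / n_ia[0])  # lambda_{i alpha}(x_i) for x_i != 0
```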

65 The λiα ’s can be interpreted as Lagrange multipliers to enforce the consistency constraints (5) [10]. [sent-97, score-0.056]

66 We will use these multipliers instead of the messages in this section. [sent-98, score-0.083]

67 Since perturbations in $b_k(x_k)$ (while keeping the other $b_j$'s fixed) do not affect nodes not directly connected to $k$, we have $\frac{\partial \lambda_{i\alpha}(x_i)}{\partial b_k(x_k)} = 0$ for $k \notin \alpha$. [sent-101, score-0.604]

68 When k ∈ α, these can in turn be obtained by solving, for each α, a matrix inverse. [sent-102, score-0.03]

69 Differentiating (27) by $b_k(x_k)$, we obtain $\sum_{j \in \alpha} \sum_{x_j \neq 0} C^\alpha_{ij}(x_i, x_j) \, \frac{\partial \lambda_{j\alpha}(x_j)}{\partial b_k(x_k)} = \delta_{ik} \delta_{x_i x_k}$ (29), where $C^\alpha_{ij}(x_i, x_j) = b_\alpha(x_i, x_j) - b_i(x_i) b_j(x_j)$ if $i \neq j$ and $C^\alpha_{ij}(x_i, x_j) = b_i(x_i) \delta_{x_i x_j} - b_i(x_i) b_j(x_j)$ if $i = j$ (30), for each $i, k \in N_\alpha$ and $x_i, x_k \neq 0$. [sent-103, score-3.434]
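
A sketch of (29)-(30) for a single factor alpha, assuming fixed-point beliefs `b_node[i]` (length D) and `b_pair[(i, j)]` (D x D) are available; by (29), the block matrix $C_\alpha$ times the matrix of derivatives $\partial\lambda/\partial b$ is the identity, so the derivatives are the entries of $C_\alpha^{-1}$:

```python
import numpy as np

def lambda_derivatives(alpha, b_node, b_pair, D):
    n = len(alpha)
    C = np.zeros((n * (D - 1), n * (D - 1)))
    for a, i in enumerate(alpha):
        for b, j in enumerate(alpha):
            for xi in range(1, D):
                for xj in range(1, D):
                    if i == j:
                        v = (b_node[i][xi] * (xi == xj)
                             - b_node[i][xi] * b_node[j][xj])
                    else:
                        v = (b_pair[(i, j)][xi, xj]
                             - b_node[i][xi] * b_node[j][xj])
                    C[a * (D - 1) + xi - 1, b * (D - 1) + xj - 1] = v
    return np.linalg.inv(C)  # entries are d lambda_{j alpha} / d b_k
```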

70 Figure 1: $L_1$-error in covariances for MF+LR, BP, BP+LR and “conditioning”. [sent-110, score-0.144]

71 The results are separately plotted for neighboring nodes (a), next-to-nearest neighboring nodes (b) and the remaining nodes (c). [sent-112, score-0.345]

72 The first is a covariance matrix $C_\alpha$ whose $ij$-th block is $C^\alpha_{ij}(x_i, x_j)$, while the second matrix consists of all the desired derivatives $\frac{\partial \lambda_{j\alpha}(x_j)}{\partial b_k(x_k)}$. [sent-114, score-0.536]

73 Hence the derivatives are given as elements of the inverse covariance matrix $C_\alpha^{-1}$. [sent-115, score-0.09]

74 Finally, plugging the values of $\frac{\partial \lambda_{j\alpha}(x_j)}{\partial b_k(x_k)}$ into (28) now gives $\frac{\partial \theta_i(x_i)}{\partial b_k(x_k)}$, and inverting that matrix will now give us the desired approximate covariances over the whole graph. [sent-116, score-0.249]

75 Interestingly, the method only requires access to the beliefs at the local minimum, not to the potentials or Lagrange multipliers. [sent-117, score-0.14]

76 6 Experiment The accuracy of the estimated covariances Cij (xi , xj ) in the LR approximation was studied on a 6 × 6 square grid with only nearest neighbors connected and 3 states per node. [sent-118, score-0.514]

77 $C$ was computed as $C_{ij} = b_{ij} - b_i b_j$, with $\{b_i, b_j\}$ the marginals of $b_{ij}$, and the result was symmetrized. [sent-120, score-0.805]
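
A sketch of this evaluation step, assuming estimated pairwise beliefs `b_ij` and `b_ji` (each D x D) and the exact joint `b_true`; the covariance is formed as above, the two directed estimates are symmetrized, and the score is the mean absolute deviation used in Figure 1:

```python
import numpy as np

def covariance_l1_error(b_ij, b_ji, b_true):
    def cov(b):
        # C = b_ij - b_i b_j, with marginals taken from b itself
        return b - np.outer(b.sum(axis=1), b.sum(axis=0))
    C = (cov(b_ij) + cov(b_ji).T) / 2   # symmetrize C_ij with C_ji
    return np.abs(C - cov(b_true)).mean()
```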

78 The error was computed as the absolute difference between the estimated and the true values, averaged over pairs of nodes and their possible states, and averaged over 25 random draws of the network. [sent-121, score-0.116]

79 An instantiation of a network was generated by randomly drawing the logarithm of the edge potentials from a zero-mean Gaussian with a standard deviation ranging over [0, 2]. [sent-122, score-0.058]

80 The latter does however satisfy some desirable properties which are violated by conditioning (see section 7 for further discussion). [sent-125, score-0.127]

81 The computational complexity of the iterative linear response algorithm scales as $O(N \cdot E \cdot D^3)$ per iteration ($N$ = #nodes, $E$ = #edges, $D$ = #states per node). [sent-127, score-0.095]

82 The non-iterative algorithm scales slightly worse, $O(N^3 \cdot D^3)$, but is based on a matrix inverse for which very efficient implementations exist. [sent-128, score-0.054]

83 A question that remains open is whether we can improve the efficiency of the iterative algorithm when we are only interested in the joint distributions of neighboring nodes. [sent-129, score-0.076]

84 Thirdly, when applying the same techniques to Gaussian random fields, a propagation algorithm results that computes the inverse of the weight matrix exactly [9]. [sent-133, score-0.147]

85 In the case of more general continuous random field models we are investigating whether linear response algorithms can be applied to the fixed points of expectation propagation. [sent-134, score-0.095]

86 The most important distinguishing feature between the proposed LR algorithm and the conditioning procedure described in section 6 is the fact that the covariance estimate is automatically positive semi-definite. [sent-135, score-0.139]

87 Indeed the idea to include global constraints such as positive semi-definiteness in approximate inference algorithms was proposed in [7]. [sent-136, score-0.11]

88 Other differences include automatic consistency between joint pairwise marginals from LR and node marginals from BP (not true for conditioning) and a convergence proof for the LR algorithm (absent for conditioning, but not observed to be a problem experimentally). [sent-137, score-0.382]

89 Finally, the non-iterative algorithm is applicable to all local minima in the Bethe-Gibbs free energy, even those that correspond to unstable fixed points of BP. [sent-138, score-0.114]

90. Stable fixed points of loopy belief propagation are minima of the Bethe free energy. [sent-144, score-0.298]

91 Efficient learning in Boltzmann machines using linear response theory. [sent-151, score-0.095]

92 Probabilistic inference by means of cluster variation method and linear response theory. [sent-169, score-0.129]

93 Semidefinite relaxations for approximate inference on graphs with cycles. [sent-182, score-0.122]

94 Linear response algorithms for approximate inference in graphical models. [sent-197, score-0.177]

95 CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. [sent-209, score-0.148]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xi', 0.417), ('xj', 0.336), ('lr', 0.261), ('cij', 0.254), ('bi', 0.243), ('xk', 0.21), ('bbp', 0.201), ('bp', 0.197), ('ni', 0.192), ('bk', 0.188), ('yk', 0.178), ('bj', 0.165), ('covariances', 0.144), ('rij', 0.121), ('marginals', 0.116), ('perturbations', 0.114), ('blr', 0.11), ('gbp', 0.108), ('conditioning', 0.103), ('nodes', 0.093), ('propagation', 0.093), ('belief', 0.078), ('response', 0.074), ('mf', 0.072), ('beliefs', 0.072), ('messages', 0.06), ('node', 0.059), ('bij', 0.058), ('stable', 0.057), ('pj', 0.056), ('yj', 0.055), ('derivatives', 0.054), ('differentiating', 0.051), ('ij', 0.05), ('free', 0.049), ('marginalization', 0.048), ('energy', 0.047), ('graphs', 0.045), ('equations', 0.045), ('bethe', 0.044), ('welling', 0.044), ('joint', 0.043), ('approximate', 0.043), ('message', 0.04), ('potentials', 0.037), ('theorem', 0.037), ('supermessages', 0.037), ('scheduling', 0.036), ('covariance', 0.036), ('inference', 0.034), ('neighbors', 0.034), ('minima', 0.034), ('constraints', 0.033), ('neighboring', 0.033), ('ik', 0.032), ('inverting', 0.032), ('xed', 0.032), ('cumulants', 0.032), ('ci', 0.031), ('marginal', 0.031), ('local', 0.031), ('matrix', 0.03), ('gibbs', 0.029), ('estimates', 0.028), ('strict', 0.028), ('pi', 0.027), ('log', 0.027), ('cumulant', 0.027), ('pairwise', 0.026), ('graphical', 0.026), ('acyclic', 0.025), ('px', 0.025), ('factor', 0.025), ('converged', 0.025), ('berkeley', 0.024), ('expansions', 0.024), ('kikuchi', 0.024), ('boltzmann', 0.024), ('kj', 0.024), ('convex', 0.024), ('satisfy', 0.024), ('inverse', 0.024), ('pairs', 0.023), ('keeping', 0.023), ('multipliers', 0.023), ('quantities', 0.023), ('convergent', 0.022), ('semide', 0.022), ('hessian', 0.022), ('proof', 0.022), ('linear', 0.021), ('inserting', 0.021), ('cycles', 0.021), ('teh', 0.021), ('pij', 0.021), ('edge', 0.021), ('alternatives', 0.021), ('wainwright', 0.021), ('affect', 0.021), ('dual', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 117 nips-2003-Linear Response for Approximate Inference

Author: Max Welling, Yee W. Teh

Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.

2 0.24693219 189 nips-2003-Tree-structured Approximations by Expectation Propagation

Author: Yuan Qi, Tom Minka

Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1

3 0.20360959 152 nips-2003-Pairwise Clustering and Graphical Models

Author: Noam Shental, Assaf Zomet, Tomer Hertz, Yair Weiss

Abstract: Significant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation which makes it difficult to automatically set parameters using training data. In this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data. 1

4 0.14610089 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

Author: Thore Graepel, Ralf Herbrich

Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1

5 0.14231992 32 nips-2003-Approximate Expectation Maximization

Author: Tom Heskes, Onno Zoeter, Wim Wiegerinck

Abstract: We discuss the integration of the expectation-maximization (EM) algorithm for maximum likelihood learning of Bayesian networks with belief propagation algorithms for approximate inference. Specifically we propose to combine the outer-loop step of convergent belief propagation algorithms with the M-step of the EM algorithm. This then yields an approximate EM algorithm that is essentially still double loop, with the important advantage of an inner loop that is guaranteed to converge. Simulations illustrate the merits of such an approach. 1

6 0.1390377 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

7 0.1382353 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

8 0.13154715 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

9 0.12746736 193 nips-2003-Variational Linear Response

10 0.12159494 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

11 0.1214907 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

12 0.12059696 108 nips-2003-Learning a Distance Metric from Relative Comparisons

13 0.11630875 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

14 0.11507702 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation

15 0.11163627 100 nips-2003-Laplace Propagation

16 0.10913177 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

17 0.10142174 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation

18 0.09604951 169 nips-2003-Sample Propagation

19 0.094576478 126 nips-2003-Measure Based Regularization

20 0.089405537 115 nips-2003-Linear Dependent Dimensionality Reduction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.255), (1, -0.14), (2, -0.142), (3, 0.237), (4, 0.161), (5, -0.171), (6, 0.023), (7, -0.027), (8, 0.023), (9, -0.059), (10, 0.058), (11, 0.116), (12, 0.0), (13, -0.109), (14, -0.057), (15, -0.003), (16, 0.093), (17, 0.069), (18, -0.241), (19, 0.023), (20, 0.146), (21, -0.022), (22, -0.017), (23, 0.022), (24, -0.002), (25, 0.08), (26, -0.088), (27, -0.195), (28, 0.028), (29, 0.041), (30, -0.015), (31, -0.062), (32, 0.005), (33, -0.014), (34, -0.022), (35, 0.085), (36, -0.093), (37, 0.031), (38, 0.094), (39, 0.017), (40, 0.012), (41, -0.073), (42, -0.063), (43, 0.037), (44, 0.089), (45, 0.025), (46, -0.128), (47, -0.046), (48, -0.174), (49, -0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97164357 117 nips-2003-Linear Response for Approximate Inference

Author: Max Welling, Yee W. Teh

Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.

2 0.65856713 189 nips-2003-Tree-structured Approximations by Expectation Propagation

Author: Yuan Qi, Tom Minka

Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1

3 0.63760555 152 nips-2003-Pairwise Clustering and Graphical Models

Author: Noam Shental, Assaf Zomet, Tomer Hertz, Yair Weiss

Abstract: Significant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation which makes it difficult to automatically set parameters using training data. In this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data. 1

4 0.57454175 100 nips-2003-Laplace Propagation

Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1

5 0.53013116 108 nips-2003-Learning a Distance Metric from Relative Comparisons

Author: Matthew Schultz, Thorsten Joachims

Abstract: This paper presents a method for learning a distance metric from relative comparison such as “A is closer to B than A is to C”. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents. 1

6 0.52790213 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

7 0.5241636 32 nips-2003-Approximate Expectation Maximization

8 0.51691169 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation

9 0.51062638 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

10 0.49878588 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

11 0.49007556 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

12 0.4749994 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

13 0.46868533 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

14 0.39664191 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

15 0.38390744 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation

16 0.37112498 126 nips-2003-Measure Based Regularization

17 0.35882202 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

18 0.35813564 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

19 0.34659699 120 nips-2003-Locality Preserving Projections

20 0.34506369 169 nips-2003-Sample Propagation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.039), (5, 0.012), (11, 0.015), (29, 0.02), (30, 0.012), (35, 0.124), (47, 0.174), (53, 0.08), (59, 0.011), (69, 0.017), (71, 0.155), (76, 0.049), (85, 0.117), (91, 0.07), (99, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89829713 117 nips-2003-Linear Response for Approximate Inference

Author: Max Welling, Yee W. Teh

Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.

2 0.83435404 5 nips-2003-A Classification-based Cocktail-party Processor

Author: Nicoleta Roman, Deliang Wang, Guy J. Brown

Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.

3 0.77996927 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1

4 0.77405816 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing

Author: Konrad P. Körding, Daniel M. Wolpert

Abstract: When we learn a new motor skill, we have to contend with both the variability inherent in our sensors and the task. The sensory uncertainty can be reduced by using information about the distribution of previously experienced tasks. Here we impose a distribution on a novel sensorimotor task and manipulate the variability of the sensory feedback. We show that subjects internally represent both the distribution of the task as well as their sensory uncertainty. Moreover, they combine these two sources of information in a way that is qualitatively predicted by optimal Bayesian processing. We further analyze if the subjects can represent multimodal distributions such as mixtures of Gaussians. The results show that the CNS employs probabilistic models during sensorimotor learning even when the priors are multimodal.

5 0.76696151 100 nips-2003-Laplace Propagation

Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1

6 0.76217949 145 nips-2003-Online Classification on a Budget

7 0.75715029 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

8 0.7568984 189 nips-2003-Tree-structured Approximations by Expectation Propagation

9 0.75185317 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

10 0.74920112 41 nips-2003-Boosting versus Covering

11 0.74762291 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

12 0.74556965 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

13 0.74481267 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

14 0.74386901 124 nips-2003-Max-Margin Markov Networks

15 0.74371564 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

16 0.74309093 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

17 0.74224824 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

18 0.73887587 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

19 0.73504186 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

20 0.73355836 3 nips-2003-AUC Optimization vs. Error Rate Minimization