nips nips2007 nips2007-190 knowledge-graph by maker-knowledge-mining

190 nips-2007-Support Vector Machine Classification with Indefinite Kernels


Source: pdf

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. [sent-4, score-0.49]

2 This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. [sent-5, score-0.407]

3 Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. [sent-6, score-0.57]

4 Our interest in indefinite kernels is motivated by several observations. [sent-9, score-0.163]

5 First, certain similarity measures take advantage of application-specific structure in the data and often display excellent empirical classification performance. [sent-10, score-0.083]

6 Unlike popular kernels used in support vector machine classification, these similarity matrices are often indefinite and so do not necessarily correspond to a reproducing kernel Hilbert space (see [1] for a discussion). [sent-11, score-0.529]

7 An application of classification with indefinite kernels to image classification using Earth Mover’s Distance was discussed in [2]. [sent-12, score-0.163]

8 Similarity measures for protein sequences such as the Smith-Waterman and BLAST scores are indefinite yet have provided hints for constructing useful positive semidefinite kernels such as those described in [3] or have been transformed into positive semidefinite kernels (see [4] for example). [sent-13, score-0.5]

9 Here instead, our objective is to directly use these indefinite similarity measures for classification. [sent-14, score-0.116]

10 Finally, it is sometimes impossible to prove that some kernels satisfy Mercer’s condition or the numerical complexity of evaluating the exact positive semidefinite kernel is too high and a proxy (and not necessarily positive semidefinite) kernel has to be used instead (see [9] for example). [sent-17, score-0.961]

11 1 Current results Several methods have been proposed for dealing with indefinite kernels in SVMs. [sent-20, score-0.163]

12 A first direction embeds data in a pseudo-Euclidean (pE) space: [10] for example, formulates the classification problem with an indefinite kernel as that of minimizing the distance between convex hulls formed from the two categories of data embedded in the pE space. [sent-21, score-0.402]

13 The nonseparable case is handled in the same manner using reduced convex hulls (see [11] for a discussion of SVM geometric interpretations). [sent-22, score-0.106]

14 Another direction applies direct spectral transformations to indefinite kernels: flipping the negative eigenvalues or shifting the kernel’s eigenvalues and reconstructing the kernel with the original eigenvectors in order to produce a positive semidefinite kernel (see [12] and [2]). [sent-23, score-0.78]

15 Yet another option is to reformulate either the maximum margin problem or its dual in order to use the indefinite kernel in a convex optimization problem (see [13]). [sent-24, score-0.46]

16 An equivalent formulation of SVM with the same objective but where the kernel appears in the constraints can be modified to a convex problem by eliminating the kernel from the objective. [sent-25, score-0.65]

17 2 Contribution Here, instead of directly transforming the indefinite kernel, we simultaneously learn the support vector weights and a proxy positive semidefinite kernel matrix, while penalizing the distance between this proxy kernel and the original, indefinite one. [sent-28, score-0.819]

18 Our main result is that the kernel learning part of that problem can be solved explicitly, meaning that the classification problem with indefinite kernels can simply be formulated as a perturbation of the positive semidefinite case. [sent-29, score-0.532]

19 Our formulation can also be interpreted as a worst-case robust classification problem with uncertainty on the kernel matrix. [sent-30, score-0.309]

20 In that sense, indefinite similarity matrices are seen as noisy observations of an unknown positive semidefinite kernel. [sent-31, score-0.129]

21 From a complexity standpoint, while the original SVM classification problem with indefinite kernel is nonconvex, the robustification we detail here is a convex problem, and hence can be solved efficiently with guaranteed complexity bounds. [sent-32, score-0.46]

22 In Section 2 we formulate our main classification problem and detail its interpretation as a robust SVM. [sent-34, score-0.125]

23 2 SVM with indefinite kernels Here, we introduce our robustification of the SVM classification problem with indefinite kernels. [sent-37, score-0.163]

24 1 Robust classification Let K ∈ Sn be a given kernel matrix and y ∈ Rn be the vector of labels, with Y = diag(y) the matrix with diagonal y, where Sn is the set of symmetric matrices of size n and Rn is the set of n-vectors of real numbers. [sent-39, score-0.376]

25 We can write the dual of the SVM classification problem with hinge loss and quadratic penalty as: maximize αT e − Tr(K(Y α)(Y α)T)/2 subject to αT y = 0, 0 ≤ α ≤ C (1) in the variable α ∈ Rn and where e is an n-vector of ones. [sent-40, score-0.19]
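When K is positive semidefinite, problem (1) is a convex quadratic program that any off-the-shelf solver can handle. The snippet below is a minimal sketch using numpy and cvxpy; svm_dual and its arguments are illustrative names rather than the authors' code, and a PSD factor of Y K Y is formed explicitly so the objective is an explicit sum of squares.

```python
# Minimal sketch of the convex SVM dual (1) for a PSD kernel K, labels y in {-1,+1}
# and box bound C; all names are illustrative.
import numpy as np
import cvxpy as cp

def svm_dual(K, y, C):
    n = len(y)
    YKY = np.diag(y) @ K @ np.diag(y)              # Y K Y with Y = diag(y)
    w, V = np.linalg.eigh((YKY + YKY.T) / 2.0)     # symmetrize before factoring
    L = V * np.sqrt(np.clip(w, 0.0, None))         # YKY ~= L @ L.T (PSD factor)
    alpha = cp.Variable(n)
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(alpha @ L))
    constraints = [alpha @ y == 0, alpha >= 0, alpha <= C]
    cp.Problem(objective, constraints).solve()
    return alpha.value
```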

26 When K is positive semidefinite, this problem is a convex quadratic program. [sent-41, score-0.178]

27 Suppose now that we are given an indefinite kernel matrix K0 ∈ Sn . [sent-42, score-0.312]

28 This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K. [sent-44, score-0.349]

29 The above problem is infeasible for some values of β so we replace here the hard constraint on K by a penalty on the distance between the proxy positive semidefinite kernel and the given indefinite matrix. [sent-45, score-0.464]

30 The problem we solve is now: max_{αT y=0, 0≤α≤C} min_{K⪰0} [αT e − Tr(K(Y α)(Y α)T)/2 + ρ‖K − K0‖F²] (3) in the variables K ∈ Sn and α ∈ Rn, where the parameter ρ > 0 controls the magnitude of the penalty on the distance between K and K0. [sent-46, score-0.085]

31 The inner minimization problem is a convex conic program on K. [sent-47, score-0.196]

32 Also, as the pointwise minimum of a family of concave quadratic functions of α, the solution to the inner problem is a concave function of α, and hence the outer optimization problem is also convex (see [15] for further details). [sent-48, score-0.266]

33 Thus, (3) is a concave maximization problem subject to linear constraints and is therefore a convex problem in α. [sent-49, score-0.166]

34 Our key result here is that the inner kernel learning optimization problem can be solved in closed form. [sent-50, score-0.381]

35 This is the projection of K0 + (1/4ρ)(Y α)(Y α)T onto the cone of positive semidefinite matrices. [sent-52, score-0.117]

36 The optimal solution to this problem is then given by: K∗ = (K0 + (1/4ρ)(Y α)(Y α)T)+ (4) where X+ denotes the positive part of the matrix X, i.e. [sent-53, score-0.098]

37 X+ = Σi max(0, λi) xi xiT, where λi and xi are the ith eigenvalue and eigenvector of the matrix X. [sent-55, score-0.215]
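A minimal sketch of the closed-form solution (4): form the rank-one update of K0 and project it onto the PSD cone by zeroing negative eigenvalues. The function names psd_part and proxy_kernel are illustrative, and Y α is formed as the elementwise product y * α.

```python
# Sketch of eq. (4): K* = (K0 + (1/(4 rho)) (Y a)(Y a)^T)_+ via an eigendecomposition.
import numpy as np

def psd_part(X):
    """Positive part X_+ = sum_i max(0, lambda_i) x_i x_i^T of a symmetric matrix X."""
    lam, V = np.linalg.eigh((X + X.T) / 2.0)
    return (V * np.clip(lam, 0.0, None)) @ V.T

def proxy_kernel(K0, y, alpha, rho):
    Ya = y * alpha                                  # Y @ alpha with Y = diag(y)
    return psd_part(K0 + np.outer(Ya, Ya) / (4.0 * rho))
```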

38 Plugging this solution into (3), we get: max_{αT y=0, 0≤α≤C} αT e − Tr(K∗(Y α)(Y α)T)/2 + ρ‖K∗ − K0‖F² in the variable α ∈ Rn, where (Y α)(Y α)T is the rank one matrix with coefficients yi αi αj yj, i, j = 1, …, n. [sent-56, score-0.076]
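Given the closed form for K∗, the reduced objective above can be evaluated directly at any feasible α. A small sketch, reusing the illustrative proxy_kernel helper from the previous snippet:

```python
# Evaluate the reduced objective: alpha^T e - Tr(K*(Ya)(Ya)^T)/2 + rho ||K* - K0||_F^2.
import numpy as np

def objective(K0, y, alpha, rho):
    Ya = y * alpha
    Ks = proxy_kernel(K0, y, alpha, rho)            # optimal K* from eq. (4)
    return (alpha.sum()
            - 0.5 * Ya @ Ks @ Ya                    # Tr(K*(Ya)(Ya)^T) = Ya^T K* Ya
            + rho * np.linalg.norm(Ks - K0, 'fro') ** 2)
```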

39 We can rewrite this as an eigenvalue optimization problem by using the eigenvalue representation of X+ . [sent-60, score-0.233]

40 Using the same technique, we can also rewrite the term ‖K∗ − K0‖F² using this eigenvalue decomposition. [sent-62, score-0.102]

41 2 Dual problem Because problem (3) is convex with at least one compact feasible set, we can formulate the dual problem to (5) by simply switching the max and the min. [sent-65, score-0.273]

42 The inner maximization is a quadratic program in α, and hence has a quadratic program as its dual. [sent-66, score-0.26]

43 We then get the dual by plugging this inner dual quadratic program into the outer minimization, to get the following problem: minimize Tr(K−1(Y (e − λ + µ + yν))(Y (e − λ + µ + yν))T)/2 + CµT e + ρ‖K − K0‖F² subject to K ⪰ 0, λ, µ ≥ 0 (6) in the variables K ∈ Sn, λ, µ ∈ Rn and ν ∈ R. [sent-67, score-0.38]

44 This dual problem is a quadratic program in the variables λ and µ, which correspond to the primal constraints 0 ≤ α ≤ C, and ν, which is the dual variable for the constraint αT y = 0. [sent-68, score-0.27]

45 As we have seen earlier, any feasible solution to the primal problem produces a corresponding kernel in (4), and plugging this kernel into the dual problem in (6) allows us to calculate a dual feasible point by solving a quadratic program which gives a dual objective value, i. [sent-69, score-1.094]

46 This bound can then be used to compute a duality gap and track convergence. [sent-72, score-0.161]

47 3 Interpretation We noted that our problem can be viewed as a worst-case robust classification problem with uncertainty on the kernel matrix. [sent-74, score-0.309]

48 Our explicit solution of the optimal worst-case kernel given in (4) is the projection of a penalized rank-one update to the indefinite kernel on the cone of positive semidefinite matrices. [sent-75, score-0.661]

49 As ρ tends to infinity, the rank-one update has less effect and in the limit, the optimal kernel is the kernel given by zeroing out the negative eigenvalues of the indefinite kernel. [sent-76, score-0.622]

50 This means that if the indefinite kernel contains a very small amount of noise, the best positive semidefinite kernel to use with SVM in our framework is the positive part of the indefinite kernel. [sent-77, score-0.66]

51 This limit as ρ tends to infinity also motivates a heuristic for the transformation of the kernel on the testing set. [sent-78, score-0.308]

52 Since the negative eigenvalues of the training kernel are thresholded to zero in the limit, the same transformation should occur for the test kernel. [sent-79, score-0.386]

53 Hence, we update the entries of the full kernel corresponding to training instances by the rank-one update resulting from the optimal solution to (3) and threshold the negative eigenvalues of the full kernel matrix to zero. [sent-80, score-0.662]

54 We then use the test kernel values from the resulting positive semidefinite matrix. [sent-81, score-0.33]
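The test-set heuristic above can be sketched as follows, assuming K_full is the kernel over training plus test points with the training block in the top-left corner; transform_full_kernel is an illustrative name and psd_part is the helper sketched after eq. (4) above.

```python
# Sketch of the test-kernel heuristic: rank-one update on the training block,
# then zero out negative eigenvalues of the whole (train+test) kernel matrix.
import numpy as np

def transform_full_kernel(K_full, y_train, alpha_star, rho, n_train):
    Ya = y_train * alpha_star
    K = K_full.copy()
    K[:n_train, :n_train] += np.outer(Ya, Ya) / (4.0 * rho)   # update training entries
    return psd_part(K)                                          # threshold negative eigenvalues
```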

55 The optimization problem is the maximization of a nondifferentiable concave function subject to convex constraints. [sent-83, score-0.234]

56 For numerical stability, in both algorithms, we quadratically smooth our objective to calculate a gradient instead. [sent-85, score-0.111]

57 We first describe a simple projected gradient method which has numerically cheap iterations but has no convergence bound. [sent-86, score-0.161]

58 We then show how to apply the much more efficient analytic center cutting plane method whose iterations are slightly more complex but which converges linearly. [sent-87, score-0.485]

59 Gradient. Calculating the gradient of our objective requires a full eigenvalue decomposition to compute the gradient of each eigenvalue. [sent-92, score-0.233]

60 Given a matrix X(α), the derivative of the ith eigenvalue with respect to α is given by: ∂λi(X(α))/∂α = viT (∂X(α)/∂α) vi (7) where vi is the ith eigenvector of X(α). [sent-93, score-0.501]
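For the specific matrix X(α) = K0 + (Y α)(Y α)T/(4ρ), equation (7) gives an explicit gradient for each eigenvalue, since ∂X/∂αj is a symmetric rank-two matrix. The sketch below assumes distinct eigenvalues (as discussed next); a finite-difference check on a random instance is a quick way to validate it.

```python
# Sketch of eq. (7) for X(alpha) = K0 + (Ya)(Ya)^T/(4 rho): the gradient of the
# i-th eigenvalue w.r.t. alpha reduces to y * v_i * (v_i . Ya) / (2 rho).
import numpy as np

def eig_grad(K0, y, alpha, rho, i):
    Ya = y * alpha
    lam, V = np.linalg.eigh(K0 + np.outer(Ya, Ya) / (4.0 * rho))
    v = V[:, i]                                     # i-th eigenvector (ascending order)
    # v^T (dX/d alpha_j) v = (y_j v_j (v . Ya) + (v . Ya) y_j v_j) / (4 rho)
    return y * v * (v @ Ya) / (2.0 * rho)
```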

61 We note that eigenvalues of symmetric matrices are not differentiable when some of them have multiplicities greater than one (see [17] for a discussion). [sent-95, score-0.141]

62 In practice however, most tested kernels were of full rank with distinct eigenvalues so we ignore this issue here. [sent-96, score-0.241]

63 One may also consider projected subgradient methods, which are much slower, or use subgradients for analytic center cutting plane methods (which does not affect complexity). [sent-97, score-0.516]

64 1 Projected gradient method The projected gradient method takes a steepest descent step, then projects the new point back onto the feasible region (see [18] for example). [sent-99, score-0.286]

65 In order to use these methods the objective function must be differentiable and the method is only efficient if the projection step is numerically cheap. [sent-100, score-0.157]

66 We choose an initial point α0 ∈ Rn and the algorithm proceeds as follows: Projected gradient method 1. [sent-101, score-0.076]

67 The complexity of each iteration breaks down as follows. [sent-107, score-0.095]

68 This requires an eigenvalue decomposition and costs O(n³). [sent-109, score-0.102]

69 We note that a line search would be costly because it would require multiple eigenvalue decompositions to recalculate the objective multiple times. [sent-110, score-0.135]

70 This is a projection onto the region A = {αT y = 0, 0 ≤ α ≤ C} and can be solved explicitly by sorting the vector of entries, with cost O(n log n). [sent-112, score-0.152]
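A simple way to write this projection down is to bisect on the multiplier of the equality constraint rather than sort breakpoints; the sorting scheme mentioned in the text achieves O(n log n), while the bisection below is an equally short alternative sketch (assuming yi ∈ {−1, +1}).

```python
# Sketch of the Euclidean projection onto A = {alpha : alpha^T y = 0, 0 <= alpha <= C}
# by bisection on the multiplier nu of the equality constraint.
import numpy as np

def project_onto_A(z, y, C, iters=100):
    def alpha_of(nu):
        return np.clip(z - nu * y, 0.0, C)
    lo, hi = -(np.abs(z).max() + C), np.abs(z).max() + C   # bracket: g(lo) >= 0 >= g(hi)
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        if y @ alpha_of(nu) > 0:      # g(nu) = y^T alpha(nu) is nonincreasing in nu
            lo = nu
        else:
            hi = nu
    return alpha_of(0.5 * (lo + hi))
```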

71 We can compute a duality gap using the results of §2.2. [sent-114, score-0.161]

72 Let Ki = (K0 + (Y αi)(Y αi)T/4ρ)+ (the kernel at iteration i); then solving problem (1), which is just an SVM with a convex kernel Ki, produces an upper bound on (5), and hence a bound on the suboptimality of the current solution. [sent-115, score-0.688]
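A sketch of this duality-gap check, reusing the illustrative svm_dual, proxy_kernel and objective helpers from the earlier snippets: fixing K = Ki in (3) removes the inner minimization, so the SVM value with kernel Ki plus the constant penalty ρ‖Ki − K0‖F² upper-bounds (5).

```python
# Duality-gap sketch: solve the convex SVM (1) with the current proxy kernel Ki
# and compare the resulting upper bound with the current objective value.
import numpy as np

def duality_gap(K0, y, alpha_i, rho, C):
    Ki = proxy_kernel(K0, y, alpha_i, rho)          # kernel at iteration i, eq. (4)
    a = svm_dual(Ki, y, C)                          # SVM with the convex kernel Ki
    Ya = y * a
    upper = a.sum() - 0.5 * Ya @ Ki @ Ya + rho * np.linalg.norm(Ki - K0, 'fro') ** 2
    return upper - objective(K0, y, alpha_i, rho)   # nonnegative suboptimality bound
```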

73 The method then works as follows (see [18] for a more complete reference on cutting plane methods): Analytic center cutting plane method 1. [sent-122, score-0.602]

74 Compute αi+1 as the analytic center of Ai by solving: αi+1 = argmin_{y∈Rn} −Σ_{i=1,…,m} log(bi − aiT y), where aiT represents the ith row of coefficients from the left-hand side of {A0 α ≤ b0}. [sent-123, score-0.241]
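The analytic-centering step is an unconstrained minimization of a log-barrier over the current localization polyhedron {x : A x ≤ b}. The sketch below uses a bare-bones damped Newton iteration with a feasibility backtracking step; production ACCPM codes add a proper line search, cut elimination and warm starts, so treat this only as an illustration.

```python
# Sketch of the analytic center of {x : A x <= b}: minimize -sum_i log(b_i - a_i^T x)
# with damped Newton steps, starting from a strictly feasible x0 (A x0 < b).
import numpy as np

def analytic_center(A, b, x0, iters=50):
    x = x0.astype(float).copy()
    for _ in range(iters):
        s = b - A @ x                               # slacks, must stay positive
        g = A.T @ (1.0 / s)                         # gradient of the barrier
        H = A.T @ np.diag(1.0 / s**2) @ A           # Hessian of the barrier
        dx = np.linalg.solve(H + 1e-12 * np.eye(len(x)), -g)
        t = 1.0
        while np.any(b - A @ (x + t * dx) <= 0):    # backtrack to stay strictly feasible
            t *= 0.5
        x = x + t * dx
    return x
```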

75 Compute ∇f(αi+1) at the center αi+1 and update the (polyhedral) localization set: Ai+1 = Ai ∪ {∇f(αi+1)T(α − αi+1) ≥ 0}. [sent-125, score-0.119]

76 The complexity of each iteration breaks down as follows. [sent-127, score-0.095]

77 This step computes the analytic center of a polyhedron and can be solved in O(n³) operations using interior point methods for example. [sent-129, score-0.231]

78 ACCPM is provably convergent in O(n log(1/ε)²) iterations when using cut elimination, which keeps the complexity of the localization set bounded. [sent-135, score-0.142]

79 We also test SVM classification performance on positive semidefinite kernels using the LIBSVM library. [sent-138, score-0.221]

80 1 Generalization We compare our method for SVM classification with indefinite kernels to several of the kernel preprocessing techniques discussed earlier. [sent-142, score-0.462]

81 The first, denoted denoise, thresholds the negative eigenvalues to zero. [sent-144, score-0.078]

82 The last transformation, shift, adds a constant to each eigenvalue making them all positive. [sent-146, score-0.102]
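The spectral baselines (denoise, the flipping transformation from sentence 14, and shift) can all be written as one eigendecomposition followed by a different eigenvalue map before reconstructing with the original eigenvectors. The sketch below is illustrative only and assumes a symmetric input matrix.

```python
# Sketch of the spectral preprocessing baselines: clip negative eigenvalues (denoise),
# take absolute values (flip), or shift the spectrum so the smallest eigenvalue is zero (shift).
import numpy as np

def spectral_transform(K0, mode):
    lam, V = np.linalg.eigh((K0 + K0.T) / 2.0)
    if mode == 'denoise':
        lam = np.clip(lam, 0.0, None)
    elif mode == 'flip':
        lam = np.abs(lam)
    elif mode == 'shift':
        lam = lam - min(lam.min(), 0.0)
    else:
        raise ValueError('unknown mode')
    return (V * lam) @ V.T
```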

83 We finally also compare with using SVM on the original indefinite kernel (SVM converges but the solution is only a stationary point and is not guaranteed to be optimal). [sent-148, score-0.302]

84 Table 1 provides statistics including the minimum and maximum eigenvalues of the training kernels. [sent-162, score-0.078]

85 The main observation is that the USPS data uses highly indefinite kernels while the UCI data use kernels that are nearly positive semidefinite. [sent-163, score-0.384]

86 The liver data set uses a kernel that is almost positive semidefinite; this is an example where the input is almost a true kernel and Indefinite SVM finds one that is slightly better. [sent-172, score-0.637]

87 We postulate that our method will perform better in situations where the similarity measure is highly indefinite, as in the USPS dataset, while measures that are almost positive semidefinite may be seen as having a small amount of noise. [sent-173, score-0.168]

88 The average results with one standard deviation above and below the mean are displayed in Figure 1 with the duality gap in log scale (note that the codes were not stopped here and that the target gap improvement is usually much smaller than 10⁻⁸). [sent-217, score-0.248]

89 As expected, ACCPM converges much faster (in fact linearly) to a higher precision while each iteration requires solving a linear program of size n. [sent-218, score-0.176]

90 The gradient projection method converges faster in the beginning but stalls at a lower precision; however, each iteration only requires sorting the current point. [sent-219, score-0.21]

91 5 Conclusion We have proposed a technique for incorporating indefinite kernels into the SVM framework without any explicit transformations. [sent-221, score-0.192]

92 We have shown that if we view the indefinite kernel as a noisy instance of a true kernel, we can learn an explicit solution for the optimal kernel with a tractable convex optimization problem. [sent-222, score-0.646]

93 Our initial experiments show that our method can at least fare comparably with other methods handling indefinite kernels in the SVM framework but provides a much clearer interpretation for these heuristics. [sent-224, score-0.219]

94 An assessment of alternative strategies for constructing emd-based kernel functions for use in an svm for image classification. [sent-237, score-0.438]

95 Multiple kernel learning, conic duality, and the smo algorithm. [sent-291, score-0.334]

96 Permanents, transport polytopes and positive definite kernels on histograms. [sent-301, score-0.221]

97 An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. [sent-319, score-0.395]

98 A study on sigmoid kernel for svm and the training of non-psd kernels by smo-type methods. [sent-326, score-0.601]

99 A regularization method for solving the finite convex min-max problem. [sent-342, score-0.13]

100 Convex nondifferentiable optimization: A survey focused on the analytic center cutting plane method. [sent-357, score-0.467]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('inde', 0.711), ('kernel', 0.272), ('semide', 0.223), ('svm', 0.166), ('kernels', 0.163), ('nite', 0.162), ('cutting', 0.134), ('analytic', 0.116), ('eigenvalue', 0.102), ('plane', 0.102), ('accpm', 0.089), ('gap', 0.087), ('dual', 0.086), ('proxy', 0.085), ('usps', 0.082), ('vi', 0.079), ('eigenvalues', 0.078), ('classi', 0.077), ('center', 0.076), ('duality', 0.074), ('sn', 0.074), ('convex', 0.073), ('rn', 0.068), ('tr', 0.061), ('projected', 0.06), ('positive', 0.058), ('program', 0.051), ('princeton', 0.049), ('gradient', 0.049), ('ith', 0.049), ('quadratic', 0.047), ('similarity', 0.047), ('feasible', 0.047), ('cation', 0.045), ('denoise', 0.045), ('robusti', 0.045), ('localization', 0.043), ('nonconvex', 0.043), ('iteration', 0.041), ('lanckriet', 0.041), ('inner', 0.041), ('matrix', 0.04), ('differentiable', 0.039), ('uci', 0.039), ('nondifferentiable', 0.039), ('ong', 0.039), ('polyhedral', 0.039), ('orfe', 0.039), ('solved', 0.039), ('concave', 0.038), ('plugging', 0.037), ('robust', 0.037), ('max', 0.036), ('transformation', 0.036), ('measures', 0.036), ('liver', 0.035), ('recall', 0.035), ('hulls', 0.033), ('pe', 0.033), ('projection', 0.033), ('objective', 0.033), ('subject', 0.032), ('conic', 0.031), ('smo', 0.031), ('formulate', 0.031), ('solving', 0.03), ('converges', 0.03), ('breaks', 0.03), ('sorting', 0.03), ('keeps', 0.03), ('technique', 0.029), ('optimization', 0.029), ('interpretation', 0.029), ('numerical', 0.029), ('subgradient', 0.028), ('detail', 0.028), ('ai', 0.028), ('cristianini', 0.027), ('region', 0.027), ('method', 0.027), ('cone', 0.026), ('numerically', 0.025), ('penalty', 0.025), ('accuracy', 0.025), ('complexity', 0.024), ('international', 0.024), ('precision', 0.024), ('distance', 0.024), ('eigenvector', 0.024), ('matrices', 0.024), ('convergent', 0.023), ('support', 0.023), ('explicitly', 0.023), ('maximization', 0.023), ('proceedings', 0.023), ('cut', 0.022), ('transformations', 0.022), ('handwritten', 0.022), ('stopping', 0.022), ('protein', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

2 0.14179489 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

Author: Francis R. Bach, Zaïd Harchaoui

Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.

3 0.13840529 109 nips-2007-Kernels on Attributed Pointsets with Applications

Author: Mehul Parsana, Sourangshu Bhattacharya, Chiru Bhattacharya, K. Ramakrishnan

Abstract: This paper introduces kernels on attributed pointsets, which are sets of vectors embedded in an euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attribute similarity and the other evaluating shape similarity. Shape similarity function is motivated from spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results. 1

4 0.13235219 112 nips-2007-Learning Monotonic Transformations for Classification

Author: Andrew Howard, Tony Jebara

Abstract: A discriminative method is proposed for learning monotonic transformations of the training data while jointly estimating a large-margin classifier. In many domains such as document classification, image histogram classification and gene microarray experiments, fixed monotonic transformations can be useful as a preprocessing step. However, most classifiers only explore these transformations through manual trial and error or via prior domain knowledge. The proposed method learns monotonic transformations automatically while training a large-margin classifier without any prior knowledge of the domain. A monotonic piecewise linear function is learned which transforms data for subsequent processing by a linear hyperplane classifier. Two algorithmic implementations of the method are formalized. The first solves a convergent alternating sequence of quadratic and linear programs until it obtains a locally optimal solution. An improved algorithm is then derived using a convex semidefinite relaxation that overcomes initialization issues in the greedy optimization problem. The effectiveness of these learned transformations on synthetic problems, text data and image data is demonstrated. 1

5 0.12391949 160 nips-2007-Random Features for Large-Scale Kernel Machines

Author: Ali Rahimi, Benjamin Recht

Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1

6 0.11828671 118 nips-2007-Learning with Transformation Invariant Kernels

7 0.10321877 134 nips-2007-Multi-Task Learning via Conic Programming

8 0.094853319 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

9 0.088649258 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

10 0.088617802 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

11 0.087746926 62 nips-2007-Convex Learning with Invariances

12 0.087463677 49 nips-2007-Colored Maximum Variance Unfolding

13 0.085437365 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

14 0.079982623 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

15 0.077326007 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis

16 0.07404153 40 nips-2007-Bundle Methods for Machine Learning

17 0.070727222 7 nips-2007-A Kernel Statistical Test of Independence

18 0.067101941 99 nips-2007-Hierarchical Penalization

19 0.06533625 185 nips-2007-Stable Dual Dynamic Programming

20 0.062314283 186 nips-2007-Statistical Analysis of Semi-Supervised Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.204), (1, 0.025), (2, -0.116), (3, 0.135), (4, -0.022), (5, -0.01), (6, 0.054), (7, 0.057), (8, -0.126), (9, 0.163), (10, 0.109), (11, -0.127), (12, 0.072), (13, 0.017), (14, -0.121), (15, -0.086), (16, 0.02), (17, -0.067), (18, 0.031), (19, 0.088), (20, -0.027), (21, 0.052), (22, 0.001), (23, 0.07), (24, -0.003), (25, -0.024), (26, -0.118), (27, 0.1), (28, 0.017), (29, -0.039), (30, -0.021), (31, -0.014), (32, 0.127), (33, 0.04), (34, 0.002), (35, 0.031), (36, 0.086), (37, 0.03), (38, -0.009), (39, -0.004), (40, -0.006), (41, 0.011), (42, -0.049), (43, -0.035), (44, 0.118), (45, 0.008), (46, 0.0), (47, 0.063), (48, 0.026), (49, 0.105)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97037315 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

2 0.70136213 112 nips-2007-Learning Monotonic Transformations for Classification

Author: Andrew Howard, Tony Jebara

Abstract: A discriminative method is proposed for learning monotonic transformations of the training data while jointly estimating a large-margin classifier. In many domains such as document classification, image histogram classification and gene microarray experiments, fixed monotonic transformations can be useful as a preprocessing step. However, most classifiers only explore these transformations through manual trial and error or via prior domain knowledge. The proposed method learns monotonic transformations automatically while training a large-margin classifier without any prior knowledge of the domain. A monotonic piecewise linear function is learned which transforms data for subsequent processing by a linear hyperplane classifier. Two algorithmic implementations of the method are formalized. The first solves a convergent alternating sequence of quadratic and linear programs until it obtains a locally optimal solution. An improved algorithm is then derived using a convex semidefinite relaxation that overcomes initialization issues in the greedy optimization problem. The effectiveness of these learned transformations on synthetic problems, text data and image data is demonstrated. 1

3 0.69186509 118 nips-2007-Learning with Transformation Invariant Kernels

Author: Christian Walder, Olivier Chapelle

Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1

4 0.6707201 109 nips-2007-Kernels on Attributed Pointsets with Applications

Author: Mehul Parsana, Sourangshu Bhattacharya, Chiru Bhattacharya, K. Ramakrishnan

Abstract: This paper introduces kernels on attributed pointsets, which are sets of vectors embedded in an euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attribute similarity and the other evaluating shape similarity. Shape similarity function is motivated from spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results. 1

5 0.63999981 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

Author: Francis R. Bach, Zaïd Harchaoui

Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.

6 0.62891358 160 nips-2007-Random Features for Large-Scale Kernel Machines

7 0.62721616 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

8 0.59376901 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

9 0.58046448 24 nips-2007-An Analysis of Inference with the Universum

10 0.5599159 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

11 0.55899525 70 nips-2007-Discriminative K-means for Clustering

12 0.5472399 62 nips-2007-Convex Learning with Invariances

13 0.53395933 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

14 0.52853835 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

15 0.52442175 49 nips-2007-Colored Maximum Variance Unfolding

16 0.51181674 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

17 0.50746912 134 nips-2007-Multi-Task Learning via Conic Programming

18 0.5043726 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

19 0.45953649 7 nips-2007-A Kernel Statistical Test of Independence

20 0.44771481 40 nips-2007-Bundle Methods for Machine Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.043), (7, 0.118), (9, 0.026), (13, 0.075), (16, 0.033), (18, 0.015), (19, 0.023), (21, 0.052), (31, 0.022), (34, 0.029), (35, 0.05), (47, 0.069), (49, 0.029), (83, 0.208), (85, 0.044), (87, 0.012), (90, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94955534 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

Author: M. M. Mahmud, Sylvian Ray

Abstract: In transfer learning we aim to solve new problems using fewer examples using information gained from solving related problems. Transfer learning has been successful in practice, and extensive PAC analysis of these methods has been developed. However it is not yet clear how to define relatedness between tasks. This is considered as a major problem as it is conceptually troubling and it makes it unclear how much information to transfer and when and how to transfer it. In this paper we propose to measure the amount of information one task contains about another using conditional Kolmogorov complexity between the tasks. We show how existing theory neatly solves the problem of measuring relatedness and transferring the ‘right’ amount of information in sequential transfer learning in a Bayesian setting. The theory also suggests that, in a very formal and precise sense, no other reasonable transfer method can do much better than our Kolmogorov Complexity theoretic transfer method, and that sequential transfer is always justified. We also develop a practical approximation to the method and use it to transfer information between 8 arbitrarily chosen databases from the UCI ML repository. 1

2 0.9347524 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

Author: Maryam Mahdaviani, Tanzeem Choudhury

Abstract: We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs – a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. The objective function of sVEB combines the unlabeled conditional entropy with labeled conditional pseudo-likelihood. It reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. Experiments on synthetic data and real activity traces collected from wearable sensors, illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches. 1

same-paper 3 0.919613 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

4 0.90175229 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration

Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding

Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception.
2 Models The success of Bayesian models for cue integration has motivated attempts to extend them to situations of large sensory conflict and a consequent low degree of integration. In one of recent studies taking this approach, subjects were presented with concurrent visual flashes and auditory beeps and asked to count both the number of flashes and the number of beeps [11]. The advantage of the experimental paradigm adopted here was that it probed the joint response distribution by requiring a dual report. Human data were accounted for well by a Bayesian model in which the joint prior distribution over visual and auditory number was approximated from the data. In a similar study, subjects were presented with concurrent flashes and taps and asked to count either the flashes or the taps [9, 22]. The Bayesian model proposed by these authors assumed a joint prior distribution with a near-diagonal form. The corresponding generative model assumes that the sensory sources somehow interact with one another. A third experiment modulated the rates of flashes and beeps. The task was to judge either the visual or the auditory modulation rate relative to a standard [23]. The data from this experiment were modeled using a joint prior distribution which is the sum of a near-diagonal prior and a flat background. While all these models are Bayesian in a formal sense, their underlying generative model does not formalize the model selection process that underlies the combination of cues. This makes it necessary to either estimate an empirical prior [11] by fitting it to human behavior or to assume an ad hoc form [22, 23]. However, we believe that such assumptions are not needed. It was shown recently that human judgments of spatial unity in an auditory-visual spatial localization task can be described using a Bayesian inference model that infers causal structure [24, 25]. In this model, the brain does not only estimate a stimulus variable, but also infers the probability that the two stimuli have a common cause. In this paper we compare these different models on a large data set of human position estimates in an auditory-visual task. In this section we first describe the traditional cue integration model, then the recent models based on joint stimulus priors, and finally the causal inference model. To relate to the experiment in the next section, we will use the terminology of auditory-visual spatial localization, but the formalism is very general. 2.1 Traditional cue integration The traditional generative model of cue integration [26] has a single source location s which produces on each trial an internal representation (cue) of visual location, xV and one of auditory location, xA . We assume that the noise processes by which these internal representations are generated are conditionally independent from each other and follow Gaussian distributions. That is, p (xV |s) ∼ N (xV ; s, σV )and p (xA |s) ∼ N (xA ; s, σA ), where N (x; µ, σ) stands for the normal distribution over x with mean µ and standard deviation σ. If on a given trial the internal representations are xV and xA , the probability that their source was s is given by Bayes’ rule, p (s|xV , xA ) ∝ p (xV |s) p (xA |s) . If a subject performs maximum-likelihood estimation, then the estimate will be xV +wA s = wV wV +wA xA , where wV = σ1 and wA = σ1 . It is important to keep in mind that this is the ˆ 2 2 V A estimate on a single trial. 
A psychophysical experimenter can never have access to xV and xA , which 2 are the noisy internal representations. Instead, an experimenter will want to collect estimates over many trials and is interested in the distribution of s given sV and sA , which are the sources generated ˆ by the experimenter. In a typical cue combination experiment, xV and xA are not actually generated by the same source, but by different sources, a visual one sV and an auditory one sA . These sources are chosen close to each other so that the subject can imagine that the resulting cues originate from a single source and thus implicitly have a common cause. The experimentally observed distribution is then p (ˆ|sV , sA ) = s p (ˆ|xV , xA ) p (xV |sV ) p (xA |sA ) dxV dxA s Given that s is a linear combination of two normally distributed variables, it will itself follow a ˆ sV +wA 1 2 normal distribution, with mean s = wVwV +wA sA and variance σs = wV +wA . The reason that we ˆ ˆ emphasize this point is because many authors identify the estimate distribution p (ˆ|sV , sA ) with s the posterior distribution p (s|xV , xA ). This is justified in this case because all distributions are Gaussian and the estimate is a linear combination of cues. However, in the case of causal inference, these conditions are violated and the estimate distribution will in general not be the same as the posterior distribution. 2.2 Models with bisensory stimulus priors Models with bisensory stimulus priors propose the posterior over source positions to be proportional to the product of unimodal likelihoods and a two-dimensional prior: p (sV , sA |xV , xA ) = p (sV , sA ) p (xV |sV ) p (xA |sA ) The traditional cue combination model has p (sV , sA ) = p (sV ) δ (sV − sA ), usually (as above) even with p (sV ) uniform. The question arises what bisensory stimulus prior is appropriate. In [11], the prior is estimated from data, has a large number of parameters, and is therefore limited in its predictive power. In [23], it has the form − (sV −sA )2 p (sV , sA ) ∝ ω + e 2σ 2 coupling while in [22] the additional assumption ω = 0 is made1 . In all three models, the response distribution p (ˆV , sA |sV , sA ) is obtained by idens ˆ tifying it with the posterior distribution p (sV , sA |xV , xA ). This procedure thus implicitly assumes that marginalizing over the latent variables xV and xA is not necessary, which leads to a significant error for non-Gaussian priors. In this paper we correctly deal with these issues and in all cases marginalize over the latent variables. The parametric models used for the coupling between the cues lead to an elegant low-dimensional model of cue integration that allows for estimates of single cues that differ from one another. C C=1 SA S XA 2.3 C=2 XV SV XA XV Causal inference model In the causal inference model [24, 25], we start from the traditional cue integration model but remove the assumption that two signals are caused by the same source. Instead, the number of sources can be one or two and is itself a variable that needs to be inferred from the cues. Figure 1: Generative model of causal inference. 1 This family of Bayesian posterior distributions also includes one used to successfully model cue combination in depth perception [27, 28]. In depth perception, however, there is no notion of segregation as always a single surface is assumed. 3 If there are two sources, they are assumed to be independent. Thus, we use the graphical model depicted in Fig. 1. We denote the number of sources by C. 
The probability distribution over C given internal representations xV and xA is given by Bayes’ rule: p (C|xV , xA ) ∝ p (xV , xA |C) p (C) . In this equation, p (C) is the a priori probability of C. We will denote the probability of a common cause by pcommon , so that p (C = 1) = pcommon and p (C = 2) = 1 − pcommon . The probability of generating xV and xA given C is obtained by inserting a summation over the sources: p (xV , xA |C = 1) = p (xV , xA |s)p (s) ds = p (xV |s) p (xA |s)p (s) ds Here p (s) is a prior for spatial location, which we assume to be distributed as N (s; 0, σP ). Then all three factors in this integral are Gaussians, allowing for an analytic solution: p (xV , xA |C = 1) = 2 2 2 2 2 −xA )2 σP σA √ 2 2 1 2 2 2 2 exp − 1 (xV σ2 σ2 +σ2+xV+σ2+xA σV . 2 σ2 σ2 2π σV σA +σV σP +σA σP V A V P A P For p (xV , xA |C = 2) we realize that xV and xA are independent of each other and thus obtain p (xV , xA |C = 2) = p (xV |sV )p (sV ) dsV p (xA |sA )p (sA ) dsA Again, as all these distributions are assumed to be Gaussian, we obtain an analytic solution, x2 x2 1 1 V A p (xV , xA |C = 2) = exp − 2 σ2 +σ2 + σ2 +σ2 . Now that we have com2 +σ 2 2 +σ 2 p p V A 2π (σV p )(σA p) puted p (C|xV , xA ), the posterior distribution over sources is given by p (si |xV , xA ) = p (si |xV , xA , C) p (C|xV , xA ) C=1,2 where i can be V or A and the posteriors conditioned on C are well-known: p (si |xA , xV , C = 1) = p (xA |si ) p (xV |si ) p (si ) , p (xA |s) p (xV |s) p (s) ds p (si |xA , xV , C = 2) = p (xi |si ) p (si ) p (xi |si ) p (si ) dsi The former is the same as in the case of mandatory integration with a prior, the latter is simply the unimodal posterior in the presence of a prior. Based on the posterior distribution on a given trial, p (si |xV , xA ), an estimate has to be created. For this, we use a sum-squared-error cost func2 2 tion, Cost = p (C = 1|xV , xA ) (ˆ − s) + p (C = 2|xV , xA ) (ˆ − sV or A ) . Then the best s s estimate is the mean of the posterior distribution, for instance for the visual estimation: sV = p (C = 1|xA , xV ) sV,C=1 + p (C = 2|xA , xV ) sV,C=2 ˆ ˆ ˆ where sV,C=1 = ˆ −2 −2 −2 xV σV +xA σA +xP σP −2 −2 −2 σV +σA +σP and sV,C=2 = ˆ −2 −2 xV σV +xP σP . −2 −2 σV +σP If pcommon equals 0 or 1, this estimate reduces to one of the conditioned estimates and is linear in xV and xA . If 0 < pcommon < 1, the estimate is a nonlinear combination of xV and xA , because of the functional form of p (C|xV , xA ). The response distributions, that is the distributions of sV and sA given ˆ ˆ sV and sA over many trials, now cannot be identified with the posterior distribution on a single trial and cannot be computed analytically either. The correct way to obtain the response distribution is to simulate an experiment numerically. Note that the causal inference model above can also be cast in the form of a bisensory stimulus prior by integrating out the latent variable C, with: p (sA , sV ) = p (C = 1) δ (sA − sV ) p (sA ) + p (sA ) p (sV ) p (C = 2) However, in addition to justifying the form of the interaction between the cues, the causal inference model has the advantage of being based on a generative model that well formalizes salient properties of the world, and it thereby also allows to predict judgments of unity. 4 3 Model performance and comparison To examine the performance of the causal inference model and to compare it to previous models, we performed a human psychophysics experiment in which we adopted the same dual-report paradigm as was used in [11]. 
Observers were simultaneously presented with a brief visual and also an auditory stimulus, each of which could originate from one of five locations on an imaginary horizontal line (-10◦ , -5◦ , 0◦ , 5◦ , or 10◦ with respect to the fixation point). Auditory stimuli were 32 ms of white noise filtered through an individually calibrated head related transfer function (HRTF) and presented through a pair of headphones, whereas the visual stimuli were high contrast Gabors on a noisy background presented on a 21-inch CRT monitor. Observers had to report by means of a key press (1-5) the perceived positions of both the visual and the auditory stimulus. Each combination of locations was presented with the same frequency over the course of the experiment. In this way, for each condition, visual and auditory response histograms were obtained. We obtained response distributions for each the three models described above by numeral simulation. On each trial, estimation is followed by a step in which, the key is selected which corresponds to the position closed to the best estimate. The simulated histograms obtained in this way were compared to the measured response frequencies of all subjects by computing the R2 statistic. Auditory response Auditory model Visual response Visual model no vision The parameters in the causal inference model were optimized using fminsearch in MATLAB to maximize R2 . The best combination of parameters yielded an R2 of 0.97. The response frequencies are depicted in Fig. 2. The bisensory prior models also explain most of the variance, with R2 = 0.96 for the Roach model and R2 = 0.91 for the Bresciani model. This shows that it is possible to model cue combination for large disparities well using such models. no audio 1 0 Figure 2: A comparison between subjects’ performance and the causal inference model. The blue line indicates the frequency of subjects responses to visual stimuli, red line is the responses to auditory stimuli. Each set of lines is one set of audio-visual stimulus conditions. Rows of conditions indicate constant visual stimulus, columns is constant audio stimulus. Model predictions is indicated by the red and blue dotted line. 5 3.1 Model comparison To facilitate quantitative comparison with other models, we now fit the parameters of each model2 to individual subject data, maximizing the likelihood of the model, i.e., the probability of the response frequencies under the model. The causal inference model fits human data better than the other models. Compared to the best fit of the causal inference model, the Bresciani model has a maximal log likelihood ratio (base e) of the data of −22 ± 6 (mean ± s.e.m. over subjects), and the Roach model has a maximal log likelihood ratio of the data of −18 ± 6. A causal inference model that maximizes the probability of being correct instead of minimizing the mean squared error has a maximal log likelihood ratio of −18 ± 3. These values are considered decisive evidence in favor of the causal inference model that minimizes the mean squared error (for details, see [25]). The parameter values found in the likelihood optimization of the causal model are as follows: pcommon = 0.28 ± 0.05, σV = 2.14 ± 0.22◦ , σA = 9.2 ± 1.1◦ , σP = 12.3 ± 1.1◦ (mean ± s.e.m. over subjects). We see that there is a relatively low prior probability of a common cause. In this paradigm, auditory localization is considerably less precise than visual localization. Also, there is a weak prior for central locations. 
3.2 Localization bias A useful quantity to gain more insight into the structure of multisensory data is the cross-modal bias. In our experiment, relative auditory bias is defined as the difference between the mean auditory estimate in a given condition and the real auditory position, divided by the difference between the real visual position and the real auditory position in this condition. If the influence of vision on the auditory estimate is strong, then the relative auditory bias will be high (close to one). It is well-known that bias decreases with spatial disparity and our experiment is no exception (solid line in Fig. 3; data were combined between positive and negative disparities). It can easily be shown that a traditional cue integration model would predict a bias equal to σ2 −1 , which would be close to 1 and 1 + σV 2 A independent of disparity, unlike the data. This shows that a mandatory integration model is an insufficient model of multisensory interactions. 45 % Auditory Bias We used the individual subject fittings from above and and averaged the auditory bias values obtained from those fits (i.e. we did not fit the bias data themselves). Fits are shown in Fig. 3 (dashed lines). We applied a paired t-test to the differences between the 5◦ and 20◦ disparity conditions (model-subject comparison). Using a double-sided test, the null hypothesis that the difference between the bias in the 5◦ and 20◦ conditions is correctly predicted by each model is rejected for the Bresciani model (p < 0.002) and the Roach model (p < 0.042) and accepted for the causal inference model (p > 0.17). Alternatively, with a single-sided test, the hypothesis is rejected for the Bresciani model (p < 0.001) and the Roach model (p < 0.021) and accepted for the causal inference model (> 0.9). 50 40 35 30 25 20 5 10 15 Spatial Disparity (deg.) 20 Figure 3: Auditory bias as a function of spatial disparity. Solid blue line: data. Red: Causal inference model. Green: Model by Roach et al. [23]. Purple: Model by Bresciani et al. [22]. Models were optimized on response frequencies (as in Fig. 2), not on the bias data. The reason that the Bresciani model fares worst is that its prior distribution does not include a component that corresponds to independent causes. On 2 The Roach et al. model has four free parameters (ω,σV , σA , σcoupling ), the Bresciani et al. model has three (σV , σA , σcoupling ), and the causal inference model has four (pcommon ,σV , σA , σP ). We do not consider the Shams et al. model here, since it has many more parameters and it is not immediately clear how in this model the erroneous identification of posterior with response distribution can be corrected. 6 the contrary, the prior used in the Roach model contains two terms, one term that is independent of the disparity and one term that decreases with increasing disparity. It is thus functionally somewhat similar to the causal inference model. 4 Discussion We have argued that any model of multisensory perception should account not only for situations of small, but also of large conflict. In these situations, segregation is more likely, in which the two stimuli are not perceived to have the same cause. Even when segregation occurs, the two stimuli can still influence each other. We compared three Bayesian models designed to account for situations of large conflict by applying them to auditory-visual spatial localization data. 
We pointed out a common mistake: for non-Gaussian bisensory priors without mandatory integration, the response distribution can no longer be identified with the posterior distribution. After correct implementation of the three models, we found that the causal inference model is superior to the models with ad hoc bisensory priors. This is expected, as the nervous system actually needs to solve the problem of deciding which stimuli have a common cause and which stimuli are unrelated. We have seen that multisensory perception is a suitable tool for studying causal inference. However, the causal inference model also has the potential to quantitatively explain a number of other perceptual phenomena, including perceptual grouping and binding, as well as within-modality cue combination [27, 28]. Causal inference is a universal problem: whenever the brain has multiple pieces of information, it must decide whether they relate to one another or are independent. As the causal inference model describes how the brain processes probabilistic sensory information, the question arises as to the neural basis of these processes. Neural populations can encode probability distributions over stimuli, a type of coding known as probabilistic population coding. Recent work has shown how optimal cue combination assuming a common cause can be implemented in probabilistic population codes through simple linear operations on neural activities [29]. This framework makes essential use of the structure of neural variability and leads to physiological predictions for activity in areas that combine multisensory input, such as the superior colliculus. Computational mechanisms for causal inference are expected to have a neural substrate that generalizes these linear operations on population activities. A neural implementation of the causal inference model will open the door to a complete neural theory of multisensory perception.

References

[1] H. L. Pick, D. H. Warren, and J. C. Hay. Sensory conflict in judgements of spatial direction. Percept Psychophys, 6:203–205, 1969.
[2] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory "compellingness" in the ventriloquism effect: implications for transitivity among the spatial senses. Percept Psychophys, 30(6):557–64, 1981.
[3] D. Alais and D. Burr. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 14(3):257–62, 2004.
[4] R. A. Jacobs. Optimal integration of texture and motion cues to depth. Vision Res, 39(21):3621–9, 1999.
[5] R. J. van Beers, A. C. Sittig, and J. J. Gon. Integration of proprioceptive and visual position-information: An experimentally supported model. J Neurophysiol, 81(3):1355–64, 1999.
[6] D. H. Warren and W. T. Cleaves. Visual-proprioceptive interaction under large amounts of conflict. J Exp Psychol, 90(2):206–14, 1971.
[7] C. E. Jack and W. R. Thurlow. Effects of degree of visual association and angle of displacement on the "ventriloquism" effect. Percept Mot Skills, 37(3):967–79, 1973.
[8] G. H. Recanzone. Auditory influences on visual temporal rate perception. J Neurophysiol, 89(2):1078–93, 2003.
[9] J. P. Bresciani, M. O. Ernst, K. Drewing, G. Bouyer, V. Maury, and A. Kheddar. Feeling what you hear: auditory signals can modulate tactile tap perception. Exp Brain Res, 162(2):172–80, 2005.
[10] R. Gepshtein, P. Leiderman, L. Genosar, and D. Huppert. Testing the three step excited state proton transfer model by the effect of an excess proton. J Phys Chem A Mol Spectrosc Kinet Environ Gen Theory, 109(42):9674–84, 2005.
[11] L. Shams, W. J. Ma, and U. Beierholm. Sound-induced flash illusion as an optimal percept. Neuroreport, 16(17):1923–7, 2005.
[12] G. Thomas. Experimental study of the influence of vision on sound localisation. J Exp Psychol, 28:167–177, 1941.
[13] W. R. Thurlow and C. E. Jack. Certain determinants of the "ventriloquism effect". Percept Mot Skills, 36(3):1171–84, 1973.
[14] C. S. Choe, R. B. Welch, R. M. Gilford, and J. F. Juola. The "ventriloquist effect": visual dominance or response bias. Perception and Psychophysics, 18:55–60, 1975.
[15] R. I. Bermant and R. B. Welch. Effect of degree of separation of visual-auditory stimulus and eye position upon spatial interaction of vision and audition. Percept Mot Skills, 42(43):487–93, 1976.
[16] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory discrepancy. Psychol Bull, 88(3):638–67, 1980.
[17] P. Bertelson and M. Radeau. Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys, 29(6):578–84, 1981.
[18] P. Bertelson, F. Pavani, E. Ladavas, J. Vroomen, and B. de Gelder. Ventriloquism in patients with unilateral visual neglect. Neuropsychologia, 38(12):1634–42, 2000.
[19] D. A. Slutsky and G. H. Recanzone. Temporal and spatial dependency of the ventriloquism effect. Neuroreport, 12(1):7–10, 2001.
[20] J. Lewald, W. H. Ehrenstein, and R. Guski. Spatio-temporal constraints for auditory-visual integration. Behav Brain Res, 121(1-2):69–79, 2001.
[21] M. T. Wallace, G. E. Roberson, W. D. Hairston, B. E. Stein, J. W. Vaughan, and J. A. Schirillo. Unifying multisensory signals across time and space. Exp Brain Res, 158(2):252–8, 2004.
[22] J. P. Bresciani, F. Dammeier, and M. O. Ernst. Vision and touch are automatically integrated for the perception of sequences of events. J Vis, 6(5):554–64, 2006.
[23] N. W. Roach, J. Heron, and P. V. McGraw. Resolving multisensory conflict: a strategy for balancing the costs and benefits of audio-visual integration. Proc Biol Sci, 273(1598):2159–68, 2006.
[24] K. P. Kording and D. M. Wolpert. Bayesian decision theory in sensorimotor control. Trends Cogn Sci, 2006.
[25] K. P. Kording, U. Beierholm, W. J. Ma, S. Quartz, J. Tenenbaum, and L. Shams. Causal inference in multisensory perception. PLoS ONE, 2(9):e943, 2007.
[26] Z. Ghahramani. Computation and psychophysics of sensorimotor integration. PhD thesis, Massachusetts Institute of Technology, 1995.
[27] D. C. Knill. Mixture models and the probabilistic structure of depth cues. Vision Res, 43(7):831–54, 2003.
[28] D. C. Knill. Robust cue integration: A Bayesian model and evidence from cue conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):2–24.
[29] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nat Neurosci, 9(11):1432–8, 2006.

5 0.87404519 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima.

6 0.87291986 40 nips-2007-Bundle Methods for Machine Learning

7 0.87067813 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

8 0.86810005 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

9 0.86397737 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

10 0.8627044 134 nips-2007-Multi-Task Learning via Conic Programming

11 0.86201489 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

12 0.86199468 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

13 0.86133778 116 nips-2007-Learning the structure of manifolds using random projections

14 0.86117518 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

15 0.86018223 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

16 0.85986382 102 nips-2007-Incremental Natural Actor-Critic Algorithms

17 0.85851723 187 nips-2007-Structured Learning with Approximate Inference

18 0.85804874 21 nips-2007-Adaptive Online Gradient Descent

19 0.85764092 84 nips-2007-Expectation Maximization and Posterior Constraints

20 0.85761607 70 nips-2007-Discriminative K-means for Clustering