nips nips2005 nips2005-161 knowledge-graph by maker-knowledge-mining

161 nips-2005-Radial Basis Function Network for Multi-task Learning


Source: pdf

Author: Xuejun Liao, Lawrence Carin

Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. [sent-5, score-0.295]

2 We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. [sent-6, score-0.166]

3 Training data may also be actively selected to improve the network’s generalization to test data. [sent-7, score-0.103]

4 Often these tasks are not independent, implying what is learned from one task is transferable to another correlated task. [sent-10, score-0.14]

5 In machine learning, the concept of explicitly exploiting the transferability of expertise between tasks, by learning the tasks simultaneously under a unified representation, is formally referred to as “multi-task learning” [1]. [sent-12, score-0.131]

6 In this paper we extend radial basis function (RBF) networks [4,5] to the scenario of multitask learning and present the corresponding learning algorithms. [sent-13, score-0.247]

7 In the other extreme, when all the tasks are independent, there is no correlation to utilize and we learn each task separately. [sent-20, score-0.149]

8 We define the structure of multi-task RBF network in Section 2 and present the supervised learning algorithm in Section 3. [sent-22, score-0.15]

9 In Section 4 we show how to learn the network structure in an unsupervised manner, and based on this we demonstrate how to actively select the training data, with the goal of improving the generalization to test data. [sent-23, score-0.249]

10 2 Multi-Task Radial Basis Function Network Figure 1 schematizes the radial basis function (RBF) network structure customized to multitask learning. [sent-25, score-0.245]

11 The network consists of an input layer, a hidden layer, and an output layer. [sent-26, score-0.148]

12 The input layer receives a data point x = [x1 , · · · , xd ]T ∈ Rd and submits it to the hidden layer. [sent-27, score-0.135]

13 Each node at the hidden layer has a localized activation φn (x) = φ(||x − cn ||, σn ), n = 1, · · · , N , where || · || denotes the vector norm and φn (·) is a radial basis function (RBF) localized around cn with the degree of localization parameterized by σn . [sent-28, score-0.305]

14 The activations of all hidden nodes are weighted and sent to the output layer. [sent-30, score-0.147]

15 Each output node represents a unique task and has its own hidden-to-output weights. [sent-31, score-0.097]

16 The weighted activations of the hidden nodes are summed at each output node to produce the output for the associated task. [sent-32, score-0.208]

17 Denoting w_k = [w_{0k}, w_{1k}, · · · , w_{Nk}]^T as the weights connecting the hidden nodes to the k-th output node, the output for the k-th task, in response to input x, takes the form f_k(x) = w_k^T φ(x) (1), where φ(x) = [φ_0(x), φ_1(x), . [sent-33, score-0.811]

18 . . , φ_N(x)]^T is a column containing N + 1 basis functions, with φ_0(x) ≡ 1 a dummy basis accounting for the bias in Figure 1. [sent-36, score-0.168]
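
To make Eq. (1) concrete, here is a minimal sketch (not the authors' code) of the multi-task RBF forward pass with Gaussian basis functions like those used later in the experiments; the names centers, widths, and W are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of Eq. (1): all K tasks share the basis functions phi,
# but each task k has its own hidden-to-output weight column w_k.

def rbf_features(X, centers, widths):
    """Return [1, phi_1(x), ..., phi_N(x)] for each row x of X (Gaussian bases)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # (M, N) squared distances
    phi = np.exp(-d2 / (2.0 * widths ** 2))                          # localized activations
    return np.hstack([np.ones((X.shape[0], 1)), phi])                # prepend dummy basis phi_0 = 1

def multitask_outputs(X, centers, widths, W):
    """f_k(x) = w_k^T phi(x); W is (N+1, K), one weight column per task."""
    return rbf_features(X, centers, widths) @ W                      # (M, K) outputs
```

A single call produces the outputs of all K tasks for a batch of inputs, which is the structural point of Figure 1.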

19 Each task has its own hidden-to-output weights but all the tasks share the same hidden nodes. [sent-39, score-0.226]

20 The activation of hidden node n is characterized by a basis function φn (x) = φ(||x − cn ||, σn ). [sent-40, score-0.193]

21 3 Supervised Learning Suppose we have K tasks and the data set of the k-th task is Dk = {(x1k , y1k ), · · · , (xJk k , yJk k )}, where yik is the target (desired output) of xik . [sent-42, score-0.846]

22 By definition, a given data point xik is said to be supervised if the associated target yik is provided and unsupervised if yik is not provided. [sent-43, score-1.243]

23 For m = 1 : K, for n = 1 : J_m, for k = 1 : K, for i = 1 : J_k, compute φ^{nm}_{ik} = φ(||x_{nm} − x_{ik}||, σ). [sent-46, score-0.566]

24 2. Let N = 0, φ(·) = 1, e_0 = Σ_{k=1}^K [ Σ_{i=1}^{J_k} y_{ik}^2 − (J_k + ρ)^{-1} (Σ_{i=1}^{J_k} y_{ik})^2 ]; for k = 1 : K, compute A_k = J_k + ρ, w_k = (J_k + ρ)^{-1} Σ_{i=1}^{J_k} y_{ik}; 3. [sent-47, score-1.486]

25 For m = 1 : K, for n = 1 : J_m: if φ^{nm} is not marked as “deleted”, then for k = 1 : K compute c_k = Σ_{i=1}^{J_k} φ_{ik} φ^{nm}_{ik}, q_k = Σ_{i=1}^{J_k} (φ^{nm}_{ik})^2 + ρ − c_k^T A_k^{-1} c_k; if there exists a k such that q_k = 0, mark φ^{nm} as “deleted”; else, compute δe(φ, φ^{nm}) using (5). [sent-48, score-1.157]

26 For k = 1 : K, compute A_k^new and w_k^new respectively by (A-1) and (A-3) in the appendix; update A_k ← A_k^new, w_k ← w_k^new. [sent-56, score-0.522]

27 We are interested in learning the functions f_k(x) for the K tasks, based on ∪_{k=1}^K D_k. [sent-61, score-0.317]

28 The learning is based on minimizing the squared error e(φ, w) = Σ_{k=1}^K [ Σ_{i=1}^{J_k} (w_k^T φ_{ik} − y_{ik})^2 + ρ ||w_k||^2 ] (2), where φ_{ik} = φ(x_{ik}) for notational simplicity. [sent-62, score-0.688]

29 We now discuss how to determine the hidden layer (basis functions φ). [sent-65, score-0.123]

30 Substituting the solutions of the w's in (3) into (2) gives e(φ) = Σ_{k=1}^K Σ_{i=1}^{J_k} (y_{ik}^2 − y_{ik} w_k^T φ_{ik}) (4), where e(φ) is a function of φ only, because the w's are now functions of φ as given by (3). [sent-66, score-1.069]
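
Equation (3) itself is not among the extracted sentences, but (2) is a standard per-task ridge problem, so a sketch consistent with the later definition A_k = ρI + Σ_i φ_{ik} φ_{ik}^T would look roughly as follows; the variable names are assumptions, not the paper's notation.

```python
import numpy as np

# Sketch of the per-task regularized least-squares fit implied by Eq. (2),
# and of the compact error expression in Eq. (4). Not the authors' code.

def fit_task_weights(Phi_k, y_k, rho):
    """Phi_k: (J_k, N+1) basis responses for task k; y_k: (J_k,) targets."""
    A_k = rho * np.eye(Phi_k.shape[1]) + Phi_k.T @ Phi_k   # A_k = rho*I + sum_i phi_ik phi_ik^T
    w_k = np.linalg.solve(A_k, Phi_k.T @ y_k)              # ridge solution for task k
    return w_k, A_k

def task_error(Phi_k, y_k, w_k):
    """e_k(phi) = sum_i (y_ik^2 - y_ik * w_k^T phi_ik), as in Eq. (4)."""
    return float(np.sum(y_k ** 2 - y_k * (Phi_k @ w_k)))
```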

31 . . . , φ_N(x_{ik})]^T; this amounts to determining N, the number of basis functions, and the functional form of each basis function φ_n(·), n = 1, . [sent-71, score-0.146]

32 We learn the RBF network structure by selecting φ(·) from these candidate functions such that e(φ) in (4) is minimized. [sent-76, score-0.14]

33 By (A-2) in the Appendix, q_k^{-1} is a diagonal element of (A_k^new)^{-1}; therefore q_k is positive, and by (5) δe(φ, φ^{N+1}) > 0, which means adding φ^{N+1} to φ generally makes the squared error decrease. [sent-84, score-0.451]

34 By sequentially selecting basis functions that bring the maximum error reduction, we achieve the goal of minimizing e(φ). [sent-86, score-0.129]
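
The extracted sentences do not reproduce Eqs. (5)-(6) in full, so the following is only a brute-force sketch of the greedy loop in the spirit of Table 1: each round refits every task for every remaining candidate basis and keeps the candidate with the largest error reduction, instead of using the paper's rank-one updates. All names are illustrative.

```python
import numpy as np

def greedy_basis_selection(Phi_cand, Y_list, idx_list, rho, n_basis):
    """Phi_cand: (M, C) responses of all M pooled inputs to C candidate bases.
    Y_list[k]: targets of task k; idx_list[k]: that task's row indices into the pool."""
    ones = np.ones((Phi_cand.shape[0], 1))                  # dummy basis phi_0 = 1

    def total_error(cols):
        Phi = np.hstack([ones, Phi_cand[:, cols]]) if cols else ones
        err = 0.0
        for y_k, idx_k in zip(Y_list, idx_list):
            P = Phi[idx_k]
            A_k = rho * np.eye(P.shape[1]) + P.T @ P
            w_k = np.linalg.solve(A_k, P.T @ y_k)
            err += np.sum(y_k ** 2 - y_k * (P @ w_k))       # Eq. (4)
        return err

    chosen = []
    for _ in range(n_basis):
        remaining = [c for c in range(Phi_cand.shape[1]) if c not in chosen]
        best = min(remaining, key=lambda c: total_error(chosen + [c]))
        chosen.append(best)                                 # maximum error reduction
    return chosen
```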

35 In this section, we assume the data in Dk are initially unsupervised (only x is available without access to the associated y) and we select a subset from Dk to be supervised (targets acquired) such that the resulting network generalizes well to the remaining data in Dk . [sent-89, score-0.2]

36 We first learn the basis functions φ from the unsupervised data, and based on φ select data to be supervised. [sent-91, score-0.183]

37 Theorem 2: Let there be K tasks, and let the data set of the k-th task be D_k ∪ D̃_k, where D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k} and D̃_k = {(x_{ik}, y_{ik})}_{i=J_k+1}^{J_k+J̃_k}. [sent-93, score-1.02]

38 Let there be two multi-task RBF networks, whose output nodes are characterized by f_k(·) and f̃_k(·), respectively, for task k = 1, . [sent-94, score-0.691]

39 The two networks have the same given basis functions (hidden nodes) φ(·) = [1, φ1 (·), · · · , φN (·)]T , but different hidden-to-output weights. [sent-98, score-0.142]

40 The weights of f_k(·) are trained with D_k ∪ D̃_k, while the weights of f̃_k(·) are trained using D_k. [sent-99, score-0.668]

41 Then for k = 1, · · · , K, the squared errors committed on D̃_k by f_k(·) and f̃_k(·) are related by 0 ≤ [det Γ_k]^{-1} ≤ λ_{max,k}^{-1} ≤ [Σ_{D̃_k} (y_{ik} − f_k(x_{ik}))^2] / [Σ_{D̃_k} (y_{ik} − f̃_k(x_{ik}))^2] ≤ λ_{min,k}^{-1} ≤ 1 (7), where Γ_k = [I + Φ̃_k^T (ρ I + Φ_k Φ_k^T)^{-1} Φ̃_k]^2 with Φ_k = [φ(x_{1k}), . [sent-100, score-0.588]

42 Specializing Theorem 2 to the case J_k = 0, we have Corollary 1: Let there be K tasks, and let the data set of the k-th task be D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}. [sent-107, score-0.581]

43 Let the RBF network, whose output nodes are characterized by f_k(·) for task k = 1, . [sent-108, score-0.413]

44 , K, have given basis functions (hidden nodes) φ(·) = [1, φ1 (·), · · · , φN (·)]T and the hidden-to-output weights of task k be trained with Dk . [sent-111, score-0.187]

45 Then for k = 1, · · · , K, the squared error committed on D_k by f_k(·) is bounded as 0 ≤ [det Γ_k]^{-1} ≤ λ_{max,k}^{-1} ≤ [Σ_{i=1}^{J_k} y_{ik}^2]^{-1} Σ_{i=1}^{J_k} (y_{ik} − f_k(x_{ik}))^2 ≤ λ_{min,k}^{-1} ≤ 1, where Γ_k = [I + ρ^{-1} Φ_k^T Φ_k]^2 with Φ_k = [φ(x_{1,k}), . [sent-112, score-1.09]

46 It is evident from the properties of the matrix determinant [7] and the definition of Φ_k that det Γ_k = [det(ρI + Φ_k Φ_k^T)]^2 [det(ρ I)]^{-2} = [det(ρI + Σ_{i=1}^{J_k} φ_{ik} φ_{ik}^T)]^2 [det(ρ I)]^{-2}. [sent-116, score-0.177]

47 Using (3) we write succinctly det Γ_k = [det A_k]^2 [det(ρ I)]^{-2}. [sent-117, score-0.463]

48 We are interested in selecting the basis functions φ that minimize the error, before seeing the y's. [sent-118, score-0.132]

49 By Corollary 1 and the equation det Γ_k = [det A_k]^2 [det(ρ I)]^{-2}, the squared error is lower bounded by Σ_{i=1}^{J_k} y_{ik}^2 [det(ρ I)]^2 [det A_k]^{-2}. [sent-119, score-0.664]

50 As [det(ρ I)]^2 Σ_{i=1}^{J_k} y_{ik}^2 does not depend on φ, this amounts to selecting φ to minimize (det A_k)^{-2}. [sent-121, score-0.469]

51 To minimize the errors for all tasks k = 1, · · · , K, we select φ to minimize Π_{k=1}^K (det A_k)^{-2}. [sent-122, score-0.137]

52 Suppose we have selected basis functions φ = [1, φ_1, · · · , φ_N]^T. [sent-124, score-0.095]

53 Augmenting the basis functions to [φ^T, φ^{N+1}]^T, the A matrices change to A_k^new = Σ_{i=1}^{J_k} [φ_{ik}^T, φ_{ik}^{N+1}]^T [φ_{ik}^T, φ_{ik}^{N+1}] + ρ I_{(N+2)×(N+2)}. [sent-126, score-0.12]

54 Using the determinant formula for block matrices [7], we get Π_{k=1}^K (det A_k^new)^{-2} = Π_{k=1}^K (q_k det A_k)^{-2}, where q_k is the same as in (6). [sent-127, score-0.73]

55 As A_k does not depend on φ^{N+1}, the left-hand side is minimized by maximizing Π_{k=1}^K q_k^2. [sent-128, score-0.194]

56 The selection is easily implemented by making the following two minor modifications in Table 1: (a) in step 2, compute e_0 = Σ_{k=1}^K ln(J_k + ρ)^{-2}; (b) in step 3, compute δe(φ, φ^{nm}) = Σ_{k=1}^K ln q_k^2. [sent-129, score-0.209]
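
As a rough illustration of this unsupervised criterion (an assumption-laden sketch, not the paper's code): for each candidate basis, the score Σ_k ln q_k^2 can be computed from the inputs alone, since q_k involves only basis responses and A_k.

```python
import numpy as np

def unsupervised_candidate_score(Phi_list, new_cols, rho):
    """Phi_list[k]: (J_k, N+1) current basis responses of task k's inputs;
    new_cols[k]: (J_k,) responses of the candidate basis on the same inputs.
    Returns sum_k ln q_k^2; no targets y are needed. Names are illustrative."""
    score = 0.0
    for Phi_k, phi_new in zip(Phi_list, new_cols):
        A_k = rho * np.eye(Phi_k.shape[1]) + Phi_k.T @ Phi_k
        c_k = Phi_k.T @ phi_new
        q_k = phi_new @ phi_new + rho - c_k @ np.linalg.solve(A_k, c_k)
        if q_k <= 0.0:                    # numerically singular update: reject candidate
            return -np.inf
        score += np.log(q_k ** 2)
    return score
```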

57 Based on the basis functions φ determined above, we proceed to selecting data to be supervised and determining the hidden-to-output weights w from the supervised data using the equations in (3). [sent-131, score-0.283]

58 Corollary 2: Let there be K tasks, and let the data set of the k-th task be D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}. [sent-133, score-0.581]

59 Let there be two RBF networks, whose output nodes are characterized by f_k(·) and f_k^+(·), respectively, for task k = 1, . [sent-134, score-0.691]

60 The two networks have the same given basis functions φ(·) = [1, φ1 (·), · · · , φN (·)]T , but different hidden-to-output weights. [sent-138, score-0.142]

61 The weights of f_k(·) are trained with D_k, while the weights of f_k^+(·) are trained using D_k^+ = D_k ∪ {(x_{J_k+1,k}, y_{J_k+1,k})}. [sent-139, score-0.687]

62 Then for k = 1, · · · , K, the squared errors committed on (x_{J_k+1,k}, y_{J_k+1,k}) by f_k(·) and f_k^+(·) are related by [f_k^+(x_{J_k+1,k}) − y_{J_k+1,k}]^2 = γ^{-1}(x_{J_k+1,k}) [f_k(x_{J_k+1,k}) − y_{J_k+1,k}]^2, [sent-140, score-0.911]

63 where γ(x_{J_k+1,k}) = [1 + φ^T(x_{J_k+1,k}) A_k^{-1} φ(x_{J_k+1,k})]^2 ≥ 1 and A_k = ρI + Σ_{i=1}^{J_k} φ(x_{ik}) φ^T(x_{ik}) is the same as in (3). Two observations are made from Corollary 2. [sent-141, score-0.278]

64 Second, if γ(xi ) ≫ 1, seeing yJk +1,k greatly decrease the error on xJk +1,k , indicating xJk +1,k is significantly dissimilar (novel) to Dk and xJk +1,k must be supervised to reduce the error. [sent-143, score-0.095]

65 Suppose we have selected data D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}, from which we compute A_k. [sent-145, score-0.477]
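
The extracted sentences stop short of the explicit selection rule, so the following greedy loop is an assumption: it repeatedly supervises the point with the largest novelty γ(x) from Corollary 2 (the square in γ does not change the argmax) and then folds that point into A_k. Names are illustrative.

```python
import numpy as np

def active_select(Phi_pool, rho, n_select):
    """Phi_pool: (J, N+1) basis responses of one task's unsupervised points.
    Returns indices of the points chosen to be supervised, most novel first."""
    A_k = rho * np.eye(Phi_pool.shape[1])            # A_k before any point is supervised
    chosen = []
    for _ in range(n_select):
        A_inv = np.linalg.inv(A_k)
        gamma = (1.0 + np.einsum('ij,jk,ik->i', Phi_pool, A_inv, Phi_pool)) ** 2
        gamma[chosen] = -np.inf                      # never re-select a supervised point
        i = int(np.argmax(gamma))
        chosen.append(i)
        A_k += np.outer(Phi_pool[i], Phi_pool[i])    # update A_k with the new point
    return chosen
```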

66 5 Experimental Results In this section we compare the multi-task RBF network against single-task RBF networks via experimental studies. [sent-150, score-0.112]

67 In the first, which we call “one RBF network”, we let the K tasks share both basis functions φ (hidden nodes) and hidden-to output weights w, thus we do not distinguish the K tasks and design a single RBF network to learn a union of them. [sent-152, score-0.47]

68 The second is the multi-task RBF network, where the K tasks share the same φ but each has its own w. [sent-153, score-0.109]

69 We consider each school a task, leading to 139 tasks in total. [sent-163, score-0.124]

70 The multi-task RBF network is implemented as the structure as shown in Figure 1 and trained with the learning algorithm in Table 1. [sent-170, score-0.12]

71 The “one RBF network” is implemented as a special case of Figure 1, with a single output node and trained using the union of supervised data from all 139 schools. [sent-171, score-0.177]

72 We design 139 independent RBF networks, each of which is implemented with a single output node and trained using the supervised data from a single school. [sent-172, score-0.171]

73 We use the Gaussian RBF φ_n(x) = exp(−||x − c_n||^2 / (2σ_n^2)), where the c_n's are selected from training data points and the σ_n's are initialized to 20 and optimized as described in Table 1. [sent-173, score-0.096]

74 The generalization performance is measured by the squared error (fk (xik ) − yik )2 averaged over all test data xik of tasks k = 1, · · · , K. [sent-177, score-0.928]
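
A minimal sketch of this evaluation measure, assuming a predict(X, k) function like the forward pass sketched earlier (the function name is an assumption):

```python
import numpy as np

def average_test_error(predict, test_sets):
    """test_sets: list over tasks k of (X_test_k, y_test_k); returns the squared
    error averaged over all test points of all tasks."""
    sq_errs = [(predict(X, k) - y) ** 2 for k, (X, y) in enumerate(test_sets)]
    return float(np.concatenate(sq_errs).mean())
```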

75 We made 10 independent trials to randomly split the data into training and test sets and the squared error averaged over the test data of all the 139 schools and the trials are shown in Table 2, for the three types of RBF networks. [sent-178, score-0.287]

76 Table 2: Squared error averaged over the test data of all 139 schools and the 10 independent trials for randomly splitting the school data into training (75%) and testing (25%) sets. [sent-179, score-0.235]

77 Multi-task RBF network | Independent RBF networks | One RBF network (numeric entries garbled in extraction; only the fragments “109.” and “8093” survive). [sent-180, score-0.177]

78 Table 2 clearly shows the multi-task RBF network outperforms the other two types of RBF networks by a considerable margin. [sent-186, score-0.112]

79 The “one RBF network” ignores the difference between the tasks and the independent RBF networks ignore the tasks’ correlations, therefore they both perform inferiorly. [sent-187, score-0.167]

80 The multi-task RBF network uses the shared hidden nodes (basis functions) to capture the common internal representation of the tasks and meanwhile uses the independent hidden-to-output weights to learn the statistics specific to each task. [sent-188, score-0.321]

81 We use the method in Section 4 to actively split the data into training and test sets using a two-step procedure. [sent-190, score-0.12]

82 First we learn the basis functions φ of multi-task RBF network using all 15362 data (unsupervised). [sent-191, score-0.199]

83 Based on the φ, we then select the data to be supervised and use them as training data to learn the hidden-to-output weights w. [sent-192, score-0.193]

84 To make the results comparable, we use the same training data to learn the other two types of RBF networks (including learning their own φ and w). [sent-193, score-0.136]

85 Each curve is the squared error averaged over the test data of all 139 schools, as a function of number of training data. [sent-196, score-0.153]

86 It is clear that the multi-task RBF network maintains its superior performance all the way down to 5000 training data points, whereas the independent RBF networks have their performances degraded seriously as the training data diminish. [sent-197, score-0.226]

87 The data are split into training and test sets via active learning. [sent-201, score-0.111]

88 6 Conclusions We have presented the structure and learning algorithms for multi-task learning with the radial basis function (RBF) network. [sent-202, score-0.175]

89 By letting multiple tasks share the basis functions (hidden nodes) we impose a common internal representation for correlated tasks. [sent-203, score-0.218]

90 Unsupervised learning of the network structure enables us to actively split the data into training and test sets. [sent-205, score-0.216]

91 As the data novel to the previously selected ones are selected next, what finally remain unselected and to be tested are all similar to the selected data which constitutes the training set. [sent-206, score-0.131]

92 This improves the generalization of the resulting network to the test data. [sent-207, score-0.1]

93 Grant (1991), Orthogonal least squares learning algorithm for radial basis function networks, IEEE Transactions on Neural Networks, Vol. [sent-233, score-0.144]

94 By (3), the A matrices corresponding to φ^new are A_k^new = Σ_{i=1}^{J_k} [φ_{ik}^T, φ_{ik}^{N+1}]^T [φ_{ik}^T, φ_{ik}^{N+1}] + ρ I_{(N+2)×(N+2)} = [A_k, c_k; c_k^T, d_k] (A-1), where c_k and d_k are as in (6). [sent-249, score-0.715]

95 Using the block matrix inversion formula [7] we get (A_k^new)^{-1} = [A_k^{-1} + A_k^{-1} c_k q_k^{-1} c_k^T A_k^{-1}, −A_k^{-1} c_k q_k^{-1}; −q_k^{-1} c_k^T A_k^{-1}, q_k^{-1}] (A-2), where q_k is as in (6). [sent-251, score-0.938]
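
A quick numerical sanity check (not from the paper) of the block-matrix identities used in (A-1)-(A-2), under the reading that q_k is the Schur complement from (6); all values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(8, 4)); phi_new = rng.normal(size=8); rho = 0.5
A = rho * np.eye(4) + Phi.T @ Phi                 # current A_k
c = Phi.T @ phi_new                               # c_k as in (6)
d = phi_new @ phi_new + rho                       # d_k as in (6)
q = d - c @ np.linalg.solve(A, c)                 # Schur complement q_k
A_new = np.block([[A, c[:, None]], [c[None, :], np.array([[d]])]])
assert np.isclose(np.linalg.det(A_new), q * np.linalg.det(A))   # det A_new = q_k det A_k
assert np.isclose(np.linalg.inv(A_new)[-1, -1], 1.0 / q)        # corner of (A-2) is q_k^{-1}
```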

96 By (3), the weights w_k^new corresponding to [φ^T, φ^{N+1}]^T are w_k^new = (A_k^new)^{-1} [Σ_{i=1}^{J_k} y_{ik} φ_{ik}; Σ_{i=1}^{J_k} y_{ik} φ_{ik}^{N+1}]. (A-3) [sent-252, score-1.079]

97 Substituting (A-2) and (A-3) into (4) gives e(φ^new) = Σ_{k=1}^K Σ_{i=1}^{J_k} (y_{ik}^2 − y_{ik} (φ_{ik}^new)^T w_k^new) = e(φ) − δe(φ, φ^{N+1}); using (3) and (4) and the definition of g_k, this yields the error-reduction formula (5). Proof of Theorem 2: The proof applies to k = 1, · · · , K. [sent-255, score-2.736]

98 Hence, f̃_k − y_k = [I + Φ̃_k^T A_k^{-1} Φ̃_k] (f_k − y_k), which gives Σ_i (y_{ik} − f_k(x_{ik}))^2 = (f_k − y_k)^T (f_k − y_k) = (f̃_k − y_k)^T Γ_k^{-1} (f̃_k − y_k) (A-4), where Γ_k = [I + Φ̃_k^T A_k^{-1} Φ̃_k]^2 = [I + Φ̃_k^T (ρ I + Φ_k Φ_k^T)^{-1} Φ̃_k]^2. [sent-276, score-2.182]

99 Using this expansion of Γ_k in (A-4) we get Σ_i (f_k(x_{ik}) − y_{ik})^2 = (f̃_k − y_k)^T E_k^T diag[σ_{1k}^{-1}, . [sent-280, score-0.849]

100 . . , σ_{J̃_k k}^{-1}] E_k (f̃_k − y_k) (A-5) ≤ λ_{min,k}^{-1} (f̃_k − y_k)^T E_k^T E_k (f̃_k − y_k) = λ_{min,k}^{-1} Σ_i (f̃_k(x_{ik}) − y_{ik})^2, where the inequality results because λ_{min,k} = min(λ_{1,k}, · · · , λ_{J̃_k,k}). [sent-283, score-1.686]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('yik', 0.439), ('jk', 0.338), ('ik', 0.301), ('xjk', 0.287), ('fk', 0.278), ('xik', 0.265), ('rbf', 0.257), ('dk', 0.216), ('qk', 0.194), ('wk', 0.169), ('det', 0.162), ('yk', 0.132), ('ak', 0.127), ('yjk', 0.116), ('anew', 0.098), ('tasks', 0.09), ('ck', 0.074), ('basis', 0.073), ('nm', 0.072), ('network', 0.065), ('schools', 0.061), ('gk', 0.058), ('ct', 0.057), ('radial', 0.054), ('supervised', 0.054), ('deleted', 0.053), ('layer', 0.052), ('hidden', 0.049), ('networks', 0.047), ('nodes', 0.046), ('squared', 0.045), ('students', 0.039), ('multitask', 0.039), ('exam', 0.037), ('jm', 0.037), ('task', 0.036), ('output', 0.034), ('school', 0.034), ('training', 0.033), ('weights', 0.032), ('committed', 0.032), ('unsupervised', 0.03), ('actively', 0.03), ('theorem', 0.03), ('corollary', 0.029), ('ek', 0.027), ('node', 0.027), ('cn', 0.025), ('matrices', 0.025), ('durham', 0.024), ('transferability', 0.024), ('xnm', 0.024), ('trained', 0.024), ('learn', 0.023), ('seeing', 0.023), ('band', 0.023), ('union', 0.022), ('functions', 0.022), ('selected', 0.022), ('duke', 0.021), ('gender', 0.021), ('specified', 0.021), ('test', 0.021), ('active', 0.021), ('en', 0.02), ('averaged', 0.02), ('split', 0.02), ('ece', 0.019), ('characterized', 0.019), ('ing', 0.019), ('marked', 0.019), ('share', 0.019), ('select', 0.019), ('appendix', 0.018), ('xd', 0.018), ('activations', 0.018), ('vr', 0.018), ('targets', 0.018), ('error', 0.018), ('categories', 0.018), ('table', 0.017), ('inequality', 0.017), ('learning', 0.017), ('data', 0.016), ('nc', 0.016), ('student', 0.016), ('independent', 0.016), ('selecting', 0.016), ('respectively', 0.015), ('go', 0.015), ('proof', 0.015), ('arg', 0.015), ('determinant', 0.015), ('selection', 0.015), ('ignores', 0.014), ('formula', 0.014), ('generalization', 0.014), ('minimize', 0.014), ('structure', 0.014), ('correlated', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 161 nips-2005-Radial Basis Function Network for Multi-task Learning

Author: Xuejun Liao, Lawrence Carin

Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1

2 0.15115528 184 nips-2005-Structured Prediction via the Extragradient Method

Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan

Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1

3 0.1137518 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

Author: Maxim Raginsky, Svetlana Lazebnik

Abstract: We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.

4 0.079150319 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression

Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong

Abstract: In this paper, we aim at analyzing the characteristic of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation of the membrane potential density function with synaptic depression, and obtain the formulas for analytic computing the response of instantaneous re rate. Through a technical analysis, we arrive at several signi cant conclusions: The background inputs play an important role in information processing and act as a switch betwee temporal integration and coincidence detection. the role of synapses can be regarded as a spatio-temporal lter; it is important in neural information processing for the spatial distribution of synapses and the spatial and temporal relation of inputs. The instantaneous input frequency can affect the response amplitude and phase delay. 1

5 0.076633886 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

Author: Frank Wood, Stefan Roth, Michael J. Black

Abstract: Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. We evaluate the method using real data recorded from a population of motor cortical neurons. In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. These results suggest that our model captures correlations in the firing activity. Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. 1

6 0.06897369 195 nips-2005-Transfer learning for text classification

7 0.068962246 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

8 0.053363621 50 nips-2005-Convex Neural Networks

9 0.052565731 80 nips-2005-Gaussian Process Dynamical Models

10 0.050036289 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

11 0.04372634 100 nips-2005-Interpolating between types and tokens by estimating power-law generators

12 0.040254802 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

13 0.03907264 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity

14 0.039068609 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

15 0.03864637 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

16 0.03704061 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

17 0.03626205 153 nips-2005-Policy-Gradient Methods for Planning

18 0.034943983 117 nips-2005-Learning from Data of Variable Quality

19 0.034733269 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

20 0.034552075 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.119), (1, 0.016), (2, -0.02), (3, -0.014), (4, 0.028), (5, -0.016), (6, 0.002), (7, 0.034), (8, 0.041), (9, -0.019), (10, 0.007), (11, -0.026), (12, 0.006), (13, -0.061), (14, -0.05), (15, 0.076), (16, 0.028), (17, 0.041), (18, 0.055), (19, -0.0), (20, 0.018), (21, 0.003), (22, 0.015), (23, 0.153), (24, -0.044), (25, 0.133), (26, -0.108), (27, 0.177), (28, 0.242), (29, -0.241), (30, 0.173), (31, 0.029), (32, 0.192), (33, -0.192), (34, 0.082), (35, 0.016), (36, 0.021), (37, -0.105), (38, -0.061), (39, -0.004), (40, 0.035), (41, -0.032), (42, 0.147), (43, -0.008), (44, -0.033), (45, -0.186), (46, 0.115), (47, -0.006), (48, 0.113), (49, -0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96589494 161 nips-2005-Radial Basis Function Network for Multi-task Learning

Author: Xuejun Liao, Lawrence Carin

Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1

2 0.60221857 184 nips-2005-Structured Prediction via the Extragradient Method

Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan

Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1

3 0.5675661 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression

Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong

Abstract: In this paper, we aim at analyzing the characteristic of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation of the membrane potential density function with synaptic depression, and obtain the formulas for analytic computing the response of instantaneous re rate. Through a technical analysis, we arrive at several signi cant conclusions: The background inputs play an important role in information processing and act as a switch betwee temporal integration and coincidence detection. the role of synapses can be regarded as a spatio-temporal lter; it is important in neural information processing for the spatial distribution of synapses and the spatial and temporal relation of inputs. The instantaneous input frequency can affect the response amplitude and phase delay. 1

4 0.46227023 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

Author: Maxim Raginsky, Svetlana Lazebnik

Abstract: We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.

5 0.3376416 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

Author: Gilles Blanchard, Masashi Sugiyama, Motoaki Kawanabe, Vladimir Spokoiny, Klaus-Robert Müller

Abstract: We propose a new linear method for dimension reduction to identify nonGaussian components in high dimensional data. Our method, NGCA (non-Gaussian component analysis), uses a very general semi-parametric framework. In contrast to existing projection methods we define what is uninteresting (Gaussian): by projecting out uninterestingness, we can estimate the relevant non-Gaussian subspace. We show that the estimation error of finding the non-Gaussian components tends to zero at a parametric rate. Once NGCA components are identified and extracted, various tasks can be applied in the data analysis process, like data visualization, clustering, denoising or classification. A numerical study demonstrates the usefulness of our method. 1

6 0.28870961 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

7 0.25526211 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

8 0.25159985 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

9 0.23321712 195 nips-2005-Transfer learning for text classification

10 0.23178494 100 nips-2005-Interpolating between types and tokens by estimating power-law generators

11 0.22204982 117 nips-2005-Learning from Data of Variable Quality

12 0.21736975 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

13 0.21438599 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

14 0.21425599 6 nips-2005-A Connectionist Model for Constructive Modal Reasoning

15 0.21185325 108 nips-2005-Layered Dynamic Textures

16 0.21161069 50 nips-2005-Convex Neural Networks

17 0.20719445 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects

18 0.19993331 19 nips-2005-Active Learning for Misspecified Models

19 0.19545265 37 nips-2005-Benchmarking Non-Parametric Statistical Tests

20 0.19490156 159 nips-2005-Q-Clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.049), (5, 0.369), (10, 0.049), (27, 0.026), (31, 0.031), (34, 0.077), (55, 0.026), (65, 0.01), (69, 0.056), (73, 0.039), (88, 0.125), (91, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7859121 161 nips-2005-Radial Basis Function Network for Multi-task Learning

Author: Xuejun Liao, Lawrence Carin

Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1

2 0.78107309 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence

Author: Odelia Schwartz, Peter Dayan, Terrence J. Sejnowski

Abstract: The misjudgement of tilt in images lies at the heart of entertaining visual illusions and rigorous perceptual psychophysics. A wealth of findings has attracted many mechanistic models, but few clear computational principles. We adopt a Bayesian approach to perceptual tilt estimation, showing how a smoothness prior offers a powerful way of addressing much confusing data. In particular, we faithfully model recent results showing that confidence in estimation can be systematically affected by the same aspects of images that affect bias. Confidence is central to Bayesian modeling approaches, and is applicable in many other perceptual domains. Perceptual anomalies and illusions, such as the misjudgements of motion and tilt evident in so many psychophysical experiments, have intrigued researchers for decades.1–3 A Bayesian view4–8 has been particularly influential in models of motion processing, treating such anomalies as the normative product of prior information (often statistically codifying Gestalt laws) with likelihood information from the actual scenes presented. Here, we expand the range of statistically normative accounts to tilt estimation, for which there are classes of results (on estimation confidence) that are so far not available for motion. The tilt illusion arises when the perceived tilt of a center target is misjudged (ie bias) in the presence of flankers. Another phenomenon, called Crowding, refers to a loss in the confidence (ie sensitivity) of perceived target tilt in the presence of flankers. Attempts have been made to formalize these phenomena quantitatively. Crowding has been modeled as compulsory feature pooling (ie averaging of orientations), ignoring spatial positions.9, 10 The tilt illusion has been explained by lateral interactions11, 12 in populations of orientationtuned units; and by calibration.13 However, most models of this form cannot explain a number of crucial aspects of the data. First, the geometry of the positional arrangement of the stimuli affects attraction versus repulsion in bias, as emphasized by Kapadia et al14 (figure 1A), and others.15, 16 Second, Solomon et al. recently measured bias and sensitivity simultaneously.11 The rich and surprising range of sensitivities, far from flat as a function of flanker angles (figure 1B), are outside the reach of standard models. Moreover, current explanations do not offer a computational account of tilt perception as the outcome of a normative inference process. Here, we demonstrate that a Bayesian framework for orientation estimation, with a prior favoring smoothness, can naturally explain a range of seemingly puzzling tilt data. We explicitly consider both the geometry of the stimuli, and the issue of confidence in the esti- 6 5 4 3 2 1 0 -1 -2 (B) Attraction Repulsion Sensititvity (1/deg) Bias (deg) (A) 0.6 0.5 0.4 0.3 0.2 0.1 -80 -60 -40 -20 0 20 40 60 80 Flanker tilt (deg) Figure 1: Tilt biases and sensitivities in visual perception. (A) Kapadia et al demonstrated the importance of geometry on tilt bias, with bar stimuli in the fovea (and similar results in the periphery). When 5 degrees clockwise flankers are arranged colinearly, the center target appears attracted in the direction of the flankers; when flankers are lateral, the target appears repulsed. Data are an average of 5 subjects.14 (B) Solomon et al measured both biases and sensitivities for gratings in the visual periphery.11 On the top are example stimuli, with flankers tilted 22.5 degrees clockwise. 
This constitutes the classic tilt illusion, with a repulsive bias percept. In addition, sensitivities vary as a function of flanker angles, in a systematic way (even in cases when there are no biases at all). Sensitivities are given in units of the inverse of standard deviation of the tilt estimate. More detailed data for both experiments are shown in the results section. mation. Bayesian analyses have most frequently been applied to bias. Much less attention has been paid to the equally important phenomenon of sensitivity. This aspect of our model should be applicable to other perceptual domains. In section 1 we formulate the Bayesian model. The prior is determined by the principle of creating a smooth contour between the target and flankers. We describe how to extract the bias and sensitivity. In section 2 we show experimental data of Kapadia et al and Solomon et al, alongside the model simulations, and demonstrate that the model can account for both geometry, and bias and sensitivity measurements in the data. Our results suggest a more unified, rational, approach to understanding tilt perception. 1 Bayesian model Under our Bayesian model, inference is controlled by the posterior distribution over the tilt of the target element. This comes from the combination of a prior favoring smooth configurations of the flankers and target, and the likelihood associated with the actual scene. A complete distribution would consider all possible angles and relative spatial positions of the bars, and marginalize the posterior over all but the tilt of the central element. For simplicity, we make two benign approximations: conditionalizing over (ie clamping) the angles of the flankers, and exploring only a small neighborhood of their positions. We now describe the steps of inference. Smoothness prior: Under these approximations, we consider a given actual configuration (see fig 2A) of flankers f1 = (φ1 , x1 ), f2 = (φ2 , x2 ) and center target c = (φc , xc ), arranged from top to bottom. We have to generate a prior over φc and δ1 = x1 − xc and δ2 = x2 − xc based on the principle of smoothness. As a less benign approximation, we do this in two stages: articulating a principle that determines a single optimal configuration; and generating a prior as a mixture of a Gaussian about this optimum and a uniform distribution, with the mixing proportion of the latter being determined by the smoothness of the optimum. Smoothness has been extensively studied in the computer vision literature.17–20 One widely (B) (C) f1 f1 β1 R Probability max smooth Max smooth target (deg) (A) 40 20 0 -20 c δ1 c -40 Φc f2 f2 1 0.8 0.6 0.4 0.2 0 -80 -60 -40 -20 0 20 40 Flanker tilt (deg) 60 80 -80 -60 -40 20 0 20 40 Flanker tilt (deg) 60 80 Figure 2: Geometry and smoothness for flankers, f1 and f2 , and center target, c. (A) Example actual configuration of flankers and target, aligned along the y axis from top to bottom. (B) The elastica procedure can rotate the target angle (to Φc ) and shift the relative flanker and target positions on the x axis (to δ1 and δ2 ) in its search for the maximally smooth solution. Small spatial shifts (up to 1/15 the size of R) of positions are allowed, but positional shift is overemphasized in the figure for visibility. (C) Top: center tilt that results in maximal smoothness, as a function of flanker tilt. Boxed cartoons show examples for given flanker tilts, of the optimally smooth configuration. 
Note attraction of target towards flankers for small flanker angles; here flankers and target are positioned in a nearly colinear arrangement. Note also repulsion of target away from flankers for intermediate flanker angles. Bottom: P [c, f1 , f2 ] for center tilt that yields maximal smoothness. The y axis is normalized between 0 and 1. used principle, elastica, known even to Euler, has been applied to contour completion21 and other computer vision applications.17 The basic idea is to find the curve with minimum energy (ie, square of curvature). Sharon et al19 showed that the elastica function can be well approximated by a number of simpler forms. We adopt a version that Leung and Malik18 adopted from Sharon et al.19 We assume that the probability for completing a smooth curve, can be factorized into two terms: P [c, f1 , f2 ] = G(c, f1 )G(c, f2 ) (1) with the term G(c, f1 ) (and similarly, G(c, f2 )) written as: R Dβ 2 2 Dβ = β1 + βc − β1 βc (2) − ) where σR σβ and β1 (and similarly, βc ) is the angle between the orientation at f1 , and the line joining f1 and c. The distance between the centers of f1 and c is given by R. The two constants, σβ and σR , control the relative contribution to smoothness of the angle versus the spatial distance. Here, we set σβ = 1, and σR = 1.5. Figure 2B illustrates an example geometry, in which φc , δ1 , and δ2 , have been shifted from the actual scene (of figure 2A). G(c, f1 ) = exp(− We now estimate the smoothest solution for given configurations. Figure 2C shows for given flanker tilts, the center tilt that yields maximal smoothness, and the corresponding probability of smoothness. For near vertical flankers, the spatial lability leads to very weak attraction and high probability of smoothness. As the flanker angle deviates farther from vertical, there is a large repulsion, but also lower probability of smoothness. These observations are key to our model: the maximally smooth center tilt will influence attractive and repulsive interactions of tilt estimation; the probability of smoothness will influence the relative weighting of the prior versus the likelihood. From the smoothness principle, we construct a two dimensional prior (figure 3A). One dimension represents tilt, the other dimension, the overall positional shift between target (B) Likelihood (D) Marginalized Posterior (C) Posterior 20 0.03 10 -10 -20 0 Probability 0 10 Angle Angle Angle 10 0 -10 -20 0.01 -10 -20 0.02 0 -0. 2 0 Position 0.2 (E) Psychometric function 20 -0. 2 0 0.2 -0. 2 0 0.2 Position Position -10 -5 0 Angle 5 10 Probability clockwise (A) Prior 20 1 0.8 0.6 0.4 0.2 0 -20 -10 0 10 20 Target angle (deg) Counter-clockwise Clockwise Figure 3: Bayes model for example flankers and target. (A) Prior 2D distribution for flankers set at 22.5 degrees (note repulsive preference for -5.5 degrees). (B) Likelihood 2D distribution for a target tilt of 3 degrees; (C) Posterior 2D distribution. All 2D distributions are drawn on the same grayscale range, and the presence of a larger baseline in the prior causes it to appear more dimmed. (D) Marginalized posterior, resulting in 1D distribution over tilt. Dashed line represents the mean, with slight preference for negative angle. (E) For this target tilt, we calculate probability clockwise, and obtain one point on psychometric curve. and flankers (called ’position’). 
The prior is a 2D Gaussian distribution, sat upon a constant baseline.22 The Gaussian is centered at the estimated smoothest target angle and relative position, and the baseline is determined by the probability of smoothness. The baseline, and its dependence on the flanker orientation, is a key difference from Weiss et al’s Gaussian prior for smooth, slow motion. It can be seen as a mechanism to allow segmentation (see Posterior description below). The standard deviation of the Gaussian is a free parameter. Likelihood: The likelihood over tilt and position (figure 3B) is determined by a 2D Gaussian distribution with an added baseline.22 The Gaussian is centered at the actual target tilt; and at a position taken as zero, since this is the actual position, to which the prior is compared. The standard deviation and baseline constant are free parameters. Posterior and marginalization: The posterior comes from multiplying likelihood and prior (figure 3C) and then marginalizing over position to obtain a 1D distribution over tilt. Figure 3D shows an example in which this distribution is bimodal. Other likelihoods, with closer agreement between target and smooth prior, give unimodal distributions. Note that the bimodality is a direct consequence of having an added baseline to the prior and likelihood (if these were Gaussian without a baseline, the posterior would always be Gaussian). The viewer is effectively assessing whether the target is associated with the same object as the flankers, and this is reflected in the baseline, and consequently, in the bimodality, and confidence estimate. We define α as the mean angle of the 1D posterior distribution (eg, value of dashed line on the x axis), and β as the height of the probability distribution at that mean angle (eg, height of dashed line). The term β is an indication of confidence in the angle estimate, where for larger values we are more certain of the estimate. Decision of probability clockwise: The probability of a clockwise tilt is estimated from the marginalized posterior: 1 P = 1 + exp (3) −α.∗k − log(β+η) where α and β are defined as above, k is a free parameter and η a small constant. Free parameters are set to a single constant value for all flanker and center configurations. Weiss et al use a similar compressive nonlinearity, but without the term β. We also tried a decision function that integrates the posterior, but the resulting curves were far from the sigmoidal nature of the data. Bias and sensitivity: For one target tilt, we generate a single probability and therefore a single point on the psychometric function relating tilt to the probability of choosing clockwise. We generate the full psychometric curve from all target tilts and fit to it a cumulative 60 40 20 -5 0 5 Target tilt (deg) 10 80 60 40 20 0 -10 (C) Data -5 0 5 Target tilt (deg) 10 80 60 40 20 0 -10 (D) Model 100 100 100 80 0 -10 Model Frequency responding clockwise (B) Data Frequency responding clockwise Frequency responding clockwise Frequency responding clockwise (A) 100 -5 0 5 Target tilt (deg) 10 80 60 40 20 0 -10 -5 0 5 10 Target tilt (deg) Figure 4: Kapadia et al data,14 versus Bayesian model. Solid lines are fits to a cumulative Gaussian distribution. (A) Flankers are tilted 5 degrees clockwise (black curve) or anti-clockwise (gray) of vertical, and positioned spatially in a colinear arrangement. The center bar appears tilted in the direction of the flankers (attraction), as can be seen by the attractive shift of the psychometric curve. 
The boxed stimuli cartoon illustrates a vertical target amidst the flankers. (B) Model for colinear bars also produces attraction. (C) Data and (D) model for lateral flankers results in repulsion. All data are collected in the fovea for bars. Gaussian distribution N (µ, σ) (figure 3E). The mean µ of the fit corresponds to the bias, 1 and σ to the sensitivity, or confidence in the bias. The fit to a cumulative Gaussian and extraction of these parameters exactly mimic psychophysical procedures.11 2 Results: data versus model We first consider the geometry of the center and flanker configurations, modeling the full psychometric curve for colinear and parallel flanks (recall that figure 1A showed summary biases). Figure 4A;B demonstrates attraction in the data and model; that is, the psychometric curve is shifted towards the flanker, because of the nature of smooth completions for colinear flankers. Figure 4C;D shows repulsion in the data and model. In this case, the flankers are arranged laterally instead of colinearly. The smoothest solution in the model arises by shifting the target estimate away from the flankers. This shift is rather minor, because the configuration has a low probability of smoothness (similar to figure 2C), and thus the prior exerts only a weak effect. The above results show examples of changes in the psychometric curve, but do not address both bias and, particularly, sensitivity, across a whole range of flanker configurations. Figure 5 depicts biases and sensitivity from Solomon et al, versus the Bayes model. The data are shown for a representative subject, but the qualitative behavior is consistent across all subjects tested. In figure 5A, bias is shown, for the condition that both flankers are tilted at the same angle. The data exhibit small attraction at near vertical flanker angles (this arrangement is close to colinear); large repulsion at intermediate flanker angles of 22.5 and 45 degrees from vertical; and minimal repulsion at large angles from vertical. This behavior is also exhibited in the Bayes model (Figure 5B). For intermediate flanker angles, the smoothest solution in the model is repulsive, and the effect of the prior is strong enough to induce a significant repulsion. For large angles, the prior exerts almost no effect. Interestingly, sensitivity is far from flat in both data and model. In the data (Figure 5C), there is most loss in sensitivity at intermediate flanker angles of 22.5 and 45 degrees (ie, the subject is less certain); and sensitivity is higher for near vertical or near horizontal flankers. The model shows the same qualitative behavior (Figure 5D). In the model, there are two factors driving sensitivity: one is the probability of completing a smooth curvature for a given flanker configuration, as in Figure 2B; this determines the strength of the prior. 
The other factor is certainty in a particular center estimation; this is determined by β, derived from the posterior distribution, and incorporated into the decision stage of the model Data 5 0 -60 -40 -80 -60 -40 -20 0 20 40 Flanker tilt (deg) -20 0 20 40 Flanker tilt (deg) 60 60 80 -60 -40 0.6 0.5 0.4 0.3 0.2 0.1 -20 0 20 40 Flanker tilt (deg) 60 80 60 80 -80 -60 -40 -20 0 20 40 Flanker tilt (deg) -80 -60 -40 -20 0 20 40 Flanker tilt (deg) 60 80 -20 0 20 40 Flanker tilt (deg) 60 80 (F) Bias (deg) 10 5 0 -5 0.6 0.5 0.4 0.3 0.2 0.1 -80 (D) 10 -10 -10 80 Sensitivity (1/deg) -80 5 0 -5 -80 -80 -60 -60 -40 -40 -20 0 20 40 Flanker tilt (deg) -20 0 20 40 Flanker tilt (deg) 60 -10 80 (H) 60 80 Sensitivity (1/deg) Sensititvity (1/deg) Bias (deg) 0.6 0.5 0.4 0.3 0.2 0.1 (G) Sensititvity (1/deg) 0 -5 (C) (E) 5 -5 -10 Model (B) 10 Bias (deg) Bias (deg) (A) 10 0.6 0.5 0.4 0.3 0.2 0.1 -80 -60 -40 Figure 5: Solomon et al data11 (subject FF), versus Bayesian model. (A) Data and (B) model biases with same-tilted flankers; (C) Data and (D) model sensitivities with same-tilted flankers; (E;G) data and (F;H) model as above, but for opposite-tilted flankers (note that opposite-tilted data was collected for less flanker angles). Each point in the figure is derived by fitting a cummulative Gaussian distribution N (µ, σ) to corresponding psychometric curve, and setting bias 1 equal to µ and sensitivity to σ . In all experiments, flanker and target gratings are presented in the visual periphery. Both data and model stimuli are averages of two configurations, on the left hand side (9 O’clock position) and right hand side (3 O’clock position). The configurations are similar to Figure 1 (B), but slightly shifted according to an iso-eccentric circle, so that all stimuli are similarly visible in the periphery. (equation 3). For flankers that are far from vertical, the prior has minimal effect because one cannot find a smooth solution (eg, the likelihood dominates), and thus sensitivity is higher. The low sensitivity at intermediate angles arises because the prior has considerable effect; and there is conflict between the prior (tilt, position), and likelihood (tilt, position). This leads to uncertainty in the target angle estimation . For flankers near vertical, the prior exerts a strong effect; but there is less conflict between the likelihood and prior estimates (tilt, position) for a vertical target. This leads to more confidence in the posterior estimate, and therefore, higher sensitivity. The only aspect that our model does not reproduce is the (more subtle) sensitivity difference between 0 and +/- 5 degree flankers. Figure 5E-H depict data and model for opposite tilted flankers. The bias is now close to zero in the data (Figure 5E) and model (Figure 5F), as would be expected (since the maximally smooth angle is now always roughly vertical). Perhaps more surprisingly, the sensitivities continue to to be non-flat in the data (Figure 5G) and model (Figure 5H). This behavior arises in the model due to the strength of prior, and positional uncertainty. As before, there is most loss in sensitivity at intermediate angles. Note that to fit Kapadia et al, simulations used a constant parameter of k = 9 in equation 3, whereas for the Solomon et al. simulations, k = 2.5. This indicates that, in our model, there was higher confidence in the foveal experiments than in the peripheral ones. 
3 Discussion We applied a Bayesian framework to the widely studied tilt illusion, and demonstrated the model on examples from two different data sets involving foveal and peripheral estimation. Our results support the appealing hypothesis that perceptual misjudgements are not a consequence of poor system design, but rather can be described as optimal inference.4–8 Our model accounts correctly for both attraction and repulsion, determined by the smoothness prior and the geometry of the scene. We emphasized the issue of estimation confidence. The dataset showing how confidence is affected by the same issues that affect bias,11 was exactly appropriate for a Bayesian formulation; other models in the literature typically do not incorporate confidence in a thoroughly probabilistic manner. In fact, our model fits the confidence (and bias) data more proficiently than an account based on lateral interactions among a population of orientationtuned cells.11 Other Bayesian work, by Stocker et al,6 utilized the full slope of the psychometric curve in fitting a prior and likelihood to motion data, but did not examine the issue of confidence. Estimation confidence plays a central role in Bayesian formulations as a whole. Understanding how priors affect confidence should have direct bearing on many other Bayesian calculations such as multimodal integration.23 Our model is obviously over-simplified in a number of ways. First, we described it in terms of tilts and spatial positions; a more complete version should work in the pixel/filtering domain.18, 19 We have also only considered two flanking elements; the model is extendible to a full-field surround, whereby smoothness operates along a range of geometric directions, and some directions are more (smoothly) dominant than others. Second, the prior is constructed by summarizing the maximal smoothness information; a more probabilistically correct version should capture the full probability of smoothness in its prior. Third, our model does not incorporate a formal noise representation; however, sensitivities could be influenced both by stimulus-driven noise and confidence. Fourth, our model does not address attraction in the so-called indirect tilt illusion, thought to be mediated by a different mechanism. Finally, we have yet to account for neurophysiological data within this framework, and incorporate constraints at the neural implementation level. However, versions of our computations are oft suggested for intra-areal and feedback cortical circuits; and smoothness principles form a key part of the association field connection scheme in Li’s24 dynamical model of contour integration in V1. Our model is connected to a wealth of literature in computer vision and perception. Notably, occlusion and contour completion might be seen as the extreme example in which there is no likelihood information at all for the center target; a host of papers have shown that under these circumstances, smoothness principles such as elastica and variants explain many aspects of perception. The model is also associated with many studies on contour integration motivated by Gestalt principles;25, 26 and exploration of natural scene statistics and Gestalt,27, 28 including the relation to contour grouping within a Bayesian framework.29, 30 Indeed, our model could be modified to include a prior from natural scenes. There are various directions for the experimental test and refinement of our model. 
There are various directions for the experimental test and refinement of our model. Most pressing is to determine bias and sensitivity for different center and flanker contrasts. As in the case of motion, our model predicts that when there is more uncertainty about the center element, prior information becomes more dominant. Another interesting test would be to design a task in which the center element is actually part of a different figure and unrelated to the flankers; our framework predicts that there would be minimal bias, because of segmentation. Our model should also be applied to other tilt-based illusions, such as the Fraser spiral and the Zöllner illusion. Finally, our model can be applied to other perceptual domains [31]; and, given the apparent similarities between the tilt illusion and the tilt after-effect, we plan to extend the model to adaptation by considering smoothness in time as well as space.

Acknowledgements

This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Serge Belongie, Leanne Chukoskie, Philip Meier and Joshua Solomon for helpful discussions.

References

[1] J J Gibson. Adaptation, after-effect, and contrast in the perception of tilted lines. Journal of Experimental Psychology, 20:553–569, 1937.
[2] C Blakemore, R H S Carpenter, and M A Georgeson. Lateral inhibition between orientation detectors in the human visual system. Nature, 228:37–39, 1970.
[3] J A Stuart and H M Burian. A study of separation difficulty: Its relationship to visual acuity in normal and amblyopic eyes. American Journal of Ophthalmology, 53:471–477, 1962.
[4] A Yuille and H H Bulthoff. Perception as Bayesian inference. In Knill and Whitman, editors, Bayesian decision theory and psychophysics, pages 123–161. Cambridge University Press, 1996.
[5] Y Weiss, E P Simoncelli, and E H Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5:598–604, 2002.
[6] A Stocker and E P Simoncelli. Constraining a Bayesian model of human visual speed perception. Adv in Neural Info Processing Systems, 17, 2004.
[7] D Kersten, P Mamassian, and A Yuille. Object perception as Bayesian inference. Annual Review of Psychology, 55:271–304, 2004.
[8] K Kording and D Wolpert. Bayesian integration in sensorimotor learning. Nature, 427:244–247, 2004.
[9] L Parkes, J Lund, A Angelucci, J Solomon, and M Morgan. Compulsory averaging of crowded orientation signals in human vision. Nature Neuroscience, 4:739–744, 2001.
[10] D G Pelli, M Palomares, and N J Majaj. Crowding is unlike ordinary masking: Distinguishing feature integration from detection. Journal of Vision, 4:1136–1169, 2002.
[11] J Solomon, F M Felisberti, and M Morgan. Crowding and the tilt illusion: Toward a unified account. Journal of Vision, 4:500–508, 2004.
[12] J A Bednar and R Miikkulainen. Tilt aftereffects in a self-organizing model of the primary visual cortex. Neural Computation, 12:1721–1740, 2000.
[13] C W Clifford, P Wenderoth, and B Spehar. A functional angle on some after-effects in cortical vision. Proc Biol Sci, 1454:1705–1710, 2000.
[14] M K Kapadia, G Westheimer, and C D Gilbert. Spatial distribution of contextual interactions in primary visual cortex and in visual perception. J Neurophysiology, 4:2048–262, 2000.
[15] C C Chen and C W Tyler. Lateral modulation of contrast discrimination: Flanker orientation effects. Journal of Vision, 2:520–530, 2002.
[16] I Mareschal, M P Sceniak, and R M Shapley. Contextual influences on orientation discrimination: binding local and global cues. Vision Research, 41:1915–1930, 2001.
[17] D Mumford. Elastica and computer vision. In Chandrajit Bajaj, editor, Algebraic geometry and its applications. Springer Verlag, 1994.
[18] T K Leung and J Malik. Contour continuity in region based image segmentation. In Proc. ECCV, pages 544–559, 1998.
[19] E Sharon, A Brandt, and R Basri. Completion energies and scale. IEEE Pat. Anal. Mach. Intell., 22(10), 1997.
[20] S W Zucker, C David, A Dobbins, and L Iverson. The organization of curve detection: coarse tangent fields. Computer Graphics and Image Processing, 9(3):213–234, 1988.
[21] S Ullman. Filling in the gaps: the shape of subjective contours and a model for their generation. Biological Cybernetics, 25:1–6, 1976.
[22] G E Hinton and A D Brown. Spiking Boltzmann machines. Adv in Neural Info Processing Systems, 12, 1998.
[23] R A Jacobs. What determines visual cue reliability? Trends in Cognitive Sciences, 6:345–350, 2002.
[24] Z Li. A saliency map in primary visual cortex. Trends in Cognitive Science, 6:9–16, 2002.
[25] D J Field, A Hayes, and R F Hess. Contour integration by the human visual system: evidence for a local “association field”. Vision Research, 33:173–193, 1993.
[26] J Beck, A Rosenfeld, and R Ivry. Line segregation. Spatial Vision, 4:75–101, 1989.
[27] M Sigman, G A Cecchi, C D Gilbert, and M O Magnasco. On a common circle: Natural scenes and gestalt rules. PNAS, 98(4):1935–1940, 2001.
[28] S Mahumad, L R Williams, K K Thornber, and K Xu. Segmentation of multiple salient closed contours from real images. IEEE Pat. Anal. Mach. Intell., 25(4):433–444, 1997.
[29] W S Geisler, J S Perry, B J Super, and D P Gallogly. Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 6:711–724, 2001.
[30] J H Elder and R M Goldberg. Ecological statistics of gestalt laws for the perceptual organization of contours. Journal of Vision, 4:324–353, 2002.
[31] S R Lehky and T J Sejnowski. Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. Journal of Neuroscience, 10:2281–2299, 1990.

3 0.59949672 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-Philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error, which is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization cannot vanish by construction, and which is shown for the first time to be a consistent density level set estimator.
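As a rough illustration of the abstract's reading of the one-class SVM as a density level-set estimator (using scikit-learn, which is an assumption here and not the paper's machinery), one can fit a Gaussian-kernel one-class SVM and flag points outside the estimated high-density region:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # training sample from a 2-D Gaussian

# Gaussian RBF kernel; gamma plays the role of an inverse squared bandwidth,
# and nu upper-bounds the fraction of training points left outside the region.
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X)

X_test = np.array([[0.0, 0.0], [4.0, 4.0]])
print(clf.predict(X_test))   # +1 = inside the estimated level set, -1 = outside
```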

4 0.42632192 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

Author: Amir Navot, Lavi Shpigelman, Naftali Tishby, Eilon Vaadia

Abstract: We present a non-linear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.
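A minimal sketch of the ingredients named in this abstract — feature weights, k-nearest-neighbor regression, and the leave-one-out error — is given below; it is only an illustration under those assumptions, not the authors' algorithm, which in particular also learns the weights.

```python
import numpy as np

def loo_error_weighted_knn(X, y, w, k=5):
    """Leave-one-out squared error of a feature-weighted k-NN regressor.
    Features are scaled by the weights w before computing Euclidean distances."""
    Xw = X * w
    n = len(y)
    err = 0.0
    for i in range(n):
        d = np.linalg.norm(Xw - Xw[i], axis=1)
        d[i] = np.inf                       # hold out the i-th point
        nn = np.argsort(d)[:k]              # its k nearest neighbors
        err += (y[i] - y[nn].mean()) ** 2   # squared prediction error
    return err / n

# Toy usage: down-weighting an irrelevant feature should lower the LOO error.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sin(X[:, 0])                         # only the first feature matters
print(loo_error_weighted_knn(X, y, w=np.array([1.0, 1.0])))
print(loo_error_weighted_knn(X, y, w=np.array([1.0, 0.0])))
```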

5 0.42504835 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

Author: Edward Meeds, Simon Osindero

Abstract: We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective 'gating network' for the different experts.

6 0.42392749 45 nips-2005-Conditional Visual Tracking in Kernel Space

7 0.4221383 74 nips-2005-Faster Rates in Regression via Active Learning

8 0.42198649 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

9 0.42117709 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models

10 0.4198519 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

11 0.41741374 23 nips-2005-An Application of Markov Random Fields to Range Sensing

12 0.41733333 144 nips-2005-Off-policy Learning with Options and Recognizers

13 0.41633534 195 nips-2005-Transfer learning for text classification

14 0.41609004 151 nips-2005-Pattern Recognition from One Example by Chopping

15 0.41543099 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

16 0.41537961 136 nips-2005-Noise and the two-thirds power Law

17 0.41522831 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

18 0.4143441 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

19 0.41372693 30 nips-2005-Assessing Approximations for Gaussian Process Classification

20 0.41281608 48 nips-2005-Context as Filtering
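The numbers attached to the entries above read as document-similarity scores. One plausible, but here unverified, way to produce such a ranking is to compare TF-IDF vectors with cosine similarity, as in the sketch below; the paper IDs and abstract snippets are placeholders, not the actual pipeline behind these scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical snippets; in practice these would be the papers' full texts.
docs = {
    "161": "radial basis function network multi-task learning ...",
    "47":  "consistency one-class SVM Gaussian RBF kernel regularization ...",
    "132": "nearest neighbor feature selection regression neural activity ...",
}

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(list(docs.values()))    # one row per document

# Similarity of the first document (the query paper) to all documents.
scores = cosine_similarity(tfidf[0], tfidf).ravel()
for doc_id, score in sorted(zip(docs, scores), key=lambda t: -t[1]):
    print(f"{doc_id}: {score:.4f}")
```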