nips nips2009 nips2009-227 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zahi Karam, Douglas Sturim, William M. Campbell
Abstract: Speaker comparison, the process of finding the speaker similarity between two speech signals, occupies a central role in a variety of applications—speaker verification, clustering, and identification. Speaker comparison can be placed in a geometric framework by casting the problem as a model comparison process. For a given speech signal, feature vectors are produced and used to adapt a Gaussian mixture model (GMM). Speaker comparison can then be viewed as the process of compensating and finding metrics on the space of adapted models. We propose a framework, inner product discriminant functions (IPDFs), which extends many common techniques for speaker comparison—support vector machines, joint factor analysis, and linear scoring. The framework uses inner products between the parameter vectors of GMM models motivated by several statistical methods. Compensation of nuisances is performed via linear transforms on GMM parameter vectors. Using the IPDF framework, we show that many current techniques are simple variations of each other. We demonstrate, on a 2006 NIST speaker recognition evaluation task, new scoring methods using IPDFs which produce excellent error rates and require significantly less computation than current techniques.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Speaker comparison, the process of finding the speaker similarity between two speech signals, occupies a central role in a variety of applications—speaker verification, clustering, and identification. [sent-12, score-0.613]
2 Speaker comparison can be placed in a geometric framework by casting the problem as a model comparison process. [sent-13, score-0.169]
3 For a given speech signal, feature vectors are produced and used to adapt a Gaussian mixture model (GMM). [sent-14, score-0.162]
4 We propose a framework, inner product discriminant functions (IPDFs), which extends many common techniques for speaker comparison—support vector machines, joint factor analysis, and linear scoring. [sent-16, score-0.843]
5 The framework uses inner products between the parameter vectors of GMM models motivated by several statistical methods. [sent-17, score-0.224]
6 We demonstrate, on a 2006 NIST speaker recognition evaluation task, new scoring methods using IPDFs which produce excellent error rates and require significantly less computation than current techniques. [sent-20, score-0.673]
7 1 Introduction Comparing speakers in speech signals is a common operation in many applications including forensic speaker recognition, speaker clustering, and speaker verification. [sent-21, score-1.763]
8 Recent popular approaches to text-independent comparison include Gaussian mixture models (GMMs) [1], support vector machines [2, 3], and combinations of these techniques. [sent-22, score-0.131]
9 Comparing speech utterances with kernel functions has been a common theme in the speaker recognition SVM literature [2, 3, 4]. [sent-26, score-0.795]
10 The kernel function is an inner product which induces a metric on the set of vectors, so comparison is analogous to finding the distances between SVM expansion vectors. [sent-33, score-0.371]
11 A recent trend in the speaker recognition literature has been to move towards a more linear geometric view for non-SVM systems. [sent-34, score-0.64]
12 Also, comparison of utterances via linear scoring is presented in [6]. [sent-36, score-0.205]
13 These approaches have introduced many new ideas and perform well in speaker comparison tasks. [sent-37, score-0.627]
14 An unrealized effort in speaker recognition is to bridge the gap between SVMs and some of the new proposed GMM methods. [sent-38, score-0.609]
15 One difficulty is that most SVM kernel functions in speaker comparison must satisfy the Mercer condition. [sent-39, score-0.686]
16 This restricts the scope of investigation of potential comparison strategies for two speaker utterances. [sent-40, score-0.627]
17 Therefore, in this paper, we introduce the idea of inner product discriminant functions (IPDFs). [sent-41, score-0.248]
18 First, we map input utterances to vectors of fixed dimension. [sent-43, score-0.142]
19 Typically, this compensation takes the form of a linear transform. [sent-45, score-0.214]
20 Third, we compare two compensated vectors with an inner product. [sent-46, score-0.216]
21 First, we show that many of the common techniques such as factor analysis, nuisance projection, and various types of scoring can be placed in the framework. [sent-49, score-0.236]
22 Second, we systematically describe the various inner product and compensation techniques used in the literature. [sent-50, score-0.389]
23 In Section 2, we describe the general setup for speaker comparison using GMMs. [sent-54, score-0.627]
24 Section 4 explores inner products for the IPDF framework. [sent-56, score-0.179]
25 In Section 6, we perform experiments on the NIST 2006 speaker recognition evaluation and explore different combinations of IPDF comparisons and compensations. [sent-58, score-0.609]
26 2 Speaker Comparison A standard distribution used for text-independent speaker recognition is the Gaussian mixture model [1], g(x) = Σ_{i=1}^{N} λ_i N(x | m_i, Σ_i). (1) [sent-59, score-0.671]
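The mixture density in (1) can be sketched directly. The function below is a minimal illustration, not the authors' code; it assumes diagonal covariances, and all names are illustrative:

```python
import numpy as np

def gmm_density(x, weights, means, variances):
    """Evaluate g(x) = sum_i lambda_i N(x | m_i, Sigma_i), where each
    Sigma_i is diagonal (variances[i] holds its diagonal)."""
    x = np.asarray(x, dtype=float)
    d = x.size
    total = 0.0
    for lam, m, var in zip(weights, means, variances):
        # Normalizing constant of a diagonal-covariance Gaussian
        norm = (2.0 * np.pi) ** (-d / 2.0) / np.sqrt(np.prod(var))
        # Squared Mahalanobis distance under the diagonal covariance
        maha = np.sum((x - m) ** 2 / var)
        total += lam * norm * np.exp(-0.5 * maha)
    return total
```

With a single zero-mean, unit-variance component this reduces to the standard normal density.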
27 We map a sequence of feature vectors, x_1^{N_x}, from a speaker to a GMM by adapting a GMM universal background model (UBM). [sent-61, score-0.583]
28 The adaptation yields new parameters which we stack into a parameter vector, a_x, where a_x = [λ_x^t m_x^t]^t, with λ_x = [λ_{x,1} · · · λ_{x,N}]^t, (2) and m_x = [m_{x,1}^t · · · m_{x,N}^t]^t. (3) [sent-66, score-0.432]
29 In speaker comparison, the problem is to compare two sequences of feature vectors, e.g., x_1^{N_x} and y_1^{N_y}. [sent-67, score-0.558]
30 To compare these two sequences, we adapt a GMM UBM to produce two sets of parameter vectors, ax and ay , as in (2). [sent-70, score-0.244]
31 The goal of our speaker comparison process can now be recast as a function that compares the two parameter vectors, C(ax , ay ), and produces a value that reflects the similarity of the speakers. [sent-71, score-0.771]
32 2 3 Inner Product Discriminant Functions The basic framework we propose for speaker comparison functions is composed of two parts— compensation and comparison. [sent-73, score-0.841]
33 For compensation, the parameter vectors generated by adaptation in (2) can be transformed to remove nuisances or projected onto a speaker subspace. [sent-74, score-0.778]
34 For the comparison of parameter vectors, we will consider natural distances that result in inner products between parameter vectors. [sent-76, score-0.248]
35 We propose the following inner product discriminant function (IPDF) framework for exploring speaker comparison, C(ax , ay ) = (Lx ax )t D2 (Ly ay ) (4) where Lx , Ly are linear transforms and potentially dependent on λx and/or λy . [sent-77, score-1.171]
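The comparison function of Eq. (4) can be sketched in a few lines. This is an illustrative sketch, assuming a diagonal metric D = diag(d); the function and argument names are not from the paper:

```python
import numpy as np

def ipdf(a_x, a_y, L_x, L_y, d):
    """IPDF of Eq. (4): C(a_x, a_y) = (L_x a_x)^t D^2 (L_y a_y), D = diag(d)."""
    return (L_x @ a_x) @ ((d ** 2) * (L_y @ a_y))
```

With L_x = L_y = I and d = 1 this is the plain inner product a_x^t a_y; the paper's systems differ only in their choices of L and D.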
36 Several questions from this framework are: 1) what inner product gives the best speaker comparison performance, 2) what compensation strategy works best, 3) what tradeoffs can be made between accuracy and computational cost, and 4) how do the compensation and the inner product interact. [sent-82, score-1.405]
37 4 Inner Products for IPDFs In general, an inner product of the parameters should be based on a distance arising from a statistical comparison. [sent-84, score-0.175]
38 For the GMMs g_x and g_y with parameters a_x and a_y, D(g_x ∥ g_y) ≤ D(λ_x ∥ λ_y) + Σ_{i=1}^{N} λ_{x,i} D(N(·; m_{x,i}, Σ_i) ∥ N(·; m_{y,i}, Σ_i)). (5) [sent-91, score-0.316]
39 Symmetrizing yields d_s(a_x, a_y) = D_s(λ_x ∥ λ_y) + Σ_{i=1}^{N} 0.5(λ_{x,i} + λ_{y,i})(m_{x,i} − m_{y,i})^t Σ_i^{-1} (m_{x,i} − m_{y,i}). (6) We focus on the second term in (6) for this paper, but note that the first term could also be converted to a comparison function on the mixture weights. [sent-96, score-0.28]
40 Using polarization on the second term, we obtain the inner product C_KL(a_x, a_y) = Σ_{i=1}^{N} 0.5(λ_{x,i} + λ_{y,i}) m_{x,i}^t Σ_i^{-1} m_{y,i}. (7) [sent-97, score-0.175]
41 Note that (7) can also be expressed more compactly as C_KL(a_x, a_y) = m_x^t ((0.5(λ_x + λ_y)) ⊗ I_n) Σ^{-1} m_y (8) where Σ is the block-diagonal matrix of UBM covariances. [sent-100, score-0.29]
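A sketch of the C_KL inner product, taking it (as the surrounding derivation suggests) to be the per-mixture cross terms m_{x,i}^t Σ_i^{-1} m_{y,i} weighted by 0.5(λ_{x,i} + λ_{y,i}). Diagonal covariances and all names are illustrative assumptions:

```python
import numpy as np

def c_kl(lam_x, m_x, lam_y, m_y, variances):
    """C_KL(a_x, a_y) = sum_i 0.5 (lam_x[i] + lam_y[i]) m_x[i]^t Sigma_i^{-1} m_y[i]
    with diagonal Sigma_i (variances[i] is its diagonal)."""
    score = 0.0
    for lx, mx, ly, my, var in zip(lam_x, m_x, lam_y, m_y, variances):
        score += 0.5 * (lx + ly) * np.sum(mx * my / var)
    return score
```

Note the score is symmetric in its two arguments; fixing the weights to the UBM mixture weights recovers a fixed-weight inner product of the KL-kernel type.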
42 2 GLDS kernel (CGLDS ) An alternate inner product approach is to use generalized linear discriminants and the corresponding kernel [4]. [sent-106, score-0.293]
43 The average expansion over an utterance is b_x = (1/N_x) Σ_{i=1}^{N_x} b(x_i). (10) The corresponding kernel between two sequences is then K_GLDS(x_1^{N_x}, y_1^{N_y}) = b_x^t Γ^{-1} b_y, (11) where Γ = (1/N_z) Σ_{i=1}^{N_z} b(z_i) b(z_i)^t, (12) and z_1^{N_z} is a large set of feature vectors which is representative of the speaker population. [sent-110, score-0.662]
44 Thus, the GLDS kernel inner product is CGLDS (ax , ay ) = (mx − m)t (λx ⊗ In )Γ−1 (λy ⊗ In )(my − m). [sent-113, score-0.355]
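With the common diagonal approximation of Γ, the C_GLDS expression above decomposes per mixture component as λ_{x,i} λ_{y,i} (m_{x,i} − m_i)^t Γ_i^{-1} (m_{y,i} − m_i). A sketch under that assumption (names illustrative, not the authors' code):

```python
import numpy as np

def c_glds(lam_x, m_x, lam_y, m_y, m_ubm, gamma):
    """C_GLDS(a_x, a_y) = (m_x - m)^t (lam_x kron I_n) Gamma^{-1} (lam_y kron I_n)(m_y - m)
    with Gamma diagonal per mixture (gamma[i] is the diagonal of block i)."""
    score = 0.0
    for lx, mx, ly, my, mu, g in zip(lam_x, m_x, lam_y, m_y, m_ubm, gamma):
        score += lx * ly * np.sum((mx - mu) * (my - mu) / g)
    return score
```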
45 3 Gaussian-Distributed Vectors A common assumption in the factor analysis literature [5] is that the parameter vector mx as x varies has a Gaussian distribution. [sent-117, score-0.185]
46 We now apply the discriminant function to another mean vector, my , and obtain the following comparison function CG (ax , ay ) = (mx − m)t Υ−1 (my − m). [sent-123, score-0.263]
47 The resulting kernel is C_GM(a_x, a_y) = (m_x − m)^t (λ_x^{1/2} ⊗ I_n) Σ^{-1} (λ_y^{1/2} ⊗ I_n)(m_y − m) (18) where Σ is the block diagonal UBM covariances. [sent-129, score-0.216]
48 The main variations are the choice of covariance in the inner product and the choice of normalization of the gradient term. [sent-132, score-0.199]
49 The resulting comparison function is C_F(a_x, a_y) = ((λ_x ⊗ I_n) Σ^{-1} (m_x − m))^t Φ^{-1} (λ_y ⊗ I_n) Σ^{-1} (m_y − m) (19) where Φ is a diagonal matrix acting as a variance normalizer. [sent-135, score-0.226]
50 Another form of inner product may be derived from the linear Qscoring shown in [6]. [sent-137, score-0.175]
51 In this case, the scoring is given as (mtrain − m)t Σ−1 (F − Nm) where N and F are the zeroth and first order sufficient statistics of a test utterance, m is the UBM means, mtrain is the mean of a training model, and Σ is the block diagonal UBM covariances. [sent-138, score-0.159]
52 A close approximation of this function can be made by using a small relevance factor in MAP adaptation of the means to obtain the following comparison function CQ (ax , ay ) = (mx − m)t Σ−1 (λy ⊗ In )(my − m). [sent-139, score-0.308]
53 (20) Note that if we symmetrize CQ , this gives us CKL ; this analysis ignores for a moment that in [6], compensation is also asymmetric. [sent-140, score-0.214]
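The symmetrization claim can be checked numerically: averaging C_Q(a_x, a_y) and C_Q(a_y, a_x) yields a C_KL-style score with weights 0.5(λ_x + λ_y) on the UBM-centered means. A sketch with diagonal Σ per mixture (all names illustrative):

```python
import numpy as np

def c_q(m_x, lam_y, m_y, m_ubm, variances):
    """C_Q of Eq. (20): (m_x - m)^t Sigma^{-1} (lam_y kron I_n)(m_y - m),
    with diagonal Sigma per mixture (variances[i] is its diagonal)."""
    s = 0.0
    for mx, ly, my, mu, var in zip(m_x, lam_y, m_y, m_ubm, variances):
        s += ly * np.sum((mx - mu) * (my - mu) / var)
    return s
```

Since the centered cross term is symmetric in x and y, only the weights differ between C_Q(a_x, a_y) and C_Q(a_y, a_x), and their average carries the weight 0.5(λ_{x,i} + λ_{y,i}).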
54 By assuming the mixture weights are constant and equal to the UBM mixture weights in the comparison function C_KL (7), we obtain the KL kernel, K_KL(m_x, m_y) = m_x^t (λ ⊗ I_n) Σ^{-1} m_y (21) where λ are the UBM mixture weights. [sent-142, score-0.303]
55 This kernel has been used extensively in SVM speaker recognition [2]. [sent-143, score-0.668]
56 An analysis of the different inner products in the preceding sections shows that many of the methods presented in the literature have a similar form, but are interestingly derived with quite disparate techniques. [sent-144, score-0.179]
57 5 Compensation in IPDFs Our next task is to explore compensation methods for IPDFs. [sent-146, score-0.214]
58 With these methods, the fundamental assumption is that either speakers and/or nuisances are confined to a small subspace in the parameter vector space. [sent-148, score-0.192]
59 Standard notation is to use U to represent the nuisance subspace and to have V represent the speaker subspace. [sent-150, score-0.75]
60 Our goal in this section is to recast many of the methods in the literature in a standard framework with oblique and orthogonal projections. [sent-151, score-0.149]
61 We define an orthogonal projection with respect to a metric, P_{U,D}, where D and U are full rank matrices, as P_{U,D} = U(U^t D^2 U)^{-1} U^t D^2 (22) where DU is a linearly independent set, and the metric is ∥x − y∥_D = ∥Dx − Dy∥_2. [sent-153, score-0.188]
62 For convenience, we also define the projection onto the orthogonal complement of U, U^⊥, as Q_{U,D} = P_{U^⊥,D} = I − P_{U,D}. [sent-157, score-0.177]
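The projection of Eq. (22) and its complement can be sketched for a diagonal metric D = diag(d). This is an illustrative sketch (U is assumed full column rank with DU linearly independent; names are not from the paper):

```python
import numpy as np

def p_ud(U, d):
    """P_{U,D} = U (U^t D^2 U)^{-1} U^t D^2, the D-orthogonal projection onto range(U)."""
    d2 = d ** 2
    A = U.T @ (d2[:, None] * U)              # U^t D^2 U
    return U @ np.linalg.solve(A, U.T * d2)  # U A^{-1} U^t D^2

def q_ud(U, d):
    """Q_{U,D} = I - P_{U,D}, projection onto the D-orthogonal complement of U."""
    return np.eye(U.shape[0]) - p_ud(U, d)
```

P_{U,D} is idempotent and fixes range(U), so Q_{U,D} annihilates range(U), which is exactly the nuisance-removal step m_x → Q_{U,D} m_x used below.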
63 Note that we can regularize the projection PU,D by adding a diagonal term to the inverse in (22); the resulting operation remains linear but is no longer a projection. [sent-158, score-0.138]
64 We also define the oblique projection onto V with null space U + (U + V )⊥ and metric induced by D. [sent-159, score-0.26]
65 Then, the oblique (non-orthogonal) projection onto V is OV,U,D = V (Qt D2 V )−1 Qt D2 . [sent-163, score-0.217]
66 1 Nuisance Attribute Projection (NAP) A framework for eliminating nuisances in the parameter vector based on projection was shown in [2]. [sent-166, score-0.203]
67 The basic idea is to assume that nuisances are confined to a small subspace and can be removed via an orthogonal projection, mx → QU,D mx . [sent-167, score-0.497]
68 One justification for using subspaces comes from the perspective that channel classification can be performed with inner products along one-dimensional subspaces. [sent-168, score-0.216]
69 The NAP projection uses the metric induced by a kernel in an SVM. [sent-170, score-0.204]
70 We note that since D is known a priori to speaker comparison, we can orthonormalize the matrix DU and apply the projection as a matrix multiply. [sent-172, score-0.66]
71 In (24), the matrix U represents a nuisance subspace, and V represents a speaker subspace. [sent-178, score-0.693]
72 Existing work on this approach for speaker recognition uses both maximum likelihood (ML) estimates and MAP estimates of x and y [9, 5]. [sent-179, score-0.609]
73 In this case, the term V y is replaced by a constant vector representing the true speaker model, ms ; the goal is then to estimate x. [sent-183, score-0.663]
74 Now use relevance MAP with a small relevance factor and F and N to obtain ms ; i. [sent-192, score-0.22]
75 We obtain U t Σ−1 (λs ⊗ In )U x = U t Σ−1 (λs ⊗ In ) [ms − m] (26) where λs is the speaker dependent mixture weights. [sent-195, score-0.62]
76 Once the solution to (26) is obtained, the resulting U x is subtracted from an estimate of the speaker mean, ms to obtain the compensated mean. [sent-198, score-0.701]
77 If we assume that m_s is obtained by a relevance MAP adaptation from the statistics F and N with a small relevance factor, then the FA process is well approximated by m_s → Q_{U,D} m_s (27) where D = (λ^{1/2} ⊗ I_n) Σ^{-1/2}. [sent-199, score-0.46]
78 , the JFA process is ms → OV,U,I P[UV ],D ms = OV,U,D ms . [sent-205, score-0.315]
79 3 Comments and Analysis Several comments should be made on compensation schemes and their use in speaker comparison. [sent-207, score-0.772]
80 A second observation is that the JFA oblique projection onto V has substantially different properties than a standard orthogonal projection. [sent-211, score-0.26]
81 When JFA is used in speaker recognition [5, 6], typically JFA is performed in training, but the test utterance is compensated only with FA. [sent-212, score-0.714]
82 One difficulty is that the metric used for the inner product may not correspond to the metric for compensation. [sent-217, score-0.261]
83 6 Speaker Comparison Experiments Experiments were performed on the NIST 2006 speaker recognition evaluation (SRE) data set. [sent-220, score-0.609]
84 The dimension of the nuisance subspace, U , was fixed at 100; the dimension of the speaker space, V , was fixed at 300. [sent-227, score-0.693]
85 Second, the compensation method that dominates good performance is an orthogonal complement of the nuisance subspace, QU,D . [sent-236, score-0.392]
86 Combining a nuisance projection with an oblique projection is fine, but [continuation lost in extraction]. [Table 1: A comparison of baseline systems and different IPDF implementations; the comparison functions include the baseline SVM, the baseline JFA (C_Q), C_KL, C_GM, K_KL, C_GLDS, C_G, and C_F, each with its enroll/test compensation; the table body is not recoverable.] [sent-237, score-0.525]
87 We also observe that a nuisance projection with fixed DUBM gives similar performance to a projection involving a “variable” metric, Di . [sent-322, score-0.339]
88 7 Conclusions and future work We proposed a new framework for speaker comparison, IPDFs, and showed that several recent systems in the speaker recognition literature can be placed in this framework. [sent-327, score-1.167]
89 We demonstrated that using mixture weights in the inner product is the key component to achieve significant reductions in error rates over a baseline SVM system. [sent-328, score-0.271]
90 We also showed that elimination of the nuisance subspace via an orthogonal projection is a computationally simple and effective method of compensation. [sent-329, score-0.337]
91 Most effective methods of compensation in the literature (NAP, FA, JFA) are straightforward variations of this idea. [sent-330, score-0.214]
92 Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proc. [sent-348, score-0.617]
93 Gales, “Derivative and parametric kernels for speaker verification,” in Proc. [sent-355, score-0.558]
94 Campbell, “Generalized linear discriminant sequence kernels for speaker recognition,” in Proc. [sent-360, score-0.631]
95 Dumouchel, “A study of inter-speaker variability in speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, 2008. [sent-368, score-0.558]
96 [6] Ondrej Glembek, Lukas Burget, Najim Dehak, Niko Brummer, and Patrick Kenny, “Comparison of scoring methods used in speaker recognition with joint factor analysis,” in Proc. [sent-369, score-0.71]
97 [9] Simon Lucey and Tsuhan Chen, “Improved speaker verification through probabilistic subspace adaptation,” in Proc. [sent-382, score-0.615]
98 [10] Robbie Vogt, Brendan Baker, and Sridha Sriharan, “Modelling session variability in text-independent speaker verification,” in Proc. [sent-385, score-0.558]
99 Le, “NIST speaker recognition evaluations utilizing the Mixer corpora—2004,2005,2006,” IEEE Trans. [sent-401, score-0.609]
100 [13] Roland Auckenthaler, Michael Carey, and Harvey Lloyd-Thomas, “Score normalization for textindependent speaker verification systems,” Digital Signal Processing, vol. [sent-407, score-0.558]
wordName wordTfidf (topN-words)
[('speaker', 0.558), ('ubm', 0.302), ('compensation', 0.214), ('ckl', 0.201), ('jfa', 0.201), ('ipdfs', 0.168), ('gmm', 0.159), ('cgm', 0.151), ('ipdf', 0.151), ('mx', 0.148), ('nuisance', 0.135), ('inner', 0.133), ('ax', 0.123), ('ay', 0.121), ('ms', 0.105), ('projection', 0.102), ('nap', 0.101), ('nuisances', 0.101), ('fa', 0.098), ('nist', 0.094), ('cq', 0.088), ('xnx', 0.088), ('glds', 0.084), ('oblique', 0.083), ('eer', 0.073), ('eng', 0.073), ('discriminant', 0.073), ('utterances', 0.072), ('comparison', 0.069), ('utterance', 0.067), ('kkl', 0.067), ('mindcf', 0.067), ('sre', 0.067), ('veri', 0.066), ('scoring', 0.064), ('mixture', 0.062), ('kl', 0.061), ('kernel', 0.059), ('subspace', 0.057), ('speech', 0.055), ('ml', 0.055), ('svm', 0.054), ('recognition', 0.051), ('cglds', 0.05), ('enroll', 0.05), ('htk', 0.05), ('lexington', 0.05), ('lincoln', 0.05), ('sturim', 0.05), ('mt', 0.048), ('products', 0.046), ('vectors', 0.045), ('gmms', 0.044), ('orthogonal', 0.043), ('metric', 0.043), ('adaptation', 0.042), ('product', 0.042), ('relevance', 0.039), ('cf', 0.038), ('compensated', 0.038), ('subspaces', 0.037), ('factor', 0.037), ('diagonal', 0.036), ('interspeech', 0.036), ('speakers', 0.034), ('dehak', 0.034), ('dubm', 0.034), ('gales', 0.034), ('kenny', 0.034), ('lx', 0.034), ('ly', 0.034), ('mtrain', 0.034), ('baseline', 0.034), ('campbell', 0.033), ('cx', 0.033), ('onto', 0.032), ('geometric', 0.031), ('icassp', 0.03), ('conversation', 0.029), ('ds', 0.028), ('audio', 0.027), ('nz', 0.027), ('compensating', 0.027), ('qt', 0.026), ('metrics', 0.026), ('reynolds', 0.025), ('precomputed', 0.025), ('argminx', 0.025), ('cg', 0.025), ('zeroth', 0.025), ('expansion', 0.025), ('divergence', 0.025), ('map', 0.025), ('xi', 0.025), ('gy', 0.024), ('gx', 0.024), ('covariance', 0.024), ('laboratory', 0.023), ('recast', 0.023), ('stacked', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions
2 0.14319289 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks
Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng
Abstract: In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks. 1
3 0.13775943 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations
Author: Arthur Choi, Adnan Darwiche
Abstract: We introduce a new perspective on approximations to the maximum a posteriori (MAP) task in probabilistic graphical models, that is based on simplifying a given instance, and then tightening the approximation. First, we start with a structural relaxation of the original model. We then infer from the relaxation its deficiencies, and compensate for them. This perspective allows us to identify two distinct classes of approximations. First, we find that max-product belief propagation can be viewed as a way to compensate for a relaxation, based on a particular idealized case for exactness. We identify a second approach to compensation that is based on a more refined idealized case, resulting in a new approximation with distinct properties. We go on to propose a new class of algorithms that, starting with a relaxation, iteratively seeks tighter approximations. 1
4 0.095389761 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds
Author: Paris Smaragdis, Madhusudana Shashanka, Bhiksha Raj
Abstract: In this paper we present an algorithm for separating mixed sounds from a monophonic recording. Our approach makes use of training data which allows us to learn representations of the types of sounds that compose the mixture. In contrast to popular methods that attempt to extract compact generalizable models for each sound from training data, we employ the training data itself as a representation of the sources in the mixture. We show that mixtures of known sounds can be described as sparse combinations of the training data itself, and in doing so produce significantly better separation results as compared to similar systems based on compact statistical models. Keywords: Example-Based Representation, Signal Separation, Sparse Models. 1
Author: Natasha Singh-miller, Michael Collins
Abstract: We consider the problem of using nearest neighbor methods to provide a conditional probability estimate, P (y|a), when the number of labels y is large and the labels share some underlying structure. We propose a method for learning label embeddings (similar to error-correcting output codes (ECOCs)) to model the similarity between labels within a nearest neighbor framework. The learned ECOCs and nearest neighbor information are used to provide conditional probability estimates. We apply these estimates to the problem of acoustic modeling for speech recognition. We demonstrate significant improvements in terms of word error rate (WER) on a lecture recognition task over a state-of-the-art baseline GMM model. 1
6 0.049407043 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
7 0.049013786 128 nips-2009-Learning Non-Linear Combinations of Kernels
8 0.045663837 129 nips-2009-Learning a Small Mixture of Trees
9 0.043273721 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm
10 0.04113429 196 nips-2009-Quantification and the language of thought
11 0.041030128 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
12 0.040869694 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
13 0.040372398 157 nips-2009-Multi-Label Prediction via Compressed Sensing
14 0.039999232 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
15 0.037882272 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
16 0.037108768 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
17 0.035992302 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
18 0.035356671 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
19 0.035199512 97 nips-2009-Free energy score space
20 0.03488652 119 nips-2009-Kernel Methods for Deep Learning
topicId topicWeight
[(0, -0.126), (1, 0.032), (2, -0.015), (3, 0.049), (4, -0.019), (5, -0.036), (6, 0.007), (7, 0.03), (8, -0.022), (9, 0.008), (10, 0.022), (11, 0.0), (12, -0.032), (13, 0.039), (14, 0.031), (15, -0.041), (16, 0.072), (17, 0.034), (18, -0.043), (19, -0.0), (20, -0.029), (21, -0.057), (22, 0.01), (23, 0.054), (24, -0.083), (25, 0.005), (26, 0.159), (27, 0.05), (28, -0.022), (29, 0.067), (30, -0.091), (31, 0.035), (32, 0.081), (33, 0.003), (34, 0.1), (35, -0.023), (36, 0.016), (37, 0.133), (38, 0.096), (39, 0.046), (40, 0.056), (41, -0.153), (42, 0.046), (43, -0.174), (44, 0.051), (45, -0.105), (46, 0.051), (47, -0.067), (48, 0.014), (49, -0.078)]
simIndex simValue paperId paperTitle
same-paper 1 0.91676575 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions
2 0.60140854 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks
3 0.50880533 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations
4 0.41631261 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds
5 0.40714368 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares
Author: Jacob Goldberger, Amir Leshem
Abstract: This paper proposes a new algorithm for the linear least squares problem where the unknown variables are constrained to be in a finite set. The factor graph that corresponds to this problem is very loopy; in fact, it is a complete graph. Hence, applying the Belief Propagation (BP) algorithm yields very poor results. The algorithm described here is based on an optimal tree approximation of the Gaussian density of the unconstrained linear system. It is shown that even though the approximation is not directly applied to the exact discrete distribution, applying the BP algorithm to the modified factor graph outperforms current methods in terms of both performance and complexity. The improved performance of the proposed algorithm is demonstrated on the problem of MIMO detection.
6 0.37147319 31 nips-2009-An LP View of the M-best MAP problem
7 0.36359337 129 nips-2009-Learning a Small Mixture of Trees
8 0.36315981 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming
10 0.35171387 47 nips-2009-Boosting with Spatial Regularization
11 0.35138997 7 nips-2009-A Data-Driven Approach to Modeling Choice
12 0.3456845 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing
13 0.34087846 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation
14 0.33139527 81 nips-2009-Ensemble Nystrom Method
15 0.32321048 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
16 0.31989539 182 nips-2009-Optimal Scoring for Unsupervised Learning
17 0.31834966 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines
18 0.31477645 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs
19 0.3133541 119 nips-2009-Kernel Methods for Deep Learning
20 0.29986256 97 nips-2009-Free energy score space
topicId topicWeight
[(7, 0.016), (22, 0.391), (24, 0.04), (25, 0.044), (35, 0.045), (36, 0.07), (39, 0.032), (51, 0.016), (58, 0.081), (61, 0.014), (71, 0.052), (81, 0.018), (86, 0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.73195761 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions
Author: Zahi Karam, Douglas Sturim, William M. Campbell
Abstract: Speaker comparison, the process of finding the speaker similarity between two speech signals, occupies a central role in a variety of applications—speaker verification, clustering, and identification. Speaker comparison can be placed in a geometric framework by casting the problem as a model comparison process. For a given speech signal, feature vectors are produced and used to adapt a Gaussian mixture model (GMM). Speaker comparison can then be viewed as the process of compensating and finding metrics on the space of adapted models. We propose a framework, inner product discriminant functions (IPDFs), which extends many common techniques for speaker comparison—support vector machines, joint factor analysis, and linear scoring. The framework uses inner products between the parameter vectors of GMM models motivated by several statistical methods. Compensation of nuisances is performed via linear transforms on GMM parameter vectors. Using the IPDF framework, we show that many current techniques are simple variations of each other. We demonstrate, on a 2006 NIST speaker recognition evaluation task, new scoring methods using IPDFs which produce excellent error rates and require significantly less computation than current techniques.
2 0.67557031 21 nips-2009-Abstraction and Relational learning
Author: Charles Kemp, Alan Jern
Abstract: Most models of categorization learn categories defined by characteristic features but some categories are described more naturally in terms of relations. We present a generative model that helps to explain how relational categories are learned and used. Our model learns abstract schemata that specify the relational similarities shared by instances of a category, and our emphasis on abstraction departs from previous theoretical proposals that focus instead on comparison of concrete instances. Our first experiment suggests that abstraction can help to explain some of the findings that have previously been used to support comparison-based approaches. Our second experiment focuses on one-shot schema learning, a problem that raises challenges for comparison-based approaches but is handled naturally by our abstraction-based account. Categories such as family, sonnet, above, betray, and imitate differ in many respects but all of them depend critically on relational information. Members of a family are typically related by blood or marriage, and the lines that make up a sonnet must rhyme with each other according to a certain pattern. A pair of objects will demonstrate “aboveness” only if a certain spatial relationship is present, and an event will qualify as an instance of betrayal or imitation only if its participants relate to each other in certain ways. All of the cases just described are examples of relational categories. This paper develops a computational approach that helps to explain how simple relational categories are acquired. Our approach highlights the role of abstraction in relational learning. Given several instances of a relational category, it is often possible to infer an abstract representation that captures what the instances have in common. We refer to these abstract representations as schemata, although others may prefer to call them rules or theories. 
For example, a sonnet schema might specify the number of lines that a sonnet should include and the rhyming pattern that the lines should follow. Once a schema has been acquired it can support several kinds of inferences. A schema can be used to make predictions about hidden aspects of the examples already observed—if the final word in a sonnet is illegible, the rhyming pattern can help to predict the identity of this word. A schema can be used to decide whether new examples (e.g. new poems) qualify as members of the category. Finally, a schema can be used to generate novel examples of a category (e.g. novel sonnets). Most researchers would agree that abstraction plays some role in relational learning, but Gentner [1] and other psychologists have emphasized the role of comparison instead [2, 3]. Given one example of a sonnet and the task of deciding whether a second poem is also a sonnet, a comparison-based approach might attempt to establish an alignment or mapping between the two. Approaches that rely on comparison or mapping are especially prominent in the literature on analogical reasoning [4, 5], and many of these approaches can be viewed as accounts of relational categorization [6]. For example, the problem of deciding whether two systems are analogous can be formalized as the problem of deciding whether these systems are instances of the same relational category. Despite some notable exceptions [6, 7], most accounts of analogy focus on comparison rather than abstraction, and suggest that “analogy passes from one instance of a generalization to another without pausing for explicit induction of the generalization” (p 95) [8].
Figure 1: A hierarchical generative model for learning and using relational categories (top level: schema s, e.g. ∀Q ∀x ∀y Q(x) < Q(y) ↔ D1(x) < D1(y); middle level: group g; bottom level: observation o). The schema s at the top level is a logical sentence that specifies which groups are valid instances of the category.
The group g at the second level is randomly sampled from the set of valid instances, and the observation o is a partially observed version of group g. Researchers that focus on comparison sometimes discuss abstraction, but typically suggest that abstractions emerge as a consequence of comparing two or more concrete instances of a category [3, 5, 9, 10]. This view, however, will not account for one-shot inferences, or inferences based on a single instance of a relational category. Consider a learner who is shown one instance of a sonnet then asked to create a second instance. Since only one instance is provided, it is hard to see how comparisons between instances could account for success on the task. A single instance, however, will sometimes provide enough information for a schema to be learned, and this schema should allow subsequent instances to be generated [11]. Here we develop a formal framework for exploring relational learning in general and one-shot schema learning in particular. Our framework relies on the hierarchical Bayesian approach, which provides a natural way to combine abstraction and probabilistic inference [12]. The hierarchical Bayesian approach supports representations at multiple levels of abstraction, and helps to explains how abstract representations (e.g. a sonnet schema) can be acquired given observations of concrete instances (e.g. individual sonnets). The schemata we consider are represented as sentences in a logical language, and our approach therefore builds on previous probabilistic methods for learning and using logical theories [13, 14]. Following previous authors, we propose that logical representations can help to capture the content of human knowledge, and that Bayesian inference helps to explain how these representations are acquired and how they support inductive inference. The following sections introduce our framework then evaluate it using two behavioral experiments. 
Our first experiment uses a standard classification task where participants are shown one example of a category then asked to decide which of two alternatives is more likely to belong to the same category. Tasks of this kind have previously been used to argue for the importance of comparison, but we suggest that these tasks can be handled by accounts that focus on abstraction. Our second experiment uses a less standard generation task [15, 16] where participants are shown a single example of a category then asked to generate additional examples. As predicted by our abstraction-based account, we find that people are able to learn relational categories on the basis of a single example.
1 A generative approach to relational learning
Our examples so far have used real-world relational categories such as family and sonnet but we now turn to a very simple domain where relational categorization can be studied. Each element in the domain is a group of components that vary along a number of dimensions—in Figure 1, the components are figures that vary along the dimensions of size, color, and circle position. The groups can be organized into categories—one such category includes groups where every component is black. Although our domain is rather basic it allows some simple relational regularities to be explored. We can consider categories, for example, where all components in a group must be the same along some dimension, and categories where all components must be different along some dimension. We can also consider categories defined by relationships between dimensions—for example, the category that includes all groups where the size and color dimensions are correlated. Each category is associated with a schema, or an abstract representation that specifies which groups are valid instances of the category.
Here we consider schemata that correspond to rules formulated in a logical language.
Table 1: Templates used to construct a hypothesis space of logical schemata. An instance of a given template can be created by choosing an element from each set enclosed in braces, replacing each occurrence of Di or Dj with a dimension (e.g. D1) and replacing each occurrence of vk or vl with a value (e.g. 1).
1. ∀x Di(x) {=, ≠, <, >} vk   (or the ∃x variant)
2. ∀x ∀y x ≠ y → Di(x) {=, ≠, <, >} Di(y)   (or ∃x ∃y x ≠ y ∧ ...)
3. ∀x Di(x) {=, ≠} vk {∧, ∨, ↔} Dj(x) {=, ≠} vl
4. ∀x ∀y x ≠ y → ( Di(x) {=, ≠, <, >} Di(y) {∧, ∨, ↔} Dj(x) {=, ≠, <, >} Dj(y) )
5. ∀Q ∀x ∀y x ≠ y → Q(x) {=, ≠, <, >} Q(y)   (or ∃Q ∃x ∃y x ≠ y ∧ ...)
6. ∀Q Q ≠ Di → ∀x ∀y x ≠ y → ( Q(x) {=, ≠, <, >} Q(y) {∧, ∨, ↔} Di(x) {=, ≠, <, >} Di(y) )   (or ∃Q Q ≠ Di ∧ ...)
7. ∀Q ∀R Q ≠ R → ∀x ∀y x ≠ y → ( Q(x) {=, ≠, <, >} Q(y) {∧, ∨, ↔} R(x) {=, ≠, <, >} R(y) )   (or ∃Q ∃R Q ≠ R ∧ ...)
The language includes three binary connectives—and (∧), or (∨), and if and only if (↔). Four binary relations (=, ≠, <, and >) are available for comparing values along dimensions. Universal quantification (∀x) and existential quantification (∃x) are both permitted, and the language includes quantification over objects (∀x) and dimensions (∀Q). For example, the schema in Figure 1 states that all dimensions are aligned. More precisely, if D1 is the dimension of size, the schema states that for all dimensions Q, a component x is smaller than a component y along dimension Q if and only if x is smaller in size than y. It follows that all three dimensions must increase or decrease together. To explain how rules in this logical language are learned we work with the hierarchical generative model in Figure 1. The representation at the top level is a schema s, and we assume that one or more groups g are generated from a distribution P(g|s).
Following a standard approach to category learning [17, 18], we assume that g is uniformly sampled from all groups consistent with s:
p(g|s) ∝ 1 if g is consistent with s, and 0 otherwise.  (1)
For all applications in this paper, we assume that the number of components in a group is known and fixed in advance. The bottom level of the hierarchy specifies observations o that are generated from a distribution P(o|g). In most cases we assume that g can be directly observed, and that P(o|g) = 1 if o = g and 0 otherwise. We also consider the setting shown in Figure 1 where o is generated by concealing a component of g chosen uniformly at random. Note that the observation o in Figure 1 includes only four of the components in group g, and is roughly analogous to our earlier example of a sonnet with an illegible final word. To convert Figure 1 into a fully-specified probabilistic model it remains to define a prior distribution P(s) over schemata. An appealing approach is to consider all of the infinitely many sentences in the logical language already mentioned, and to define a prior favoring schemata which correspond to simple (i.e. short) sentences. We approximate this approach by considering a large but finite space of sentences that includes all instances of the templates in Table 1 and all conjunctions of these instances. When instantiating one of these templates, each occurrence of Di or Dj should be replaced by one of the dimensions in the domain. For example, the schema in Figure 1 is a simplified instance of template 6 where Di is replaced by D1. Similarly, each instance of vk or vl should be replaced by a value along one of the dimensions. Our first experiment considers a problem where there are three dimensions and three possible values along each dimension (i.e. vk = 1, 2, or 3). As a result there are 1568 distinct instances of the templates in Table 1 and roughly one million conjunctions of these instances.
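A minimal sketch of equation (1), uniform sampling over the groups consistent with a schema, on a small stand-in domain; the three-card, three-dimension, three-value setup and the example schema are assumed for illustration:

```python
import itertools
import random

# Mini domain: a group is 3 cards, each card a tuple of values in {1, 2, 3}
# along three dimensions (D1, D2, D3). Equation (1): p(g|s) is uniform over
# the groups consistent with schema s.
VALUES = (1, 2, 3)
CARDS = list(itertools.product(VALUES, repeat=3))
GROUPS = list(itertools.product(CARDS, repeat=3))

def schema_d1_uniform(group):
    # Example schema: every card takes the same value along D1.
    return len({card[0] for card in group}) == 1

def sample_group(schema, rng):
    consistent = [g for g in GROUPS if schema(g)]
    return rng.choice(consistent)       # uniform over consistent groups

rng = random.Random(0)
g = sample_group(schema_d1_uniform, rng)
assert schema_d1_uniform(g)
# 3 shared D1 values, and each of the 3 cards is free in D2 and D3 (9 ways):
assert len([g for g in GROUPS if schema_d1_uniform(g)]) == 3 * 9 * 9 * 9
```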
Our second experiment uses three dimensions with five values along each dimension, which leads to 2768 template instances and roughly three million conjunctions of these instances. The templates in Table 1 capture most of the simple regularities that can be formulated in our logical language. Template 1 generates all rules that include quantification over a single object variable and no binary connectives. Template 3 is similar but includes a single binary connective. Templates 2 and 4 are similar to 1 and 3 respectively, but include two object variables (x and y) rather than one. Templates 5, 6 and 7 add quantification over dimensions to Templates 2 and 4. Although the templates in Table 1 capture a large class of regularities, several kinds of templates are not included. Since we do not assume that the dimensions are commensurable, values along different dimensions cannot be directly compared (∃x D1(x) = D2(x) is not permitted). For the same reason, comparisons to a dimension value must involve a concrete dimension (∀x D1(x) = 1 is permitted) rather than a dimension variable (∀Q ∀x Q(x) = 1 is not permitted). Finally, we exclude all schemata where quantification over objects precedes quantification over dimensions, and as a result there are some simple schemata that our implementation cannot learn (e.g. ∃x∀y∃Q Q(x) = Q(y)). The extension of each schema is a set of groups, and schemata with the same extension can be assigned to the same equivalence class. For example, ∀x D1(x) = v1 (an instance of template 1) and ∀x D1(x) = v1 ∧ D1(x) = v1 (an instance of template 3) end up in the same equivalence class. Each equivalence class can be represented by the shortest sentence that it contains, and we define our prior P(s) over a set that includes a single representative for each equivalence class.
The prior probability P(s) of each sentence is inversely proportional to its length: P(s) ∝ λ^|s|, where |s| is the length of schema s and λ is a constant between 0 and 1. For all applications in this paper we set λ = 0.8. The generative model in Figure 1 can be used for several purposes, including schema learning (inferring a schema s given one or more instances generated from the schema), classification (deciding whether group gnew belongs to a category given one or more instances of the category) and generation (generating a group gnew that belongs to the same category as one or more instances). Our first experiment explores all three of these problems.
2 Experiment 1: Relational classification
Our first experiment is organized around a triad task where participants are shown one example of a category then asked to decide which of two choice examples is more likely to belong to the category. Triad tasks are regularly used by studies of relational categorization, and have been used to argue for the importance of comparison [1]. A comparison-based approach to this task, for instance, might compare the example object to each of the choice objects in order to decide which is the better match. Our first experiment is intended in part to explore whether a schema-learning approach can also account for inferences about triad tasks. Materials and Method. 18 adults participated for course credit and interacted with a custom-built computer interface. The stimuli were groups of figures that varied along three dimensions (color, size, and ball position, as in Figure 1). Each shape was displayed on a single card, and all groups in Experiment 1 included exactly three cards. The cards in Figure 1 show five different values along each dimension, but Experiment 1 used only three values along each dimension. The experiment included inferences about 10 triads.
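The length-based prior P(s) ∝ λ^|s| with λ = 0.8 can be sketched directly; the schema names and lengths below are hypothetical stand-ins for sentences in the logical language:

```python
# Prior over schemata: P(s) proportional to 0.8 ** len(s), normalized over a
# finite hypothesis space. Shorter (simpler) schemata get more prior mass.
LAMBDA = 0.8
schema_lengths = {"s1": 3, "s2": 5, "s3": 8}   # hypothetical schemata

def length_prior(lengths, lam=LAMBDA):
    raw = {s: lam ** n for s, n in lengths.items()}
    z = sum(raw.values())                       # normalizing constant
    return {s: w / z for s, w in raw.items()}

prior = length_prior(schema_lengths)
assert abs(sum(prior.values()) - 1.0) < 1e-12   # proper distribution
assert prior["s1"] > prior["s2"] > prior["s3"]  # shorter => more probable
```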
Participants were told that aliens from a certain planet “enjoy organizing cards into groups,” and that “any group of cards will probably be liked by some aliens and disliked by others.” The ten triad tasks were framed as questions about the preferences of 10 aliens. Participants were shown a group that Mr X likes (different names were used for the ten triads), then shown two choice groups and told that “Mr X likes one of these groups but not the other.” Participants were asked to select one of the choice groups, then asked to generate another 3-card group that Mr X would probably like. Cards could be added to the screen using an “Add Card” button, and there were three pairs of buttons that allowed each card to be increased or decreased along the three dimensions. Finally, participants were asked to explain in writing “what kind of groups Mr X likes.” The ten triads used are shown in Figure 2. Each group is represented as a 3 by 3 matrix where rows represent cards and columns show values along the three dimensions. 
[Figure 2: Human responses and model predictions for the ten triads in Experiment 1. Panels: (a) D1 value always 3; (b) D2 uniform; (c) D2 and D3 aligned; (d) D1 and D3 anti-aligned; (e) Two dimensions aligned; (f) Two dimensions anti-aligned; (g) All dimensions uniform; (h) Some dimension uniform; (i) All dimensions have no repeats; (j) Some dimension has no repeats. The plot at the left of each panel shows model predictions (white bars) and human preferences (black bars) for the two choice groups in each triad. The plots at the right of each panel summarize the groups created during the generation phase. The 23 elements along the x-axis correspond to the regularities listed in Table 2.]
Table 2: Regularities used to code responses to the generation tasks in Experiments 1 and 2.
1. All dimensions aligned
2. Two dimensions aligned
3. D1 and D2 aligned
4. D1 and D3 aligned
5. D2 and D3 aligned
6. All dimensions aligned or anti-aligned
7. Two dimensions anti-aligned
8. D1 and D2 anti-aligned
9. D1 and D3 anti-aligned
10. D2 and D3 anti-aligned
11. All dimensions have no repeats
12. Two dimensions have no repeats
13. One dimension has no repeats
14. D1 has no repeats
15. D2 has no repeats
16. D3 has no repeats
17. All dimensions uniform
18. Two dimensions uniform
19. One dimension uniform
20. D1 uniform
21. D2 uniform
22. D3 uniform
23. D1 value is always 3
Triad 1, for example, has an example group including three cards that each take value 3 along D1. The first choice group is consistent with this regularity but the second choice group is not. The cards in each group were arrayed vertically on screen, and were initially sorted as shown in Figure 2 (i.e. first by D3, then by D2 and then by D1). The cards could be dragged around on screen, and participants were invited to move them around in order to help them understand each group. The mapping between the three dimensions in each matrix and the three dimensions in the experiment (color, position, and size) was randomized across participants, and the order in which triads were presented was also randomized.
Model predictions and results. Let ge be the example group presented in the triad task and g1 and g2 be the two choice groups. We use our model to compute the relative probability of two hypotheses: h1, which states that ge and g1 are generated from the same schema and that g2 is sampled randomly from all possible groups, and h2, which states that ge and g2 are generated from the same schema. We set P(h1) = P(h2) = 0.5, and compute posterior probabilities P(h1|ge, g1, g2) and P(h2|ge, g1, g2) by integrating over all schemata in the hypothesis space already described.
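The triad computation just described, posteriors over h1 and h2 obtained by summing over schemata, can be sketched on a deliberately tiny stand-in domain. The three toy schemata and the one-value-per-card groups below are illustrative assumptions, not the paper's hypothesis space:

```python
import itertools
from fractions import Fraction

# Triad task as Bayesian model comparison on a tiny domain: a group is
# 3 cards, each just one value in {1, 2, 3}.
GROUPS = list(itertools.product((1, 2, 3), repeat=3))

SCHEMATA = {
    "all_same":      lambda g: len(set(g)) == 1,
    "all_different": lambda g: len(set(g)) == 3,
    "anything":      lambda g: True,
}
PRIOR = {s: Fraction(1, 3) for s in SCHEMATA}   # uniform schema prior

def likelihood(group, schema):
    # Eq. (1): uniform over the groups consistent with the schema.
    consistent = [g for g in GROUPS if SCHEMATA[schema](g)]
    return Fraction(1, len(consistent)) if SCHEMATA[schema](group) else Fraction(0)

def triad_posterior(ge, g1, g2):
    # h1: ge and the shared group come from one schema; the other group is
    # drawn uniformly at random (a constant factor, identical under h1 and h2).
    def evidence(shared):
        joint = sum(PRIOR[s] * likelihood(ge, s) * likelihood(shared, s)
                    for s in SCHEMATA)
        return joint * Fraction(1, len(GROUPS))
    e1, e2 = evidence(g1), evidence(g2)
    return e1 / (e1 + e2)                        # P(h1 | ge, g1, g2)

# ge has all-equal cards; g1 shares that regularity, g2 does not.
p_h1 = triad_posterior(ge=(2, 2, 2), g1=(3, 3, 3), g2=(1, 2, 2))
assert p_h1 > Fraction(1, 2)                     # model prefers g1
```

With these numbers the posterior works out to 82/83, i.e. the model strongly prefers the choice group that shares the "all same" regularity with the example.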
Our model assumes that two groups are considered similar to the extent that they appear to have been generated by the same underlying schema, and is consistent with the generative approach to similarity described by Kemp et al. [19]. Model predictions for the ten triads are shown in Figure 2. In each case, the choice probabilities plotted (white bars) are the posterior probabilities of hypotheses h1 and h2. In nine out of ten cases the best choice according to the model is the most common human response. Responses to triads 2c and 2d support the idea that people are sensitive to relationships between dimensions (i.e. alignment and anti-alignment). Triads 2e and 2f are similar to triads studied by Kotovsky and Gentner [1], and we replicate their finding that people are sensitive to relationships between dimensions even when the dimensions involved vary from group to group. The one case where human responses diverge from model predictions is shown in Figure 2h. Note that the schema for this triad involves existential quantification over dimensions (some dimension is uniform), and according to our prior P(s) this kind of quantification is no more complex than other kinds of quantification. Future applications of our approach can explore the idea that existential quantification over dimensions (∃Q) is psychologically more complex than universal quantification over dimensions (∀Q) or existential quantification over cards (∃x), and can consider logical languages that incorporate this inductive bias. To model the generation phase of the experiment we computed the posterior distribution
P(gnew|ge, g1, g2) = Σ_{s,h} P(gnew|s) P(s|h, ge, g1, g2) P(h|ge, g1, g2),
where P(h|ge, g1, g2) is the distribution used to model selections in the triad task. Since the space of possible groups is large, we visualize this distribution using a profile that shows the posterior probability assigned to groups consistent with the 23 regularities shown in Table 2.
The white bar plots in Figure 2 show profiles predicted by the model, and the black plots immediately above show profiles computed over the groups generated by our 18 participants. In many of the 10 cases the model accurately predicts regularities in the groups generated by people. In case 2c, for example, the model correctly predicts that generated groups will tend to have no repeats along dimensions D2 and D3 (regularities 15 and 16) and that these two dimensions will be aligned (regularities 2 and 5). There are, however, some departures from the model’s predictions, and a notable example occurs in case 2d. Here the model detects the regularity that dimensions D1 and D3 are anti-aligned (regularity 9). Some groups generated by participants are consistent with this regularity, but people also regularly generate groups where two dimensions are aligned rather than anti-aligned (regularity 2). This result may indicate that some participants are sensitive to relationships between dimensions but do not consider the difference between a positive relationship (alignment) and an inverse relationship (anti-alignment) especially important. Kotovsky and Gentner [1] suggest that comparison can explain how people respond to triad tasks, although they do not provide a computational model that can be compared with our approach. It is less clear how comparison might account for our generation data, and our next experiment considers a one-shot generation task that raises even greater challenges for a comparison-based approach.
3 Experiment 2: One-shot schema learning
As described already, comparison involves constructing mappings between pairs of category instances. In some settings, however, learners make confident inferences given a single instance of a category [15, 20], and it is difficult to see how comparison could play a major role when only one instance is available. Models that rely on abstraction, however, can naturally account for one-shot relational learning, and we designed a second experiment to evaluate this aspect of our approach.
[Figure 3: Human responses and model predictions for the six cases in Experiment 2. Panels: (a) All dimensions aligned; (b) D2 and D3 aligned; (c) D1 has no repeats, D2 and D3 uniform; (d) D2 uniform; (e) All dimensions uniform; (f) All dimensions have no repeats. In (a) and (b), the 4 cards used for the completion and generation phases are shown on either side of the dashed line (completion cards on the left). In the remaining cases, the same 4 cards were used for both phases. The plots at the right of each panel show model predictions (white bars) and human responses (black bars) for the generation task. In each case, the 23 elements along each x-axis correspond to the regularities listed in Table 2. The remaining plots show responses to the completion task. There are 125 possible responses, and the four responses shown always include the top two human responses and the top two model predictions.]
Several previous studies have explored one-shot relational learning. Holyoak and Thagard [21] developed a study of analogical reasoning using stories as stimuli and found little evidence of one-shot schema learning. Ahn et al. [11] demonstrated, however, that one-shot learning can be achieved with complex materials such as stories, and modeled this result using explanation-based learning. Here we use much simpler stimuli and explore a probabilistic approach to one-shot learning.
Materials and Method. 18 adults participated for course credit. The same individuals completed Experiments 1 and 2, and Experiment 2 was always run before Experiment 1. The same computer interface was used in both experiments, and the only important difference was that the figures in Experiment 2 could now take five values along each dimension rather than three. The experiment included two phases. During the generation phase, participants saw a 4-card group that Mr X liked and were asked to generate two 5-card groups that Mr X would probably like. During the completion phase, participants were shown four members of a 5-card group and were asked to generate the missing card. The stimuli used in each phase are shown in Figure 3. In the first two cases, slightly different stimuli were used in the generation and completion phases, and in all remaining cases the same set of four cards was used in both cases. All participants responded to the six generation questions before answering the six completion questions. Model predictions and results. The generation phase is modeled as in Experiment 1, but now the posterior distribution P(gnew|ge) is computed after observing a single instance of a category. The human responses in Figure 3 (black bars) are consistent with the model in all cases, and confirm that a single example can provide sufficient evidence for learners to acquire a relational category. For example, the most common response in case 3a was the 5-card group shown in Figure 1—a group with all three dimensions aligned. To model the completion phase, let oe represent a partial observation of group ge. Our model infers which card is missing from ge by computing the posterior distribution P(ge|oe) ∝ P(oe|ge) Σ_s P(ge|s) P(s), where P(oe|ge) captures the idea that oe is generated by randomly concealing one component of ge.
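The completion posterior P(ge|oe) ∝ P(oe|ge) Σ_s P(ge|s) P(s) can be sketched on the same kind of tiny stand-in domain as before; the schemata and one-value-per-card groups are illustrative assumptions, not the paper's hypothesis space:

```python
import itertools
from fractions import Fraction

# Completion task: infer the concealed card from a partial observation.
# Tiny domain: a group is 3 cards, each one value in {1, 2, 3}; the
# observation oe hides exactly one card (marked None).
GROUPS = list(itertools.product((1, 2, 3), repeat=3))
SCHEMATA = {
    "all_same":      lambda g: len(set(g)) == 1,
    "all_different": lambda g: len(set(g)) == 3,
    "anything":      lambda g: True,
}
PRIOR = {s: Fraction(1, 3) for s in SCHEMATA}

def p_group(g):
    # sum_s P(g | s) P(s), with P(g | s) uniform over consistent groups.
    total = Fraction(0)
    for name, holds in SCHEMATA.items():
        if holds(g):
            consistent = [h for h in GROUPS if holds(h)]
            total += PRIOR[name] * Fraction(1, len(consistent))
    return total

def p_obs_given_group(oe, g):
    # oe conceals exactly one position, chosen uniformly at random.
    hidden = [i for i, v in enumerate(oe) if v is None]
    if len(hidden) != 1:
        return Fraction(0)
    ok = all(v is None or v == g[i] for i, v in enumerate(oe))
    return Fraction(1, len(g)) if ok else Fraction(0)

def complete(oe):
    scores = {g: p_obs_given_group(oe, g) * p_group(g) for g in GROUPS}
    return max(scores, key=scores.get)

# Two visible equal cards: the "all same" completion is most probable.
assert complete((2, 2, None)) == (2, 2, 2)
```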
The white bars in Figure 3 show model predictions, and in five out of six cases the best response according to the model is the same as the most common human response. In the remaining case (Figure 3d) the model generates a diffuse distribution over all cards with value 3 on dimension 2, and all human responses satisfy this regularity.
4 Conclusion
We presented a generative model that helps to explain how relational categories are learned and used. Our approach captures relational regularities using a logical language, and helps to explain how schemata formulated in this language can be learned from observed data. Our approach differs in several respects from previous accounts of relational categorization [1, 5, 10, 22]. First, we focus on abstraction rather than comparison. Second, we consider tasks where participants must generate examples of categories [16] rather than simply classify existing examples. Finally, we provide a formal account that helps to explain how relational categories can be learned from a single instance. Our approach can be developed and extended in several ways. For simplicity, we implemented our model by working with a finite space of several million schemata, but future work can consider hypothesis spaces that assign non-zero probability to all regularities that can be formulated in the language we described. The specific logical language used here is only a starting point, and future work can aim to develop languages that provide a more faithful account of human inductive biases. Finally, we worked with a domain that provides one of the simplest ways to address core questions such as one-shot learning. Future applications of our general approach can consider domains that include more than three dimensions and a richer space of relational regularities. Relational learning and analogical reasoning are tightly linked, and hierarchical generative models provide a promising approach to both problems.
We focused here on relational categorization, but future studies can explore whether probabilistic accounts of schema learning can help to explain the inductive inferences typically considered by studies of analogical reasoning. Although there are many models of analogical reasoning, there are few that pursue a principled probabilistic approach, and the hierarchical Bayesian approach may help to fill this gap in the literature.

Acknowledgments

We thank Maureen Satyshur for running the experiments. This work was supported in part by NSF grant CDI-0835797.

References

[1] L. Kotovsky and D. Gentner. Comparison and categorization in the development of relational similarity. Child Development, 67:2797–2822, 1996.
[2] D. Gentner and A. B. Markman. Structure mapping in analogy and similarity. American Psychologist, 52:45–56, 1997.
[3] D. Gentner and J. Medina. Similarity and the development of rules. Cognition, 65:263–297, 1998.
[4] B. Falkenhainer, K. D. Forbus, and D. Gentner. The structure-mapping engine: algorithm and examples. Artificial Intelligence, 41:1–63, 1989.
[5] J. E. Hummel and K. J. Holyoak. A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110:220–264, 2003.
[6] M. Mitchell. Analogy-making as perception: a computer model. MIT Press, Cambridge, MA, 1993.
[7] D. R. Hofstadter and the Fluid Analogies Research Group. Fluid concepts and creative analogies: computer models of the fundamental mechanisms of thought. 1995.
[8] W. V. O. Quine and J. Ullian. The Web of Belief. Random House, New York, 1978.
[9] J. Skorstad, D. Gentner, and D. Medin. Abstraction processes during concept learning: a structural view. In Proceedings of the 10th Annual Conference of the Cognitive Science Society, pages 419–425, 1988.
[10] D. Gentner and J. Loewenstein. Relational language and relational thought. In E. Amsel and J. P. Byrnes, editors, Language, literacy and cognitive development: the development and consequences of symbolic communication, pages 87–120. 2002.
[11] W. Ahn, W. F. Brewer, and R. J. Mooney. Schema acquisition from a single example. Journal of Experimental Psychology: Learning, Memory and Cognition, 18(2):391–412, 1992.
[12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis. Chapman & Hall, New York, 2nd edition, 2003.
[13] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008.
[14] S. Kok and P. Domingos. Learning the structure of Markov logic networks. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[15] J. Feldman. The structure of perceptual categories. Journal of Mathematical Psychology, 41:145–170, 1997.
[16] A. Jern and C. Kemp. Category generation. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 130–135. Cognitive Science Society, Austin, TX, 2009.
[17] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994.
[18] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001.
[19] C. Kemp, A. Bernstein, and J. B. Tenenbaum. A generative theory of similarity. In B. G. Bara, L. Barsalou, and M. Bucciarelli, editors, Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1132–1137. Lawrence Erlbaum Associates, 2005.
[20] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Theory acquisition and the language of thought. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 1606–1611. Cognitive Science Society, Austin, TX, 2008.
[21] K. J. Holyoak and P. Thagard. Analogical mapping by constraint satisfaction. Cognitive Science, 13(3):295–355, 1989.
[22] L. A. A. Doumas, J. E. Hummel, and C. M. Sandhofer. A theory of the discovery and predication of relational concepts. Psychological Review, 115(1):1–43, 2008.
[23] M. L. Gick and K. J. Holyoak. Schema induction and analogical transfer. Cognitive Psychology, 15:1–38, 1983.
3 0.61844802 247 nips-2009-Time-rescaling methods for the estimation and assessment of non-Poisson neural encoding models
Author: Jonathan W. Pillow
Abstract: Recent work on the statistical modeling of neural responses has focused on modulated renewal processes in which the spike rate is a function of the stimulus and recent spiking history. Typically, these models incorporate spike-history dependencies via either: (A) a conditionally-Poisson process with rate dependent on a linear projection of the spike train history (e.g., generalized linear model); or (B) a modulated non-Poisson renewal process (e.g., inhomogeneous gamma process). Here we show that the two approaches can be combined, resulting in a conditional renewal (CR) model for neural spike trains. This model captures both real-time and rescaled-time history effects, and can be fit by maximum likelihood using a simple application of the time-rescaling theorem [1]. We show that for any modulated renewal process model, the log-likelihood is concave in the linear filter parameters only under certain restrictive conditions on the renewal density (ruling out many popular choices, e.g. gamma with shape κ = 1), suggesting that real-time history effects are easier to estimate than non-Poisson renewal properties. Moreover, we show that goodness-of-fit tests based on the time-rescaling theorem [1] quantify relative-time effects, but do not reliably assess accuracy in spike prediction or stimulus-response modeling. We illustrate the CR model with applications to both real and simulated neural data. 1
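The time-rescaling idea in this abstract can be sketched in a few lines: if the conditional intensity λ(t) is correct, the rescaled inter-spike intervals Λ(t_k) − Λ(t_{k−1}) are i.i.d. Exponential(1), so 1 − exp(−isi) should be uniform on (0, 1) and can feed a KS-style goodness-of-fit check. A minimal sketch, assuming a rate sampled on a uniform grid (names are ours, not the paper's code):

```python
import math

def rescale_to_uniform(spike_times, rate, dt):
    # Cumulative intensity Lambda(t) via a discrete integral of the rate grid.
    cum = [0.0]
    for r in rate:
        cum.append(cum[-1] + r * dt)
    # Rescaled spike times Lambda(t_k); under a correct model their successive
    # differences are Exponential(1).
    rescaled = [cum[int(round(t / dt))] for t in spike_times]
    isis = [b - a for a, b in zip(rescaled, rescaled[1:])]
    # Map Exponential(1) samples to (0, 1); uniformity indicates a good fit.
    return [1.0 - math.exp(-z) for z in isis]
```

For a constant unit rate, a spike train with one-second gaps maps to values near 1 − e^(−1) ≈ 0.632, as expected for unit-mean exponential intervals.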
4 0.57400799 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang
Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms. 1
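To make the idea of a controllable regularization strength concrete, here is a hedged sketch of a TSVM-style objective with a knob lam in [0, 1] — our own simplified parametrization, not the authors' exact formulation. Setting lam = 0 recovers the supervised SVM loss, and lam = 1 applies the full unlabeled-data penalty.

```python
def tsvm_objective(w, b, X_lab, y_lab, X_unlab, C=1.0, lam=0.5):
    # Linear decision function f(x) = <w, x> + b.
    f = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
    reg = 0.5 * sum(wi * wi for wi in w)                    # margin term
    lab = sum(max(0.0, 1.0 - y * f(x)) for x, y in zip(X_lab, y_lab))
    # Symmetric hinge on unlabeled points pushes them away from the margin;
    # lam interpolates between no unlabeled regularization (SVM) and full TSVM.
    unlab = sum(max(0.0, 1.0 - abs(f(x))) for x in X_unlab)
    return reg + C * lab + lam * C * unlab
```

A data-dependent scheme like the one the abstract proposes would choose lam (possibly per point) from the unlabeled data rather than fixing it globally.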
5 0.39574704 104 nips-2009-Group Sparse Coding
Author: Samy Bengio, Fernando Pereira, Yoram Singer, Dennis Strelow
Abstract: Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification. 1
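The mixed-norm regularizer this abstract describes can be sketched in a few lines. This is illustrative only: A is our notation for one image's coefficient matrix, with rows indexed by dictionary words and columns by the image's visual descriptors.

```python
import math

def l1_l2_penalty(A):
    # Sum over dictionary words of the l2 norm of that word's coefficients
    # across all of the image's descriptors. Driving a whole row to zero
    # removes the word from the image's representation, which yields
    # sparsity at the image level rather than per descriptor.
    return sum(math.sqrt(sum(a * a for a in row)) for row in A)
```

Because the penalty is an l1 norm over rows and an l2 norm within each row, optimization tends to zero out entire rows while leaving the surviving rows dense, which is exactly the image-level sparsity the abstract is after.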
6 0.39259863 3 nips-2009-AUC optimization and the two-sample problem
7 0.39133731 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
8 0.39030716 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA
9 0.38925493 137 nips-2009-Learning transport operators for image manifolds
10 0.38915396 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning
11 0.38789296 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
12 0.38685119 70 nips-2009-Discriminative Network Models of Schizophrenia
13 0.38673621 87 nips-2009-Exponential Family Graph Matching and Ranking
14 0.38662952 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
15 0.3860127 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
16 0.38563737 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization
17 0.38513374 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
18 0.38490042 260 nips-2009-Zero-shot Learning with Semantic Output Codes
19 0.38481098 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds
20 0.38442114 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data