nips nips2005 nips2005-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daichi Mochihashi, Yuji Matsumoto
Abstract: Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has almost been neglected so far and a na¨ve bag-of-words history has been ı employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order. 1
Reference: text
sentIndex sentText sentNum sentScore
1 jp Abstract Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. [sent-5, score-0.282]
2 However, the problem of context length has almost been neglected so far and a na¨ve bag-of-words history has been ı employed in natural language processing. [sent-6, score-0.35]
3 In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. [sent-7, score-0.411]
4 We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. [sent-8, score-0.679]
5 Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order. [sent-9, score-0.092]
6 We infer the context in which we are involved to make an adaptive linguistic response by selecting an appropriate model from that information. [sent-11, score-0.126]
7 In natural language processing research, such models are called long-distance language models that incorporate distant effects of previous words over the short-term dependencies between a few words, which are called n-gram models. [sent-12, score-0.524]
8 Besides apparent application in speech recognition and machine translation, we note that many problems of discrete data processing reduce to language modeling, such as information retrieval [1], Web navigation [2], human-machine interaction or collaborative filtering and recommendation [3]. [sent-13, score-0.21]
9 From the viewpoint of signal processing or control theory, context modeling is clearly a filtering problem that estimates the states of a system sequentially along time to predict the outputs according to them. [sent-14, score-0.183]
10 The inherent difficulties that have prevented filtering approaches to language modeling are its discreteness and high dimensionality, which precludes Kalman filters and their extensions that are all designed for vector spaces and distributions like Gaussians. [sent-16, score-0.283]
11 As we note in the following, ordinary discrete HMMs are not powerful enough for this purpose because their true state is restricted to a single hidden component [6]. [sent-17, score-0.098]
12 Therefore, to track context shifts, we need a model that describes changes of multinomial distributions. [sent-22, score-0.372]
13 One model for this purpose is a multinomial extension to the Mean shift model (MSM) recently proposed in the field of statistics [7]. [sent-23, score-0.395]
14 In discrete HMMs, the true state is one of M components and we estimate it stochastically as a multinomial over the M components. [sent-25, score-0.353]
15 On the other hand, since the true state here is itself a multinomial over the components, we estimate it stochastically as (possibly a mixture of) a Dirichlet distribution, a distribution of multinomial distributions on the (M − 1)-simplex. [sent-26, score-0.774]
16 However, because the true state here is a multinomial over the latent variables, there are dependencies between the states that are assumed independent in the FHMM. [sent-28, score-0.366]
17 Below, we briefly introduce a multinomial Mean shift model following [7] and an associated solution using a Particle Filter. [sent-29, score-0.326]
18 2 Multinomial Mean Shift Model The MSM is a generative model that describes the intermittent changes of hidden states and outputs according to them. [sent-31, score-0.095]
19 Although there is a corresponding counterpart using Normal distribution that was first introduced [8, 9], here we concentrate on a multinomial extension of MSM, following [7] for DNA sequence modeling. [sent-32, score-0.352]
20 In a multinomial MSM, we assume time-dependent true multinomials θt that may change occasionally and the following generative model for the discrete outputs yt = y1 y2 . [sent-33, score-0.715]
21 yt (yt ∈ Σ ; Σ is a set of symbols) according to θ1 θ2 . [sent-36, score-0.258]
22 θt : θt ∼ Dir(α) with probability ρ = θt−1 with probability (1−ρ) , (1) yt ∼ Mult(θt ) where Dir(α) and Mult(θ) are a Dirichlet and multinomial distribution with parameters α and θ, respectively. [sent-39, score-0.545]
23 This model first draws a multinomial θ from Dir(α) and samples output y according to θ for a certain interval. [sent-41, score-0.304]
24 When a change point occurs with probability ρ, a new θ is sampled again from Dir(α) and subsequent y is sampled from the new θ. [sent-42, score-0.179]
25 This process continues recursively throughout which neither θt nor the change points are known to us; all we know is the output sequence yt . [sent-43, score-0.332]
26 However, if we know that the change has occurred at time c, y can be predicted exactly. [sent-44, score-0.134]
27 Let It be a binary variable that represents whether a change occurred at time t: that is, It = 1 means there was a change at t (θt = θt−1 ), and It = 0 means there was no change (θt = θt−1 ). [sent-45, score-0.332]
28 p(yt+1 = y | yt , Ic = 1, Ic+1 = · · · = It = 0) = = p(y|θ)p(θ|yc · · · yt )dθ αy + n y , y (αy +ny ) (2) (3) where αy is the y’th element of α and ny is the number of occurrences of y in yc · · · yt . [sent-62, score-0.781]
29 Therefore, the essence of this problem lies in how to detect a change point given the data up to time t, a change point problem in discrete space. [sent-63, score-0.234]
30 3 Multinomial Particle Filter The prediction problem above can be solved by the efficient Particle Filter algorithm shown in Figure 1, graphically displayed in Figure 2 (excluding prior updates). [sent-66, score-0.094]
31 p(It |It−1 , yt ) ∝ p(It , yt |It−1 , yt−1 ) = p(yt |yt−1 , It−1 , It )p(It |It−1 ) = p(yt |yt−1 , It−1 , It = 1)p(It = 1|It−1 ) =: f (t) p(yt |yt−1 , It−1 , It = 0)p(It = 0|It−1 ) =: g(t) (5) (6) leading p(It = 1|It−1 , yt ) = f (t)/(f (t) + g(t)) p(It = 0|It−1 , yt ) = g(t)/(f (t) + g(t)) . [sent-73, score-0.932]
32 (7) In Expression (5), the first term is a likelihood of observation yt when It has been fixed, which can be obtained through (3). [sent-74, score-0.233]
33 However, when we endow ρ with a prior Beta distribution Be(α, β), posterior estimate of ρt given the binary change point history It−1 can be obtained using the number of 1’s in It−1 , nt−1 (1), following a standard Bayesian method: α + nt−1 (1) E[ρt |It−1 ] = . [sent-76, score-0.226]
34 (8) α+β+t−1 This means that we can estimate a “rate of topic shifts” as time proceeds in a Bayesian fashion. [sent-77, score-0.154]
35 The above algorithm runs for each observation yt (t = 1 . [sent-79, score-0.233]
36 If we observe a “strange” word that is more predictable from the prior than the contextual distribution, (6) makes f (t) larger than g(t), which leads to a higher probability that It = 1 will be sampled in the Bernoulli trial of Algorithm 1(b). [sent-83, score-0.334]
37 The first problem is that in a natural language the number of words is extremely large. [sent-88, score-0.299]
38 As opposed to DNA, which has only four letters of A/T/G/C, a natural language usually contains a minimum of some tens of thousands of words and there are strong correlations between them. [sent-89, score-0.299]
39 For example, if “nurse” follows “hospital”we believe that there has been no context shift; however, if “university” follows “hospital,” the context probably has been shifted to a “medical school” subtopic, even though the two words are equally distinct from “hospital. [sent-90, score-0.267]
40 However, the original multinomial MSM cannot capture this relationship because it treats the words independently. [sent-92, score-0.36]
41 To incorporate this relationship, we require an extensive prior knowledge of words as a probabilistic model. [sent-93, score-0.161]
42 The second problem is that in model equation (1), the hyperparameter α of prior Dirichlet distribution of the latent multinomials is assumed to be known. [sent-94, score-0.214]
43 In the case of natural language, this means we know beforehand what words or topics will be spoken for all the texts. [sent-95, score-0.191]
44 Apparently, this is not a natural assumption: we need an online estimation of α as well when we want to extend MSM to natural languages. [sent-96, score-0.115]
45 To solve these problems, we extended a multinomial MSM using two probabilistic text models, LDA and DM. [sent-97, score-0.386]
46 1 MSM-LDA Latent Dirichlet Allocation (LDA) [3] is a probabilistic text model that assumes a hidden multinomial topic distribution θ over the M topics on a document d to estimate it stochastically as a Dirichlet distribution p(θ|d). [sent-100, score-0.768]
47 Context modeling using LDA [5] regards a history h = w1 . [sent-101, score-0.075]
48 wh as a pseudo document and estimates a variational approximation q(θ|h) of a topic distribution p(θ|h) through a variational Bayes EM algorithm on a document [3]. [sent-104, score-0.406]
49 After obtaining topic distribution q(θ|h), we can predict the next word as follows. [sent-105, score-0.389]
50 p(y|h) = p(y|θ)q(θ|h)dθ = M i=1 p(y|θi ) θi q(θ|h) (9) When we use this prediction with an associated VB-EM algorithm in place of the na¨ve ı Dirichlet model (3) of MSM, we get an MSM-LDA that tracks a latent topic distribution θ instead of a word distribution. [sent-106, score-0.489]
51 Since each particle computes a Dirichlet posterior of topic distribution, the final topic distribution of MSM-LDA is a mixture of Dirichlet distributions for predicting the next word through (4) and (9) as shown in Figure 3(a). [sent-107, score-1.012]
52 Note that MSMLDA has an implicit generative model corresponding to (1) in topic space. [sent-108, score-0.191]
53 However, here we use a conditional model where LDA parameters are already known in order to estimate the context online. [sent-109, score-0.093]
54 As seen in Figure 2, each particle has a history that has been segmented into pseudo “documents” d1 . [sent-111, score-0.527]
55 Since each pseudo “document” has a Dirichlet posterior q(θ|di ) (i = 1 . [sent-115, score-0.107]
56 Note that this computation needs only be run when a change point has been sampled. [sent-119, score-0.099]
57 For this purpose, only the sufficient statistics q(θ|di ) must be stored for each particle to render itself an online algorithm. [sent-120, score-0.38]
58 Note in passing that MSM-LDA is a model that only tracks a mixing distribution of a mixture model. [sent-121, score-0.198]
59 Therefore, in principle this model is also applicable to other mixture models, e. [sent-122, score-0.105]
60 Gaussian mixtures, where mixing distribution is not static but evolves according to (1). [sent-124, score-0.089]
61 Time 1 Particle#1 Particle #2 Particle #N Time 1 Context change points Particle #2 Particle #N Context change points prior t prior t Dirichlet distribution wn Word Simplex Topic Subsimplex w1 Particle#1 w2 0. [sent-125, score-0.409]
62 02 Expectation of Unigram Dirichlet distribution Unigram distributions Weights Mixture of Mixture of Dirichlet distributions next word next word (b) MSM-DM (a) MSM-LDA Figure 3: MSM-LDA and MSM-DM in work. [sent-131, score-0.459]
63 However, in terms of multinomial estimation, this generality has a drawback because it uses a lower-dimensional topic representation to predict the next word, which may cause a loss of information. [sent-132, score-0.462]
64 In contrast, MSM-DM is a model that works directly on the word space to predict the next word with no loss of information. [sent-133, score-0.375]
65 2 MSM-DM Dirichlet Mixtures (DM) [11] is a novel Bayesian text model that has the lowest perplexity reported so far in context modeling. [sent-135, score-0.29]
66 DM uses no intermediate “topic” variables, but places a mixture of Dirichlet distributions directly on the word simplex to model word correlations. [sent-136, score-0.544]
67 Specifically, DM assumes the following generative model for a document w = w 1 . [sent-137, score-0.093]
68 where p is a V -dimensional unigram distribution over words, α1 . [sent-151, score-0.24]
69 αM = αM are pa1 rameters of Dirichlet prior distributions of p, and λ is a M -dimensional prior mixing distribution of them. [sent-154, score-0.214]
70 , wD }, parameters λ and αM can be iteratively estimated by a com1 bination of EM algorithm and the modified Newton-Raphson method shown in Figure 5, which is a straight extension to the estimation of a Polya mixture [13]. [sent-159, score-0.145]
71 2 DM is an extension to the model for amino acids [14] to natural language with a huge number of parameters, which precludes the ordinary Newton-Raphson algorithm originally proposed in [14]. [sent-162, score-0.291]
72 V E step: M step: p(m|wi ) ∝ λm λm ∝ Γ( v αmv ) Γ(αmv +niv ) Γ( v αmv + v niv ) v=1 Γ(αmv ) D i=1 p(m|wi ) , (14) i p(m|wi ) niv /(αmv + niv − 1) p(m|wi ) v niv /( v αmv + v niv − 1) i αmv = αmv · (13) (15) Figure 5: EM-Newton algorithm of Dirichlet Mixtures. [sent-163, score-0.54]
73 This prediction can also be considered an extension to Dirichlet smoothing [15] with multiple hyperparameters αm to weigh them accordingly by Cm . [sent-165, score-0.106]
74 3 When we replace a na¨ve Dirichlet model (3) by a DM prediction (10), we get a flexible ı MSM-DM dynamic model that works on word simplex directly. [sent-166, score-0.265]
75 Since the original multinomial MSM places a Dirichlet prior in the model (1), MSM-DM is considered a natural extension to MSM by placing a mixture of Dirichlet priors rather than a single Dirichlet prior for multinomial unigram distribution. [sent-167, score-1.064]
76 Because each particle calculates a mixture of Dirichlet posteriors for the current context, the final MSM-DM estimate is a mixture of them, again a mixture of Dirichlet distributions as shown in Figure 3(b). [sent-168, score-0.708]
77 In this case, we can also update the mixture prior λ sequentially. [sent-169, score-0.188]
78 wc segmented by change points individually, posterior λm can be obtained similarly as (14), c λm ∝ i=1 p(m|wi ) (12) where p(m|wi ) is obtained from (13). [sent-173, score-0.127]
79 We randomly selected 100 files of BNC written texts as an evaluation set, and the remaining 2,943 files as a training set for parameter estimation of LDA and DM in advance. [sent-178, score-0.12]
80 1 Training and evaluation data Since LDA and DM did not converge on the long texts like BNC, we divided training texts into pseudo documents with a minimum of ten sentences for parameter estimation. [sent-180, score-0.432]
81 Due to the huge size of BNC, we randomly selected a maximum of 20 pseudo documents from each of the 2,943 files to produce a final corpus of 56,939 pseudo documents comprising 11,032,233 words. [sent-181, score-0.435]
82 We used a lexicon of 52,846 words with a frequency ≥ 5. [sent-182, score-0.081]
83 Since the proposed method is an algorithm that simultaneously captures topic shifts and their rate in a text to predict the next word, we need evaluation texts that have different rates of topic shifts. [sent-187, score-0.616]
84 For this purpose, we prepared four different text sets by sampling 3 Therefore, MSM-DM is considered an ingenious dynamic Dirichlet smoothing as well as a context modeling. [sent-188, score-0.227]
85 (4) Continue steps (2) and (3) until a desired length of text is obtained. [sent-215, score-0.107]
86 We sampled 100 sentences from each of the 100 files by this procedure to create the four evaluation text sets listed in the table. [sent-217, score-0.233]
87 4 The number of particles is set to N = 20, a relatively small number because each particle executes an exTable 1: Types of Evaluation Texts. [sent-220, score-0.382]
88 act Bayesian prediction once previous change points have been sampled. [sent-221, score-0.138]
89 Beta prior distribution of context change can be initialized as a uniform distribution, (α, β) = (1, 1). [sent-222, score-0.28]
90 However, based on a preliminary experiment we set it to (α, β) = (1, 50): this means we initially assume a context change rate of once every 50 words in average, which will be updated adaptively. [sent-223, score-0.273]
91 3 Experimental results Table 2 shows the unigram perplexity of contextual prediction for each type of evaluation set. [sent-225, score-0.445]
92 While MSM-LDA slightly improves LDA due to the topic space compression explained in Section 3. [sent-227, score-0.154]
93 1, MSM-DM yields a consistently better prediction, and its performance is more significant for texts whose subtopics change faster. [sent-228, score-0.176]
94 We can see that prediction improves for most documents by automatically selecting appropriate contexts. [sent-230, score-0.124]
95 Finally, we show in Figure 7 a sequential plot of context change probabilities p (i) (It = 1) (i = 1. [sent-232, score-0.192]
96 T ) calculated by each particle for the first 1,000 words of one of the evaluation texts. [sent-236, score-0.477]
97 5 Conclusion and Future Work In this paper, we extended the multinomial Particle Filter of a small number of symbols to natural language with an extremely large number of symbols. [sent-237, score-0.497]
98 1 0 100 200 300 Perplexity reduction 15 10 0 1000 Particle 5 900 800 700 600 500 Time 400 300 200 100 0 0 Figure 6: Perplexity reductions of MSM Figure 7: Context change probabilities for 1,000 words text, sampled by the particles. [sent-247, score-0.22]
99 According to this model, prediction is made using a mixture of different context lengths sampled by each Monte Carlo particle. [sent-250, score-0.277]
100 Although the proposed method is still in its fundamental stage, we are planning to extend it to larger units of change points beyond words, and to use a forward-backward MCMC or Expectation Propagation to model a semantic structure of text more precisely. [sent-251, score-0.244]
wordName wordTfidf (topN-words)
[('dirichlet', 0.387), ('particle', 0.353), ('multinomial', 0.279), ('msm', 0.249), ('yt', 0.233), ('dm', 0.215), ('unigram', 0.207), ('mv', 0.184), ('language', 0.174), ('word', 0.173), ('topic', 0.154), ('lda', 0.146), ('bnc', 0.124), ('niv', 0.108), ('pseudo', 0.107), ('text', 0.107), ('mixture', 0.105), ('dir', 0.104), ('change', 0.099), ('context', 0.093), ('perplexity', 0.09), ('documents', 0.085), ('words', 0.081), ('texts', 0.077), ('wn', 0.068), ('filter', 0.068), ('contextual', 0.066), ('mult', 0.061), ('latent', 0.061), ('dna', 0.058), ('wt', 0.057), ('mixtures', 0.056), ('document', 0.056), ('hmm', 0.055), ('prior', 0.055), ('simplex', 0.053), ('shifts', 0.052), ('corpus', 0.051), ('wi', 0.051), ('shift', 0.047), ('les', 0.046), ('natural', 0.044), ('sentences', 0.043), ('evaluation', 0.043), ('chronological', 0.041), ('hospital', 0.041), ('mikio', 0.041), ('nara', 0.041), ('ppl', 0.041), ('yc', 0.041), ('ltering', 0.041), ('ny', 0.041), ('extension', 0.04), ('sampled', 0.04), ('distributions', 0.04), ('prediction', 0.039), ('history', 0.039), ('bayesian', 0.038), ('stochastically', 0.038), ('semantic', 0.038), ('generative', 0.037), ('modeling', 0.036), ('discrete', 0.036), ('nv', 0.036), ('occurred', 0.035), ('thomas', 0.035), ('topics', 0.035), ('cm', 0.034), ('hyperparameter', 0.034), ('hidden', 0.033), ('linguistic', 0.033), ('precludes', 0.033), ('distribution', 0.033), ('na', 0.032), ('mixing', 0.031), ('ic', 0.031), ('multinomials', 0.031), ('spoken', 0.031), ('predict', 0.029), ('purpose', 0.029), ('particles', 0.029), ('tracks', 0.029), ('update', 0.028), ('draw', 0.028), ('filtering', 0.028), ('factorial', 0.028), ('hmms', 0.028), ('nt', 0.028), ('segmented', 0.028), ('smoothing', 0.027), ('online', 0.027), ('bayes', 0.027), ('um', 0.026), ('dependencies', 0.026), ('according', 0.025), ('beta', 0.025), ('incorporate', 0.025), ('carlo', 0.025), ('monte', 0.025), ('dc', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 48 nips-2005-Context as Filtering
Author: Daichi Mochihashi, Yuji Matsumoto
Abstract: Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has almost been neglected so far and a na¨ve bag-of-words history has been ı employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order. 1
2 0.23616709 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
3 0.19805171 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
Author: Sharon Goldwater, Mark Johnson, Thomas L. Griffiths
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
4 0.13708827 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman
Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1
5 0.12893695 45 nips-2005-Conditional Visual Tracking in Kernel Space
Author: Cristian Sminchisescu, Atul Kanujia, Zhiguo Li, Dimitris Metaxas
Abstract: We present a conditional temporal probabilistic framework for reconstructing 3D human motion in monocular video based on descriptors encoding image silhouette observations. For computational efÄ?Ĺš ciency we restrict visual inference to low-dimensional kernel induced non-linear state spaces. Our methodology (kBME) combines kernel PCA-based non-linear dimensionality reduction (kPCA) and Conditional Bayesian Mixture of Experts (BME) in order to learn complex multivalued predictors between observations and model hidden states. This is necessary for accurate, inverse, visual perception inferences, where several probable, distant 3D solutions exist due to noise or the uncertainty of monocular perspective projection. Low-dimensional models are appropriate because many visual processes exhibit strong non-linear correlations in both the image observations and the target, hidden state variables. The learned predictors are temporally combined within a conditional graphical model in order to allow a principled propagation of uncertainty. We study several predictors and empirically show that the proposed algorithm positively compares with techniques based on regression, Kernel Dependency Estimation (KDE) or PCA alone, and gives results competitive to those of high-dimensional mixture predictors at a fraction of their computational cost. We show that the method successfully reconstructs the complex 3D motion of humans in real monocular video sequences. 1 Introduction and Related Work We consider the problem of inferring 3D articulated human motion from monocular video. This research topic has applications for scene understanding including human-computer interfaces, markerless human motion capture, entertainment and surveillance. A monocular approach is relevant because in real-world settings the human body parts are rarely completely observed even when using multiple cameras. This is due to occlusions form other people or objects in the scene. A robust system has to necessarily deal with incomplete, ambiguous and uncertain measurements. Methods for 3D human motion reconstruction can be classiÄ?Ĺš ed as generative and discriminative. They both require a state representation, namely a 3D human model with kinematics (joint angles) or shape (surfaces or joint positions) and they both use a set of image features as observations for state inference. The computational goal in both cases is the conditional distribution for the model state given image observations. Generative model-based approaches [6, 16, 14, 13] have been demonstrated to Ä?Ĺš‚exibly reconstruct complex unknown human motions and to naturally handle problem constraints. However it is difÄ?Ĺš cult to construct reliable observation likelihoods due to the complexity of modeling human appearance. This varies widely due to different clothing and deformation, body proportions or lighting conditions. Besides being somewhat indirect, the generative approach further imposes strict conditional independence assumptions on the temporal observations given the states in order to ensure computational tractability. Due to these factors inference is expensive and produces highly multimodal state distributions [6, 16, 13]. Generative inference algorithms require complex annealing schedules [6, 13] or systematic non-linear search for local optima [16] in order to ensure continuing tracking. These difÄ?Ĺš culties motivate the advent of a complementary class of discriminative algorithms [10, 12, 18, 2], that approximate the state conditional directly, in order to simplify inference. However, inverse, observation-to-state multivalued mappings are difÄ?Ĺš cult to learn (see e.g. Ä?Ĺš g. 1a) and a probabilistic temporal setting is necessary. In an earlier paper [15] we introduced a probabilistic discriminative framework for human motion reconstruction. Because the method operates in the originally selected state and observation spaces that can be task generic, therefore redundant and often high-dimensional, inference is more expensive and can be less robust. To summarize, reconstructing 3D human motion in a Figure 1: (a, Left) Example of 180o ambiguity in predicting 3D human poses from silhouette image features (center). It is essential that multiple plausible solutions (e.g. F 1 and F2 ) are correctly represented and tracked over time. A single state predictor will either average the distant solutions or zig-zag between them, see also tables 1 and 2. (b, Right) A conditional chain model. The local distributions p(yt |yt−1 , zt ) or p(yt |zt ) are learned as in Ä?Ĺš g. 2. For inference, the predicted local state conditional is recursively combined with the Ä?Ĺš ltered prior c.f . (1). conditional temporal framework poses the following difÄ?Ĺš culties: (i) The mapping between temporal observations and states is multivalued (i.e. the local conditional distributions to be learned are multimodal), therefore it cannot be accurately represented using global function approximations. (ii) Human models have multivariate, high-dimensional continuous states of 50 or more human joint angles. The temporal state conditionals are multimodal which makes efÄ?Ĺš cient Kalman Ä?Ĺš ltering algorithms inapplicable. General inference methods (particle Ä?Ĺš lters, mixtures) have to be used instead, but these are expensive for high-dimensional models (e.g. when reconstructing the motion of several people that operate in a joint state space). (iii) The components of the human state and of the silhouette observation vector exhibit strong correlations, because many repetitive human activities like walking or running have low intrinsic dimensionality. It appears wasteful to work with high-dimensional states of 50+ joint angles. Even if the space were truly high-dimensional, predicting correlated state dimensions independently may still be suboptimal. In this paper we present a conditional temporal estimation algorithm that restricts visual inference to low-dimensional, kernel induced state spaces. To exploit correlations among observations and among state variables, we model the local, temporal conditional distributions using ideas from Kernel PCA [11, 19] and conditional mixture modeling [7, 5], here adapted to produce multiple probabilistic predictions. The corresponding predictor is referred to as a Conditional Bayesian Mixture of Low-dimensional Kernel-Induced Experts (kBME). By integrating it within a conditional graphical model framework (Ä?Ĺš g. 1b), we can exploit temporal constraints probabilistically. We demonstrate that this methodology is effective for reconstructing the 3D motion of multiple people in monocular video. Our contribution w.r.t. [15] is a probabilistic conditional inference framework that operates over a non-linear, kernel-induced low-dimensional state spaces, and a set of experiments (on both real and artiÄ?Ĺš cial image sequences) that show how the proposed framework positively compares with powerful predictors based on KDE, PCA, or with the high-dimensional models of [15] at a fraction of their cost. 2 Probabilistic Inference in a Kernel Induced State Space We work with conditional graphical models with a chain structure [9], as shown in Ä?Ĺš g. 1b, These have continuous temporal states yt , t = 1 . . . T , observations zt . For compactness, we denote joint states Yt = (y1 , y2 , . . . , yt ) or joint observations Zt = (z1 , . . . , zt ). Learning and inference are based on local conditionals: p(yt |zt ) and p(yt |yt−1 , zt ), with yt and zt being low-dimensional, kernel induced representations of some initial model having state xt and observation rt . We obtain zt , yt from rt , xt using kernel PCA [11, 19]. Inference is performed in a low-dimensional, non-linear, kernel induced latent state space (see Ä?Ĺš g. 1b and Ä?Ĺš g. 2 and (1)). For display or error reporting, we compute the original conditional p(x|r), or a temporally Ä?Ĺš ltered version p(xt |Rt ), Rt = (r1 , r2 , . . . , rt ), using a learned pre-image state map [3]. 2.1 Density Propagation for Continuous Conditional Chains For online Ä?Ĺš ltering, we compute the optimal distribution p(yt |Zt ) for the state yt , conditioned by observations Zt up to time t. The Ä?Ĺš ltered density can be recursively derived as: p(yt |Zt ) = p(yt |yt−1 , zt )p(yt−1 |Zt−1 ) (1) yt−1 We compute using a conditional mixture for p(yt |yt−1 , zt ) (a Bayesian mixture of experts c.f . §2.2) and the prior p(yt−1 |Zt−1 ), each having, say M components. We integrate M 2 pairwise products of Gaussians analytically. The means of the expanded posterior are clustered and the centers are used to initialize a reduced M -component Kullback-Leibler approximation that is reÄ?Ĺš ned using gradient descent [15]. The propagation rule (1) is similar to the one used for discrete state labels [9], but here we work with multivariate continuous state spaces and represent the local multimodal state conditionals using kBME (Ä?Ĺš g. 2), and not log-linear models [9] (these would require intractable normalization). This complex continuous model rules out inference based on Kalman Ä?Ĺš ltering or dynamic programming [9]. 2.2 Learning Bayesian Mixtures over Kernel Induced State Spaces (kBME) In order to model conditional mappings between low-dimensional non-linear spaces we rely on kernel dimensionality reduction and conditional mixture predictors. The authors of KDE [19] propose a powerful structured unimodal predictor. This works by decorrelating the output using kernel PCA and learning a ridge regressor between the input and each decorrelated output dimension. Our procedure is also based on kernel PCA but takes into account the structure of the studied visual problem where both inputs and outputs are likely to be low-dimensional and the mapping between them multivalued. The output variables xi are projected onto the column vectors of the principal space in order to obtain their principal coordinates y i . A z ∈ P(Fr ) O p(y|z) kP CA ĂŽĹšr (r) ⊂ Fr O / y ∈ P(Fx ) O QQQ QQQ QQQ kP CA QQQ Q( ĂŽĹšx (x) ⊂ Fx x ≈ PreImage(y) O ĂŽĹšr ĂŽĹšx r ∈ R ⊂ Rr x ∈ X ⊂ Rx p(x|r) ≈ p(x|y) Figure 2: The learned low-dimensional predictor, kBME, for computing p(x|r) â‰Ä„ p(xt |rt ), ∀t. (We similarly learn p(xt |xt−1 , rt ), with input (x, r) instead of r – here we illustrate only p(x|r) for clarity.) The input r and the output x are decorrelated using Kernel PCA to obtain z and y respectively. The kernels used for the input and output are ĂŽĹš r and ĂŽĹšx , with induced feature spaces Fr and Fx , respectively. Their principal subspaces obtained by kernel PCA are denoted by P(Fr ) and P(Fx ), respectively. A conditional Bayesian mixture of experts p(y|z) is learned using the low-dimensional representation (z, y). Using learned local conditionals of the form p(yt |zt ) or p(yt |yt−1 , zt ), temporal inference can be efÄ?Ĺš ciently performed in a low-dimensional kernel induced state space (see e.g. (1) and Ä?Ĺš g. 1b). For visualization and error measurement, the Ä?Ĺš ltered density, e.g. p(yt |Zt ), can be mapped back to p(xt |Rt ) using the pre-image c.f . (3). similar procedure is performed on the inputs ri to obtain zi . In order to relate the reduced feature spaces of z and y (P(Fr ) and P(Fx )), we estimate a probability distribution over mappings from training pairs (zi , yi ). We use a conditional Bayesian mixture of experts (BME) [7, 5] in order to account for ambiguity when mapping similar, possibly identical reduced feature inputs to very different feature outputs, as common in our problem (Ä?Ĺš g. 1a). This gives a model that is a conditional mixture of low-dimensional kernel-induced experts (kBME): M g(z|δ j )N (y|Wj z, ĂŽĹ j ) p(y|z) = (2) j=1 where g(z|δ j ) is a softmax function parameterized by δ j and (Wj , ĂŽĹ j ) are the parameters and the output covariance of expert j, here a linear regressor. As in many Bayesian settings [17, 5], the weights of the experts and of the gates, Wj and δ j , are controlled by hierarchical priors, typically Gaussians with 0 mean, and having inverse variance hyperparameters controlled by a second level of Gamma distributions. We learn this model using a double-loop EM and employ ML-II type approximations [8, 17] with greedy (weight) subset selection [17, 15]. Finally, the kBME algorithm requires the computation of pre-images in order to recover the state distribution x from it’s image y ∈ P(Fx ). This is a closed form computation for polynomial kernels of odd degree. For more general kernels optimization or learning (regression based) methods are necessary [3]. Following [3, 19], we use a sparse Bayesian kernel regressor to learn the pre-image. This is based on training data (xi , yi ): p(x|y) = N (x|AĂŽĹšy (y), â„Ĺš) (3) with parameters and covariances (A, â„Ĺš). Since temporal inference is performed in the low-dimensional kernel induced state space, the pre-image function needs to be calculated only for visualizing results or for the purpose of error reporting. Propagating the result from the reduced feature space P(Fx ) to the output space X pro- duces a Gaussian mixture with M elements, having coefÄ?Ĺš cients g(z|δ j ) and components N (x|AĂŽĹšy (Wj z), AJĂŽĹšy ĂŽĹ j JĂŽĹšy A + â„Ĺš), where JĂŽĹšy is the Jacobian of the mapping ĂŽĹšy . 3 Experiments We run experiments on both real image sequences (Ä?Ĺš g. 5 and Ä?Ĺš g. 6) and on sequences where silhouettes were artiÄ?Ĺš cially rendered. The prediction error is reported in degrees (for mixture of experts, this is w.r.t. the most probable one, but see also Ä?Ĺš g. 4a), and normalized per joint angle, per frame. The models are learned using standard cross-validation. Pre-images are learned using kernel regressors and have average error 1.7o . Training Set and Model State Representation: For training we gather pairs of 3D human poses together with their image projections, here silhouettes, using the graphics package Maya. We use realistically rendered computer graphics human surface models which we animate using human motion capture [1]. Our original human representation (x) is based on articulated skeletons with spherical joints and has 56 skeletal d.o.f. including global translation. The database consists of 8000 samples of human activities including walking, running, turns, jumps, gestures in conversations, quarreling and pantomime. Image Descriptors: We work with image silhouettes obtained using statistical background subtraction (with foreground and background models). Silhouettes are informative for pose estimation although prone to ambiguities (e.g. the left / right limb assignment in side views) or occasional lack of observability of some of the d.o.f. (e.g. 180o ambiguities in the global azimuthal orientation for frontal views, e.g. Ä?Ĺš g. 1a). These are multiplied by intrinsic forward / backward monocular ambiguities [16]. As observations r, we use shape contexts extracted on the silhouette [4] (5 radial, 12 angular bins, size range 1/8 to 3 on log scale). The features are computed at different scales and sizes for points sampled on the silhouette. To work in a common coordinate system, we cluster all features in the training set into K = 50 clusters. To compute the representation of a new shape feature (a point on the silhouette), we ‘project’ onto the common basis by (inverse distance) weighted voting into the cluster centers. To obtain the representation (r) for a new silhouette we regularly sample 200 points on it and add all their feature vectors into a feature histogram. Because the representation uses overlapping features of the observation the elements of the descriptor are not independent. However, a conditional temporal framework (Ä?Ĺš g. 1b) Ä?Ĺš‚exibly accommodates this. For experiments, we use Gaussian kernels for the joint angle feature space and dot product kernels for the observation feature space. We learn state conditionals for p(yt |zt ) and p(yt |yt−1 , zt ) using 6 dimensions for the joint angle kernel induced state space and 25 dimensions for the observation induced feature space, respectively. In Ä?Ĺš g. 3b) we show an evaluation of the efÄ?Ĺš cacy of our kBME predictor for different dimensions in the joint angle kernel induced state space (the observation feature space dimension is here 50). On the analyzed dancing sequence, that involves complex motions of the arms and the legs, the non-linear model signiÄ?Ĺš cantly outperforms alternative PCA methods and gives good predictions for compact, low-dimensional models.1 In tables 1 and 2, as well as Ä?Ĺš g. 4, we perform quantitative experiments on artiÄ?Ĺš cially rendered silhouettes. 3D ground truth joint angles are available and this allows a more 1 Running times: On a Pentium 4 PC (3 GHz, 2 GB RAM), a full dimensional BME model with 5 experts takes 802s to train p(xt |xt−1 , rt ), whereas a kBME (including the pre-image) takes 95s to train p(yt |yt−1 , zt ). The prediction time is 13.7s for BME and 8.7s (including the pre-image cost 1.04s) for kBME. The integration in (1) takes 2.67s for BME and 0.31s for kBME. The speed-up for kBME is signiÄ?Ĺš cant and likely to increase with original models having higher dimensionality. Prediction Error Number of Clusters 100 1000 100 10 1 1 2 3 4 5 6 7 8 Degree of Multimodality kBME KDE_RVM PCA_BME PCA_RVM 10 1 0 20 40 Number of Dimensions 60 Figure 3: (a, Left) Analysis of ‘multimodality’ for a training set. The input zt dimension is 25, the output yt dimension is 6, both reduced using kPCA. We cluster independently in (yt−1 , zt ) and yt using many clusters (2100) to simulate small input perturbations and we histogram the yt clusters falling within each cluster in (yt−1 , zt ). This gives intuition on the degree of ambiguity in modeling p(yt |yt−1 , zt ), for small perturbations in the input. (b, Right) Evaluation of dimensionality reduction methods for an artiÄ?Ĺš cial dancing sequence (models trained on 300 samples). The kBME is our model §2.2, whereas the KDE-RVM is a KDE model learned with a Relevance Vector Machine (RVM) [17] feature space map. PCA-BME and PCA-RVM are models where the mappings between feature spaces (obtained using PCA) is learned using a BME and a RVM. The non-linearity is signiÄ?Ĺš cant. Kernel-based methods outperform PCA and give low prediction error for 5-6d models. systematic evaluation. Notice that the kernelized low-dimensional models generally outperform the PCA ones. At the same time, they give results competitive to the ones of high-dimensional BME predictors, while being lower-dimensional and therefore signiÄ?Ĺš cantly less expensive for inference, e.g. the integral in (1). In Ä?Ĺš g. 5 and Ä?Ĺš g. 6 we show human motion reconstruction results for two real image sequences. Fig. 5 shows the good quality reconstruction of a person performing an agile jump. (Given the missing observations in a side view, 3D inference for the occluded body parts would not be possible without using prior knowledge!) For this sequence we do inference using conditionals having 5 modes and reduced 6d states. We initialize tracking using p(yt |zt ), whereas for inference we use p(yt |yt−1 , zt ) within (1). In the second sequence in Ä?Ĺš g. 6, we simultaneously reconstruct the motion of two people mimicking domestic activities, namely washing a window and picking an object. Here we do inference over a product, 12-dimensional state space consisting of the joint 6d state of each person. We obtain good 3D reconstruction results, using only 5 hypotheses. Notice however, that the results are not perfect, there are small errors in the elbow and the bending of the knee for the subject at the l.h.s., and in the different wrist orientations for the subject at the r.h.s. This reÄ?Ĺš‚ects the bias of our training set. Walk and turn Conversation Run and turn left KDE-RR 10.46 7.95 5.22 RVM 4.95 4.96 5.02 KDE-RVM 7.57 6.31 6.25 BME 4.27 4.15 5.01 kBME 4.69 4.79 4.92 Table 1: Comparison of average joint angle prediction error for different models. All kPCA-based models use 6 output dimensions. Testing is done on 100 video frames for each sequence, the inputs are artiÄ?Ĺš cially generated silhouettes, not in the training set. 3D joint angle ground truth is used for evaluation. KDE-RR is a KDE model with ridge regression (RR) for the feature space mapping, KDE-RVM uses an RVM. BME uses a Bayesian mixture of experts with no dimensionality reduction. kBME is our proposed model. kPCAbased methods use kernel regressors to compute pre-images. Expert Prediction Frequency − Closest to Ground truth Frequency − Close to ground truth 30 25 20 15 10 5 0 1 2 3 4 Expert Number 14 10 8 6 4 2 0 5 1st Probable Prev Output 2nd Probable Prev Output 3rd Probable Prev Output 4th Probable Prev Output 5th Probable Prev Output 12 1 2 3 4 Current Expert 5 Figure 4: (a, Left) Histogram showing the accuracy of various expert predictors: how many times the expert ranked as the k-th most probable by the model (horizontal axis) is closest to the ground truth. The model is consistent (the most probable expert indeed is the most accurate most frequently), but occasionally less probable experts are better. (b, Right) Histograms show the dynamics of p(yt |yt−1 , zt ), i.e. how the probability mass is redistributed among experts between two successive time steps, in a conversation sequence. Walk and turn back Run and turn KDE-RR 7.59 17.7 RVM 6.9 16.8 KDE-RVM 7.15 16.08 BME 3.6 8.2 kBME 3.72 8.01 Table 2: Joint angle prediction error computed for two complex sequences with walks, runs and turns, thus more ambiguity (100 frames). Models have 6 state dimensions. Unimodal predictors average competing solutions. kBME has signiÄ?Ĺš cantly lower error. Figure 5: Reconstruction of a jump (selected frames). Top: original image sequence. Middle: extracted silhouettes. Bottom: 3D reconstruction seen from a synthetic viewpoint. 4 Conclusion We have presented a probabilistic framework for conditional inference in latent kernelinduced low-dimensional state spaces. Our approach has the following properties: (a) Figure 6: Reconstructing the activities of 2 people operating in an 12-d state space (each person has its own 6d state). Top: original image sequence. Bottom: 3D reconstruction seen from a synthetic viewpoint. Accounts for non-linear correlations among input or output variables, by using kernel nonlinear dimensionality reduction (kPCA); (b) Learns probability distributions over mappings between low-dimensional state spaces using conditional Bayesian mixture of experts, as required for accurate prediction. In the resulting low-dimensional kBME predictor ambiguities and multiple solutions common in visual, inverse perception problems are accurately represented. (c) Works in a continuous, conditional temporal probabilistic setting and offers a formal management of uncertainty. We show comparisons that demonstrate how the proposed approach outperforms regression, PCA or KDE alone for reconstructing the 3D human motion in monocular video. Future work we will investigate scaling aspects for large training sets and alternative structured prediction methods. References [1] CMU Human Motion DataBase. Online at http://mocap.cs.cmu.edu/search.html, 2003. [2] A. Agarwal and B. Triggs. 3d human pose from silhouettes by Relevance Vector Regression. In CVPR, 2004. [3] G. Bakir, J. Weston, and B. Scholkopf. Learning to Ä?Ĺš nd pre-images. In NIPS, 2004. [4] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24, 2002. [5] C. Bishop and M. Svensen. Bayesian mixtures of experts. In UAI, 2003. [6] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In CVPR, 2000. [7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, (6):181–214, 1994. [8] D. Mackay. Bayesian interpolation. Neural Computation, 4(5):720–736, 1992. [9] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, 2000. [10] R. Rosales and S. Sclaroff. Learning Body Pose Via Specialized Maps. In NIPS, 2002. [11] B. Sch¨ lkopf, A. Smola, and K. M¨ ller. Nonlinear component analysis as a kernel eigenvalue o u problem. Neural Computation, 10:1299–1319, 1998. [12] G. Shakhnarovich, P. Viola, and T. Darrell. Fast Pose Estimation with Parameter Sensitive Hashing. In ICCV, 2003. [13] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard. Tracking Loose-limbed People. In CVPR, 2004. [14] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. In ICML, pages 759–766, Banff, 2004. [15] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative Density Propagation for 3D Human Motion Estimation. In CVPR, 2005. [16] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In CVPR, volume 1, pages 69–76, Madison, 2003. [17] M. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. JMLR, 2001. [18] C. Tomasi, S. Petrov, and A. Sastry. 3d tracking = classiÄ?Ĺš cation + interpolation. In ICCV, 2003. [19] J. Weston, O. Chapelle, A. Elisseeff, B. Scholkopf, and V. Vapnik. Kernel dependency estimation. In NIPS, 2002.
6 0.087134719 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
7 0.085888736 98 nips-2005-Infinite latent feature models and the Indian buffet process
8 0.082108408 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes
9 0.08125484 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
10 0.075647764 156 nips-2005-Prediction and Change Detection
11 0.075276896 195 nips-2005-Transfer learning for text classification
12 0.074205577 171 nips-2005-Searching for Character Models
13 0.0724658 184 nips-2005-Structured Prediction via the Extragradient Method
14 0.071116179 185 nips-2005-Subsequence Kernels for Relation Extraction
15 0.07004559 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
16 0.067792401 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
17 0.064426072 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
18 0.063699752 50 nips-2005-Convex Neural Networks
19 0.060924146 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
20 0.059372842 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
topicId topicWeight
[(0, 0.184), (1, 0.018), (2, 0.019), (3, 0.103), (4, 0.104), (5, -0.253), (6, 0.063), (7, 0.151), (8, -0.109), (9, -0.062), (10, -0.13), (11, -0.026), (12, 0.041), (13, -0.24), (14, -0.226), (15, -0.002), (16, -0.016), (17, 0.133), (18, -0.119), (19, -0.017), (20, -0.107), (21, -0.031), (22, -0.187), (23, -0.007), (24, -0.012), (25, 0.125), (26, 0.04), (27, 0.015), (28, -0.115), (29, -0.038), (30, 0.025), (31, -0.095), (32, 0.015), (33, 0.018), (34, -0.105), (35, 0.015), (36, 0.017), (37, 0.052), (38, 0.097), (39, -0.085), (40, -0.072), (41, -0.033), (42, 0.007), (43, -0.042), (44, 0.02), (45, 0.014), (46, -0.05), (47, 0.093), (48, -0.04), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.96897161 48 nips-2005-Context as Filtering
Author: Daichi Mochihashi, Yuji Matsumoto
Abstract: Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has almost been neglected so far and a na¨ve bag-of-words history has been ı employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order. 1
2 0.80120951 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
Author: Sharon Goldwater, Mark Johnson, Thomas L. Griffiths
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
3 0.6490835 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
4 0.51740545 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman
Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1
5 0.4513112 171 nips-2005-Searching for Character Models
Author: Jaety Edwards, David Forsyth
Abstract: We introduce a method to automatically improve character models for a handwritten script without the use of transcriptions and using a minimum of document specific training data. We show that we can use searches for the words in a dictionary to identify portions of the document whose transcriptions are unambiguous. Using templates extracted from those regions, we retrain our character prediction model to drastically improve our search retrieval performance for words in the document.
6 0.4456999 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes
7 0.41658241 156 nips-2005-Prediction and Change Detection
8 0.39384836 45 nips-2005-Conditional Visual Tracking in Kernel Space
9 0.36459211 98 nips-2005-Infinite latent feature models and the Indian buffet process
10 0.33073595 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
11 0.31357396 185 nips-2005-Subsequence Kernels for Relation Extraction
12 0.31096676 184 nips-2005-Structured Prediction via the Extragradient Method
13 0.30756581 33 nips-2005-Bayesian Sets
14 0.3030892 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
15 0.29685336 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
16 0.29582849 166 nips-2005-Robust Fisher Discriminant Analysis
17 0.26615804 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
18 0.26557022 35 nips-2005-Bayesian model learning in human visual perception
19 0.26274315 144 nips-2005-Off-policy Learning with Options and Recognizers
20 0.25623107 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
topicId topicWeight
[(3, 0.021), (10, 0.056), (18, 0.014), (27, 0.068), (31, 0.061), (34, 0.054), (41, 0.018), (55, 0.037), (57, 0.015), (69, 0.04), (73, 0.092), (80, 0.026), (88, 0.112), (91, 0.056), (96, 0.23)]
simIndex simValue paperId paperTitle
same-paper 1 0.81659484 48 nips-2005-Context as Filtering
Author: Daichi Mochihashi, Yuji Matsumoto
Abstract: Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has almost been neglected so far and a na¨ve bag-of-words history has been ı employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order. 1
2 0.61165124 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey
Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1
3 0.588983 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
4 0.58224517 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
5 0.58089644 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman
Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1
6 0.57826364 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
7 0.57575959 30 nips-2005-Assessing Approximations for Gaussian Process Classification
8 0.5716098 45 nips-2005-Conditional Visual Tracking in Kernel Space
9 0.56961578 144 nips-2005-Off-policy Learning with Options and Recognizers
10 0.56856596 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
11 0.56828636 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
12 0.5670917 35 nips-2005-Bayesian model learning in human visual perception
13 0.56663924 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
14 0.56534761 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
15 0.56347758 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
16 0.56148398 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
17 0.56070304 158 nips-2005-Products of ``Edge-perts
18 0.55845183 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
19 0.55831063 23 nips-2005-An Application of Markov Random Fields to Range Sensing
20 0.5570392 136 nips-2005-Noise and the two-thirds power Law