nips nips2006 nips2006-180 knowledge-graph by maker-knowledge-mining

180 nips-2006-Speakers optimize information density through syntactic reduction

Source: pdf

Author: T. F. Jaeger, Roger P. Levy

Abstract: If language users are rational, they might choose to structure their utterances so as to optimize communicative properties. In particular, information-theoretic and psycholinguistic considerations suggest that this may include maximizing the uniformity of information density in an utterance. We investigate this possibility in the context of syntactic reduction, where the speaker has the option of either marking a higher-order unit (a phrase) with an extra word, or leaving it unmarked. We demonstrate that speakers are more likely to reduce less information-dense phrases. In a second step, we combine a stochastic model of structured utterance production with a logistic-regression model of syntactic reduction to study which types of cues speakers employ when estimating the predictability of upcoming elements. We demonstrate that the trend toward predictability-sensitive syntactic reduction (Jaeger, 2006) is robust in the face of a wide variety of control variables, and present evidence that speakers use both surface and structural cues for predictability estimation.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Speakers optimize information density through syntactic reduction Roger Levy Department of Linguistics UC San Diego 9500 Gilman Drive La Jolla, CA 92093-0108, USA rlevy@ling. [sent-1, score-0.552]

2 edu Abstract If language users are rational, they might choose to structure their utterances so as to optimize communicative properties. [sent-6, score-0.262]

3 In particular, information-theoretic and psycholinguistic considerations suggest that this may include maximizing the uniformity of information density in an utterance. [sent-7, score-0.132]

4 We investigate this possibility in the context of syntactic reduction, where the speaker has the option of either marking a higher-order unit (a phrase) with an extra word, or leaving it unmarked. [sent-8, score-0.426]

5 We demonstrate that speakers are more likely to reduce less information-dense phrases. [sent-9, score-0.352]

6 In a second step, we combine a stochastic model of structured utterance production with a logistic-regression model of syntactic reduction to study which types of cues speakers employ when estimating the predictability of upcoming elements. [sent-10, score-1.621]

7 We demonstrate that the trend toward predictability-sensitive syntactic reduction (Jaeger, 2006) is robust in the face of a wide variety of control variables, and present evidence that speakers use both surface and structural cues for predictability estimation. [sent-11, score-1.498]

8 As a result, speakers are often confronted with choices as to how to structure their intended message into an utterance. [sent-13, score-0.395]

9 Under these circumstances, if speakers are rational then we can expect them to attempt to optimize the communicative properties of their utterances. [sent-15, score-0.466]

10 But what are the communicative properties that speakers choose to optimize? [sent-16, score-0.432]

11 The prevalence of ambiguity in natural language—the fact that many structural analyses are typically available for a given utterance—might lead one to expect that speakers seek to minimize structural ambiguity, but both experimental (Arnold et al. [sent-17, score-0.53]

12 , 2006, inter alia) investigations have found little evidence for active use of ambiguity-avoidance strategies. [sent-19, score-0.114]

13 In this paper we argue for a different locus of optimization: that speakers structure utterances so as to optimize information density. [sent-20, score-0.432]

14 If speakers behave optimally, they should structure their utterances so as to avoid peaks and troughs in information density (see also (Aylett and Turk, 2004; Genzel and Charniak, 2002)). [sent-22, score-0.513]

15 For example, this principle of uniform information density (UID) as an aspect of rational language production predicts that speakers should modulate phonetic duration in accordance with the predictability of the unit expressed. [sent-23, score-1.181]

16 If UID is a general principle of communicative optimality, however, its effects should be apparent at higher levels of linguistic production as well. [sent-26, score-0.251]

17 For phonetic reduction, choices about word duration can directly modulate information density. [sent-28, score-0.218]

18 However, it is less clear how the effects of UID at higher levels of language production observed by Genzel and Charniak (2002) and Keller (2004) come about. [sent-29, score-0.209]

19 Genzel and Charniak (2002) show that at least part of their result is driven by the repetition of open-class words, but it is unclear how this effect relates to a broader range of choice points within language production. [sent-30, score-0.126]

20 In particular, it is unclear whether any choices above the lexical level are affected by information density (as expected if UID is general). [sent-31, score-0.202]

21 In this paper we present the ﬁrst evidence that speakers’ choice during syntactic planning is affected by information density optimization. [sent-32, score-0.491]

22 This evidence comes from syntactic reduction—a phenomenon in which speakers have the choice of either marking a phrase with an optional word, or leaving it unmarked (Section 3). [sent-33, score-0.974]

23 We show that in cases where the phrase is marked, the marking reduces the phrase’s information density, and that the phrases that get marked are the ones that would otherwise be the most information-dense (Section 4). [sent-34, score-0.167]

24 This provides crucial support for UID as a general principle of language production. [sent-35, score-0.126]

25 The possibility that speakers’ use of syntactic reduction optimizes information density leads to questions as to how speakers estimate the probability of an upcoming syntactic event. [sent-36, score-1.291]

26 In particular, one can ask what types of cues language users employ when estimating these probabilites. [sent-37, score-0.204]

27 For example, speakers could compute information density using only surface cues (such as the words immediately preceding a phrase). [sent-38, score-0.601]

28 On the other hand, they might also take structural features of the utterance into account. [sent-39, score-0.283]

29 We investigate these issues in Section 5 using an incremental model of structured utterance production. [sent-40, score-0.293]

30 In this model, the predictability of the upcoming phrase markable by the optional word is taken as a measure of the phrase’s information density. [sent-41, score-0.826]

31 The resulting predictability estimate, in turn, becomes a covariate in a separate model of syntactic reduction. [sent-42, score-0.773]

32 Through this two-step modeling approach we show that predictability is able to explain a signiﬁcant part of the variability in syntactic reduction, and that evidence exists for speakers using both structural and surface cues in estimating phrasal predictability. [sent-43, score-1.512]

33 2 Optimal information density in linguistic utterances We begin with the information-theoretic deﬁnition that the information conveyed by a complete ut1 terance u is u’s Shannon information content (also called its surprisal), or log2 P (u) . [sent-44, score-0.343]

34 Optimization of information density entails that the information conveyed by each wi should be as uniform and close to an ideal value as possible. [sent-46, score-0.204]

35 First, the transmission of a message via spoken or written language can be viewed as a noisy channel. [sent-48, score-0.126]

36 From this assumption it follows that information density is optimized near the channel capacity, where speakers maximize the rate of information transmission while minimizing the danger of a mistransmitted message (see also Aylett (2000); Aylett and Turk (2004); Genzel and Charniak (2002)). [sent-49, score-0.481]

37 Second and independently of whether linguistic communication is viewed as a noisy channel, UID can be seen as minimizing comprehension difﬁculty. [sent-51, score-0.141]

38 The difﬁculty incurred by a comprehender in processing a word wi is positively correlated with its surprisal (Hale, 2001; Levy, 2006). [sent-52, score-0.258]

39 If the effect of surprisal on difﬁculty is superlinear, then the total difﬁculty of the utterance u 1 ( n [log P (wi |w1 ···wi−1 ) ]k with k > 1) is minimized when information density is uniform (for i=1 proof see appendix; see also Levy 2005, ch. [sent-53, score-0.354]

40 3 Syntactic reduction UID would be optimal in several ways, but do speakers actually consider UID as a factor when making choices during online syntactic production? [sent-56, score-0.842]

41 We address this question by directly linking a syntactic choice point to UID. [sent-57, score-0.334]

42 if it applies to all aspects of language production, we should ﬁnd its effects even in structural choices. [sent-60, score-0.215]

43 At the onset of an RC speakers can, but do not have to, utter the relativizer that. [sent-62, score-0.574]

44 2 We refer to the omission of that as syntactic REDUCTION. [sent-63, score-0.397]

45 The information density of the RC and subsequent parts of the sentence can be quantiﬁed by their Shannon information content. [sent-70, score-0.174]

46 As a ﬁrst test of this prediction, we use n-gram language models to measure the relationship between the Shannon information content of the ﬁrst word of an RC and the tendency toward syntactic reduction. [sent-71, score-0.593]

47 8 N= 1674 −5 −4 −3 −2 −1 log(P(W1 | W−2 W−1)) Figure 1: RC n-gram-estimated information density and syntactic reduction. [sent-81, score-0.439]

48 4 For every actual instance of an RC onset · · · w−2 w−1 (that)w1 · · · we calculated the trigram probability P (w1 |w−2 w−1 ): that is, an estimate of the probability that w1 would have if no relativizer had been used, regardless of whether a relativizer was actually used or not. [sent-86, score-0.474]

49 We then examined the relationship between this probability and the outcome event: whether or not a relativizer was actually used. [sent-87, score-0.212]

50 Figure 4 shows the relationship between the different quantiles of the log-probability of w1 and the likelihood of syntactic reduction. [sent-88, score-0.334]

51 This inverse relationship between w1 surprisal and relativizer use matches the predictions of UID. [sent-90, score-0.266]

52 5 5 Structural predictability and speaker choice Section 4 provides evidence that speakers’ choices about syntactic reduction are correlated with information density: RC onsets that would be more informationally dense in reduced form are less likely to be reduced. [sent-91, score-1.062]

53 This observation does not, however, provide strong evidence that speakers are directly sensitive to information density in their choices about reduction. [sent-92, score-0.552]

54 Furthermore, if speakers are sensitive to information density in their reduction choices, it raises a new question: what kind of information is taken into account in speakers’ estimation of information density? [sent-93, score-0.618]

55 This section addresses the questions of whether reduction is directly sensitive to information density, and what information might be used in estimates of information density, using a two-step modeling approach. [sent-94, score-0.211]

56 The ﬁrst step involves a incremental stochastic model of structured utterance production. [sent-95, score-0.293]

57 4 Omitting optional relativizers in the language model can alternatively be interpreted as assuming that speakers equate (3) with the second term of (2)—that is, the presence or absence of the relativizer is ignored in estimating the probablity of the ﬁrst word of a relative clause. [sent-98, score-0.932]

58 5 We also calculated the relationship for estimates of RC information density using a trigram model of the Switchboard corpus as-is. [sent-99, score-0.184]

59 The incremental parse through world consists of everything to the left of the dashed line. [sent-105, score-0.148]

60 given point in a sentence, given an incremental structural representation for the sentence up to that point. [sent-106, score-0.258]

61 Because the target event space of this term is small, a wide variety of cues, or features, can be included in the model, and the reliability of the resulting predictability estimates is relatively high. [sent-107, score-0.496]

62 The resulting predictability estimates serve as a crucial covariate in the second step: a logistic regression model including a number of control factors (see Section 3 and appendix). [sent-110, score-0.439]

63 3 as a stringent test of the explanatory power of UID for speakers’ reduction choices, and in Section 5. [sent-112, score-0.159]

64 4 to determine whether evidence exists for speakers using structural as well as surface cues in their predictability estimates. [sent-113, score-1.077]

65 1 A structural predictability model In this section we present a method of estimating the predictability of a relative clause in its sentential context, contingent on the structural analysis of that context. [sent-115, score-1.123]

66 For simplicity, we assume that structural analyses are context-free trees, and that the complete, correct incremental analysis of the sentential context is available for conditioning. [sent-116, score-0.24]

67 6 In general, the task is to estimate P (RCn+1··· |w1···n , T1···n ) (4) that is, the probability that a phrase of type RC appears in the utterance beginning at wn+1 , given the incremental structured utterance w1···n , T1···n . [sent-117, score-0.57]

68 As in Collins (2003) and Roark (2001), this type of directed generative process allows conditioning on arbitrary features of the incremental utterance. [sent-121, score-0.182]

69 6 If predictability from the perspective of the comprehender rather than the producer is taken to be of primary interest, this assumption may seem controversial. [sent-122, score-0.479]

70 Nevertheless, there is little evidence that incremental structural misanalysis is a pervasive phenomenon in naturally occuring language (Roland et al. [sent-123, score-0.391]

71 , 2006), and the types of incremental utterances occurring immediately before relative clauses do not seem to be good candidates for local misanalysis. [sent-124, score-0.207]

72 From a practical perspective, assuming access to the correct incremental analysis avoids the considerable difﬁculty involved in the incremental parsing of speech. [sent-125, score-0.248]

73 For RC predictability estimation, the only relevant category distinctions are between RC, ∗EN D∗, and any other non-null category, so we limit our event space to these three outcomes. [sent-131, score-0.496]

74 9 We included ﬁve classes of features in our models, chosen by linguistic considerations of what is likely to help predict the next event given an active node in an incremental utterance (see Wasow et al. [sent-137, score-0.548]

75 8 The phrase structures found in the Penn Treebank were ﬂattened and canonicalized to ensure that the incremental parse structures do not contain implicit information about upcoming constituents. [sent-141, score-0.333]

76 9 The predictability models were heavily overparameterized, and to prevent overﬁtting were regularized with a quadratic Bayesian prior. [sent-144, score-0.439]

77 3 Explanatory power of phrasal predictability We use the same statistical procedures as in (Jaeger, 2006, Chapter 4) to put the predictions of the information-density hypothesis to a more stringent test. [sent-148, score-0.566]

78 We evaluate the explanatory power of phrasal predictability in logistic regression models of syntactic reduction that include all the control variables otherwise known to inﬂuence relativizer omission (Section 3). [sent-149, score-1.308]

79 10 Phrasal predictability of the RC (based on the full feature set listed in Section 5. [sent-151, score-0.439]

80 2) was entered into this model as a covariate to test whether RC predictability co-determines syntactic reduction after other factors are controlled for. [sent-152, score-0.912]

81 Phrasal predictability makes a signiﬁcant contribution to the relativizer omission model (χ2 (1) = 54. [sent-153, score-0.688]

82 This demonstrates that phrasal predictability has explanatory power in this case of syntactic reduction. [sent-156, score-0.946]

83 4 Surface and structural conditioning of phrasal predictability The structural predictability model puts us in a position to ask whether empirically observed patterns of syntactic reduction give evidence for speakers’ use of some types of cues but not others. [sent-158, score-1.819]

84 In particular, there is a question of whether predictability based on surface cues alone (the NGRAM features of Section 5. [sent-159, score-0.609]

85 2) provides a complete description of information-density effects on speakers’ choices in syntactic reduction. [sent-160, score-0.377]

86 We tested this by building a syntactic-reduction model containing two predictability covariates: one using NGRAM features alone, and one using all other (i. [sent-161, score-0.464]

87 We can then test whether the parameter weight in the reduction model for each predictability measure differs signiﬁcantly from zero. [sent-165, score-0.578]

88 It turns out that both predictability measures matter: all-but-NGRAM predictability is highly signiﬁcant (χ2 (1) = 23. [sent-166, score-0.878]

89 0001), but NGRAM predictability is also signiﬁcant (χ2 (1) = 5. [sent-168, score-0.439]

90 70), they evidently exhibit enough differences to contribute non-redundant information in the reduction model. [sent-172, score-0.137]

91 We interpret this as evidence that speakers may be using both surface and structural cues for phrasal predictability estimation in utterance structuring. [sent-173, score-1.347]

92 6 Conclusion Using a case study in syntactic reduction, we have argued that information-density optimization— the tendency to maximize the uniformity of upcoming-event probabilities at each part of a sentence—plays an important role in speakers’ choices about structuring their utterances. [sent-174, score-0.377]

93 This question has been previously addressed in the context of phonetic reduction of highly predictable words and syllables (Aylett and Turk, 2004; Bell et al. [sent-175, score-0.18]

94 We have found evidence that speakers may be using both phrasal and structural information to calculate upcoming-event predictabilities. [sent-178, score-0.644]

95 The overall distribution of syntactic reduction has the effect of smoothing the information proﬁle of the sentence: when the function word is not omitted, the information density of the immediately following words is reduced. [sent-179, score-0.71]

96 On this view, syntactic reduction is available to the speaker as a pressure valve to regulate information density when it is dangerously high. [sent-187, score-0.609]

97 Equivalently, the presence of a function word can be interpreted as a signal to the comprehender to expect the unexpected, a rational exchange of time for reduced information density, or a meaningful delay (Jaeger, 2005). [sent-188, score-0.207]

98 The availability of more than one way to express a given meaning grants speakers the choice to select the optimal alternative for each communicative act. [sent-192, score-0.432]

99 The idea to derive estimates of RC predictability based on multiple cues originated in discussion with T. [sent-194, score-0.517]

100 Effects of disﬂuencies, predictability, and utterance position on word form variation in English conversation. [sent-227, score-0.278]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('predictability', 0.439), ('speakers', 0.352), ('rc', 0.343), ('syntactic', 0.334), ('relativizer', 0.186), ('utterance', 0.169), ('uid', 0.139), ('phrasal', 0.127), ('language', 0.126), ('incremental', 0.124), ('reduction', 0.113), ('word', 0.109), ('phrase', 0.108), ('aylett', 0.106), ('jaeger', 0.106), ('genzel', 0.093), ('optional', 0.093), ('rcs', 0.093), ('np', 0.093), ('structural', 0.089), ('linguistic', 0.088), ('production', 0.083), ('density', 0.081), ('communicative', 0.08), ('surprisal', 0.08), ('cues', 0.078), ('charniak', 0.074), ('english', 0.07), ('turk', 0.069), ('ngram', 0.066), ('relativizers', 0.066), ('wasow', 0.066), ('omission', 0.063), ('speaker', 0.057), ('event', 0.057), ('utterances', 0.056), ('linguistics', 0.056), ('node', 0.055), ('daughter', 0.053), ('switchboard', 0.053), ('upcoming', 0.053), ('pietra', 0.053), ('evidence', 0.052), ('levy', 0.049), ('explanatory', 0.046), ('conveyed', 0.046), ('sentence', 0.045), ('choices', 0.043), ('phonetic', 0.042), ('dis', 0.042), ('surface', 0.041), ('alia', 0.04), ('clause', 0.04), ('comprehender', 0.04), ('keller', 0.04), ('preterminal', 0.04), ('rcn', 0.04), ('trigram', 0.04), ('corpus', 0.039), ('wn', 0.039), ('redundancy', 0.037), ('onset', 0.036), ('head', 0.036), ('jj', 0.035), ('roland', 0.035), ('marking', 0.035), ('roark', 0.035), ('rational', 0.034), ('conditioning', 0.033), ('spontaneous', 0.033), ('overt', 0.032), ('inter', 0.032), ('active', 0.03), ('stanford', 0.03), ('della', 0.029), ('wi', 0.029), ('shannon', 0.028), ('lexical', 0.028), ('expanded', 0.027), ('communication', 0.027), ('animacy', 0.027), ('audience', 0.027), ('clauses', 0.027), ('daughters', 0.027), ('ferreira', 0.027), ('gilman', 0.027), ('hale', 0.027), ('jurafsky', 0.027), ('nps', 0.027), ('prosodic', 0.027), ('psycholinguistic', 0.027), ('sentential', 0.027), ('uency', 0.027), ('whether', 0.026), ('words', 0.025), ('uc', 0.025), ('features', 0.025), ('speech', 0.025), ('information', 0.024), ('parse', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 180 nips-2006-Speakers optimize information density through syntactic reduction

Author: T. F. Jaeger, Roger P. Levy

2 0.098669745 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation

Author: Joseph Turian, Benjamin Wellington, I. D. Melamed

Abstract: Parsing and translating natural languages can be viewed as problems of predicting tree structures. For machine learning approaches to these predictions, the diversity and high dimensionality of the structures involved mandate very large training sets. This paper presents a purely discriminative learning method that scales up well to problems of this size. Its accuracy was at least as good as other comparable methods on a standard parsing task. To our knowledge, it is the ﬁrst purely discriminative learning algorithm for translation with treestructured models. Unlike other popular methods, this method does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Experiments demonstrate the method’s versatility, accuracy, and eﬃciency. Relevant software is freely available at http://nlp.cs.nyu.edu/parser and http://nlp.cs.nyu.edu/GenPar. 1

3 0.068365827 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

Author: John R. Hershey, Trausti Kristjansson, Steven Rennie, Peder A. Olsen

Abstract: Human listeners have the extraordinary ability to hear and recognize speech even when more than one person is talking. Their machine counterparts have historically been unable to compete with this ability, until now. We present a modelbased system that performs on par with humans in the task of separating speech of two talkers from a single-channel recording. Remarkably, the system surpasses human recognition performance in many conditions. The models of speech use temporal dynamics to help infer the source speech signals, given mixed speech signals. The estimated source signals are then recognized using a conventional speech recognition system. We demonstrate that the system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task. One of the hallmarks of human perception is our ability to solve the auditory cocktail party problem: we can direct our attention to a given speaker in the presence of interfering speech, and understand what was said remarkably well. Until now the same could not be said for automatic speech recognition systems. However, we have recently introduced a system which in many conditions performs this task better than humans [1][2]. The model addresses the Pascal Speech Separation Challenge task [3], and outperforms all other published results by more than 10% word error rate (WER). In this model, dynamics are modeled using a layered combination of one or two Markov chains: one for long-term dependencies and another for short-term dependencies. The combination of the two speakers was handled via an iterative Laplace approximation method known as Algonquin [4]. Here we describe experiments that show better performance on the same task with a simpler version of the model. The task we address is provided by the PASCAL Speech Separation Challenge [3], which provides standard training, development, and test data sets of single-channel speech mixtures following an arbitrary but simple grammar. In addition, the challenge organizers have conducted human-listening experiments to provide an interesting baseline for comparison of computational techniques. The overall system we developed is composed of the three components: a speaker identiﬁcation and gain estimation component, a signal separation component, and a speech recognition system. In this paper we focus on the signal separation component, which is composed of the acoustic and grammatical models. The details of the other components are discussed in [2]. Single-channel speech separation has previously been attempted using Gaussian mixture models (GMMs) on individual frames of acoustic features. However such models tend to perform well only when speakers are of different gender or have rather different voices [4]. When speakers have similar voices, speaker-dependent mixture models cannot unambiguously identify the component speakers. In such cases it is helpful to model the temporal dynamics of the speech. Several models in the literature have attempted to do so either for recognition [5, 6] or enhancement [7, 8] of speech. Such models have typically been based on a discrete-state hidden Markov model (HMM) operating on a frame-based acoustic feature vector. Modeling the dynamics of the log spectrum of speech is challenging in that different speech components evolve at different time-scales. For example the excitation, which carries mainly pitch, versus the ﬁlter, which consists of the formant structure, are somewhat independent of each other. The formant structure closely follows the sequences of phonemes in each word, which are pronounced at a rate of several per second. In non-tonal languages such as English, the pitch ﬂuctuates with prosody over the course of a sentence, and is not directly coupled with the words being spoken. Nevertheless, it seems to be important in separating speech, because the pitch harmonics carry predictable structure that stands out against the background. We address the various dynamic components of speech by testing different levels of dynamic constraints in our models. We explore four different levels of dynamics: no dynamics, low-level acoustic dynamics, high-level grammar dynamics, and a layered combination, dual dynamics, of the acoustic and grammar dynamics. The grammar dynamics and dual dynamics models perform the best in our experiments. The acoustic models are combined to model mixtures of speech using two methods: a nonlinear model known as Algonquin, which models the combination of log-spectrum models as a sum in the power spectrum, and a simpler max model that combines two log spectra using the max function. It turns out that whereas Algonquin works well, our formulation of the max model does better overall. With the combination of the max model and grammar-level dynamics, the model produces remarkable results: it is often able to extract two utterances from a mixture even when they are from the same speaker 1 . Overall results are given in Table 1, which shows that our closest competitors are human listeners. Table 1: Overall word error rates across all conditions on the challenge task. Human: average human error rate, IBM: our best result, Next Best: the best of the eight other published results on this task, and Chance: the theoretical error rate for random guessing. System: Word Error Rate: 1 Human 22.3% IBM 22.6% Next Best 34.2% Chance 93.0% Speech Models The model consists of an acoustic model and temporal dynamics model for each source, and a mixing model, which models how the source models are combined to describe the mixture. The acoustic features were short-time log spectrum frames computed every 15 ms. Each frame was of length 40 ms and a 640-point mixed-radix FFT was used. The DC component was discarded, producing a 319-dimensional log-power-spectrum feature vector yt . The acoustic model consists of a set of diagonal-covariance Gaussians in the features. For a given speaker, a, we model the conditional probability of the log-power spectrum of each source signal xa given a discrete acoustic state sa as Gaussian, p(xa |sa ) = N (xa ; µsa , Σsa ), with mean µsa , and covariance matrix Σsa . We used 256 Gaussians, one per acoustic state, to model the acoustic space of each speaker. For efﬁciency and tractability we restrict the covariance to be diagonal. A model with no dynamics can be formulated by producing state probabilities p(sa ), and is depicted in 1(a). Acoustic Dynamics: To capture the low-level dynamics of the acoustic signal, we modeled the acoustic dynamics of a given speaker, a, via state transitions p(sa |sa ) as shown in Figure 1(b). t t−1 There are 256 acoustic states, hence for each speaker a, we estimated a 256 × 256 element transition matrix Aa . Grammar Dynamics: The grammar dynamics are modeled by grammar state transitions, a a p(vt |vt−1 ), which consist of left-to-right phone models. The legal word sequences are given by the Speech Separation Challenge grammar [3] and are modeled using a set of pronunciations that 1 Demos and information can be found at: http : //www.research.ibm.com/speechseparation sa t−1 sa t sa t−1 sa t xt−1 xt xt−1 xt (a) No Dynamics (b) Acoustic Dynamics a vt−1 a vt a vt−1 a vt sa t−1 sa t sa t−1 sa t xt−1 xt xt−1 xt (c) Grammar Dynamics (d) Dual Dynamics Figure 1: Graph of models for a given source. In (a), there are no dynamics, so the model is a simple mixture model. In (b), only acoustic dynamics are modeled. In (c), grammar dynamics are modeled with a shared set of acoustic Gaussians, in (d) dual – grammar and acoustic – dynamics have been combined. Note that (a) (b) and (c) are special cases of (d), where different nodes are assumed independent. map from words to three-state context-dependent phone models. The state transition probabilities derived from these phone models are sparse in the sense that most transition probabilities are zero. We model speaker dependent distributions p(sa |v a ) that associate the grammar states, v a to the speaker-dependent acoustic states. These are learned from training data where the grammar state sequences and acoustic state sequences are known for each utterance. The grammar of our system has 506 states, so we estimate a 506 × 256 element conditional probability matrix B a for each speaker. Dual Dynamics: The dual-dynamics model combines the acoustic dynamics with the grammar dynamics. It is useful in this case to avoid modeling the full combination of s and v states in the joint transitions p(sa |sa , vt ). Instead we make a naive-Bayes assumption to approximate this as t t−1 1 p(sa |sa )α p(sa |vt )β , where α and β adjust the relative inﬂuence of the two probabilities, and z t t−1 t z is the normalizing constant. Here we simply use the probability matrices Aa and B a , deﬁned above. 2 Mixed Speech Models The speech separation challenge involves recognizing speech in mixtures of signals from two speakers, a and b. We consider only mixing models that operate independently on each frequency for analytical and computational tractability. The short-time log spectrum of the mixture yt , in a given frequency band, is related to that of the two sources xa and xb via the mixing model given by the t t conditional probability distribution, p(y|xa , xb ). The joint distribution of the observation and source in one feature dimension, given the source states is thus: p(yt , xa , xb |sa , sb ) = p(yt |xa , xb )p(xa |sa )p(xb |sb ). t t t t t t t t t t (1) In general, to infer and reconstruct speech we need to compute the likelihood of the observed mixture p(yt |sa , sb ) = t t p(yt , xa , xb |sa , sb )dxa dxb , t t t t t t (2) and the posterior expected values of the sources given the states, E(xa |yt , sa , sb ) = t t t xa p(xa , xb |yt , sa , sb )dxa dxb , t t t t t t t (3) and similarly for xb . These quantities, combined with a prior model for the joint state set quences {sa , sb }, allow us to compute the minimum mean squared error (MMSE) estima1..T 1..T ˆ ˆ tors E(xa |y1..T ) or the maximum a posteriori (MAP) estimate E(xa |y1..T , sa 1..T , sb 1..T ), 1..T 1..T ˆ ˆ where sa 1..T , sb 1..T = arg maxsa ,sb p(sa , sb |y1..T ), where the subscript, 1..T , refers to 1..T 1..T 1..T 1..T all frames in the signal. The mixing model can be deﬁned in a number of ways. We explore two popular candidates, for which the above integrals can be readily computed: Algonquin, and the max model. s a s xa b xb y (a) Mixing Model (v a v b )t−1 (v a v b )t (sa sb )t−1 (sa sb )t yt yt (b) Dual Dynamics Factorial Model Figure 2: Model combination for two talkers. In (a) all dependencies are shown. In (b) the full dual-dynamics model is graphed with the xa and xb integrated out, and corresponding states from each speaker combined into product states. The other models are special cases of this graph with different edges removed, as in Figure 1. Algonquin: The relationship between the sources and mixture in the log power spectral domain is approximated as p(yt |xa , xb ) = N (yt ; log(exp(xa ) + exp(xb )), Ψ) (4) t t t t where Ψ is introduced to model the error due to the omission of phase [4]. An iterative NewtonLaplace method accurately approximates the conditional posterior p(xa , xb |yt , sa , sb ) from (1) as t t t t Gaussian. This Gaussian allows us to analytically compute the observation likelihood p(yt |sa , sb ) t t and expected value E(xa |yt , sa , sb ), as in [4]. t t t Max model: The mixing model is simpliﬁed using the fact that log of a sum is approximately the log of the maximum: p(y|xa , xb ) = δ y − max(xa , xb ) (5) In this model the likelihood is p(yt |sa , sb ) = pxa (yt |sa )Φxb (yt |sb ) + pxb (yt |sb )Φxa (yt |sa ), (6) t t t t t t t t t y t where Φxa (yt |sa ) = −∞ N (xa ; µsa , Σsa )dxa is a Gaussian cumulative distribution function [5]. t t t t t t In [5], such a model was used to compute state likelihoods and ﬁnd the optimal state sequence. In [8], a simpliﬁed model was used to infer binary masking values for reﬁltering. We take the max model a step further and derive source posteriors, so that we can compute the MMSE estimators for the log power spectrum. Note that the source posteriors in xa and xb are each t t a mixture of a delta function and a truncated Gaussian. Thus we analytically derive the necessary expected value: E(xa |yt , sa , sb ) t t t p(xa = yt |yt , sa , sb )yt + p(xa < yt |yt , sa , sb )E(xa |xa < yt , sa ) t t t t t t t t t pxa (yt |sa ) t a b , = πt yt + πt µsa − Σsa t t t Φxa (yt |sa ) t t = (7) (8) a b a with weights πt = p(xa=yt |yt , sa , sb ) = pxa (yt |sa )Φxb (yt |sb )/p(yt |sa , sb ), and πt = 1 − πt . For t t t t t t t t a ≫ µ b in a given frequency many pairs of states one model is signiﬁcantly louder than another µs s band, relative to their variances. In such cases it is reasonable to approximate the likelihood as p(yt |sa , sb ) ≈ pxa (yt |sa ), and the posterior expected values according to E(xa |yt , sa , sb ) ≈ yt and t t t t t t t E(xb |yt , sa , sb ) ≈ min(yt , µsb ), and similarly for µsa ≪ µsb . t t t t 3 Likelihood Estimation Because of the large number of state combinations, the model would not be practical without techniques to reduce computation time. To speed up the evaluation of the joint state likelihood, we employed both band quantization of the acoustic Gaussians and joint-state pruning. Band Quantization: One source of computational savings stems from the fact that some of the Gaussians in our model may differ only in a few features. Band quantization addresses this by approximating each of the D Gaussians of each model with a shared set of d Gaussians, where d ≪ D, in each of the F frequency bands of the feature vector. A similar idea is described in [9]. It relies on the use of a diagonal covariance matrix, so that p(xa |sa ) = f N (xa ; µf,sa , Σf,sa ), where Σf,sa f are the diagonal elements of covariance matrix Σsa . The mapping Mf (si ) associates each of the D Gaussians with one of the d Gaussians in band f . Now p(xa |sa ) = f N (xa ; µf,Mf (sa ) , Σf,Mf (sa ) ) ˆ f is used as a surrogate for p(xa |sa ). Figure 3 illustrates the idea. Figure 3: In band quantization, many multi-dimensional Gaussians are mapped to a few unidimensional Gaussians. Under this model the d Gaussians are optimized by minimizing the KL-divergence D( sa p(sa )p(xa |sa )|| sa p(sa )ˆ(xa |sa )), and likewise for sb . Then in each frequency band, p only d×d, instead of D ×D combinations of Gaussians have to be evaluated to compute p(y|sa , sb ). Despite the relatively small number of components d in each band, taken across bands, band quantization is capable of expressing dF distinct patterns, in an F -dimensional feature space, although in practice only a subset of these will be used to approximate the Gaussians in a given model. We used d = 8 and D = 256, which reduced the likelihood computation time by three orders of magnitude. Joint State Pruning: Another source of computational savings comes from the sparseness of the model. Only a handful of sa , sb combinations have likelihoods that are signiﬁcantly larger than the rest for a given observation. Only these states are required to adequately explain the observation. By pruning the total number of combinations down to a smaller number we can speed up the likelihood calculation, estimation of the components signals, as well as the temporal inference. However, we must estimate the likelihoods in order to determine which states to retain. We therefore used band-quantization to estimate likelihoods for all states, perform state pruning, and then the full model on the pruned states using the exact parameters. In the experiments reported here, we pruned down to 256 state combinations. The effect of these speedup methods on accuracy will be reported in a future publication. 4 Inference In our experiments we performed inference in four different conditions: no dynamics, with acoustic dynamics only, with grammar dynamics only, and with dual dynamics (acoustic and grammar). With no dynamics the source models reduce to GMMs and we infer MMSE estimates of the sources based on p(xa , xb |y) as computed from (1), using Algonquin or the max model. Once the log spectrum of each source is estimated, we estimate the corresponding time-domain signal as shown in [4]. In the acoustic dynamics condition the exact inference algorithm uses a 2-Dimensional Viterbi search, described below, with acoustic temporal constraints p(st |st−1 ) and likelihoods from Eqn. (1), to ﬁnd the most likely joint state sequence s1..T . Similarly in the grammar dynamics condition, 2-D Viterbi search is used to infer the grammar state sequences, v1..T . Instead of single Gaussians as the likelihood models, however, we have mixture models in this case. So we can perform an MMSE estimate of the sources by averaging over the posterior probability of the mixture components given the grammar Viterbi sequence, and the observations. It is critical to use the 2-D Viterbi algorithm in both cases, rather than the forward-backward algorithm, because in the same-speaker condition at 0dB, the acoustic models and dynamics are symmetric. This symmetry means that the posterior is essentially bimodal and averaging over these modes would yield identical estimates for both speakers. By ﬁnding the best path through the joint state space, the 2-D Viterbi algorithm breaks this symmetry and allows the model to make different estimates for each speaker. In the dual-dynamics condition we use the model of section 2(b). With two speakers, exact inference is computationally complex because the full joint distribution of the grammar and acoustic states, (v a × sa ) × (v b × sb ) is required and is very large in number. Instead we perform approximate inference by alternating the 2-D Viterbi search between two factors: the Cartesian product sa × sb of the acoustic state sequences and the Cartesian product v a × v b of the grammar state sequences. When evaluating each state sequence we hold the other chain constant, which decouples its dynamics and allows for efﬁcient inference. This is a useful factorization because the states sa and sb interact strongly with each other and similarly for v a and v b . Again, in the same-talker condition, the 2-D Viterbi search breaks the symmetry in each factor. 2-D Viterbi search: The Viterbi algorithm estimates the maximum-likelihood state sequence s1..T given the observations x1..T . The complexity of the Viterbi search is O(T D2 ) where D is the number of states and T is the number of frames. For producing MAP estimates of the 2 sources, we require a 2 dimensional Viterbi search which ﬁnds the most likely joint state sequences sa and 1..T sb given the mixed signal y1..T as was proposed in [5]. 1..T On the surface, the 2-D Viterbi search appears to be of complexity O(T D4 ). Surprisingly, it can be computed in O(T D3 ) operations. This stems from the fact that the dynamics for each chain are independent. The forward-backward algorithm for a factorial HMM with N state variables requires only O(T N DN +1 ) rather than the O(T D2N ) required for a naive implementation [10]. The same is true for the Viterbi algorithm. In the Viterbi algorithm, we wish to ﬁnd the most probable paths leading to each state by ﬁnding the two arguments sa and sb of the following maximization: t−1 t−1 {ˆa , sb } = st−1 ˆt−1 = arg max p(sa |sa )p(sb |sb )p(sa , sb |y1..t−1 ) t t−1 t t−1 t−1 t−1 sa sb t−1 t−1 arg max p(sa |sa ) max p(sb |sb )p(sa , sb |y1..t−1 ). t t−1 t t−1 t−1 t−1 a st−1 sb t−1 (9) The two maximizations can be done in sequence, requiring O(D3 ) operations with O(D2 ) storage for each step. In general, as with the forward-backward algorithm, the N -dimensional Viterbi search requires O(T N DN +1 ) operations. We can also exploit the sparsity of the transition matrices and observation likelihoods, by pruning unlikely values. Using both of these methods our implementation of 2-D Viterbi search is faster than the acoustic likelihood computation that serves as its input, for the model sizes and grammars chosen in the speech separation task. Speaker and Gain Estimation: In the challenge task, the gains and identities of the two speakers were unknown at test time and were selected from a set of 34 speakers which were mixed at SNRs ranging from 6dB to -9dB. We used speaker-dependent acoustic models because of their advantages when separating different speakers. These models were trained on gain-normalized data, so the models are not well matched to the different gains of the signals at test time. This means that we have to estimate both the speaker identities and the gain in order to adapt our models to the source signals for each test utterance. The number of speakers and range of SNRs in the test set makes it too expensive to consider every possible combination of models and gains. Instead, we developed an efﬁcient model-based method for identifying the speakers and gains, described in [2]. The algorithm is based upon a very simple idea: identify and utilize frames that are dominated by a single source – based on their likelihoods under each speaker-dependent acoustic model – to determine what sources are present in the mixture. Using this criteria we can eliminate most of the unlikely speakers, and explore all combinations of the remaining speakers. An approximate EM procedure is then used to select a single pair of speakers and estimate their gains. Recognition: Although inference in the system may involve recognition of the words– for models that contain a grammar –we still found that a separately trained recognizer performed better. After reconstruction, each of the two signals is therefore decoded with a speech recognition system that incorporates Speaker Dependent Labeling (SDL) [2]. This method uses speaker dependent models for each of the 34 speakers. Instead of using the speaker identities provided by the speaker ID and gain module, we followed the approach for gender dependent labeling (GDL) described in [11]. This technique provides better results than if the true speaker ID is speciﬁed. 5 Results The Speech Separation Challenge [3] involves separating the mixed speech of two speakers drawn from of a set of 34 speakers. An example utterance is place white by R 4 now. In each recording, one of the speakers says white while the other says blue, red or green. The task is to recognize the letter and the digit of the speaker that said white. Using the SDL recognizer, we decoded the two estimated signals under the assumption that one signal contains white and the other does not, and vice versa. We then used the association that yielded the highest combined likelihood. 80 WER (%) 60 40 20 0 Same Talker No Separation No dynamics Same Gender Acoustic Dyn. Different Gender Grammar Dyn All Dual Dyn Human Figure 4: Average word error rate (WER) as a function of model dynamics, in different talker conditions, compared to Human error rates, using Algonquin. Human listener performance [3] is compared in Figure 4 to results using the SDL recognizer without speech separation, and for each the proposed models. Performance is poor without separation in all conditions. With no dynamics the models do surprisingly well in the different talker conditions, but poorly when the signals come from the same talker. Acoustic dynamics gives some improvement, mainly in the same-talker condition. The grammar dynamics seems to give the most beneﬁt, bringing the error rate in the same-gender condition below that of humans. The dual-dynamics model performed about the same as the grammar dynamics model, despite our intuitions. Replacing Algonquin with the max model reduced the error rate in the dual dynamics model (from 24.3% to 23.5%) and grammar dynamics model (from 24.6% to 22.6%), which brings the latter closer than any other model to the human recognition rate of 22.3%. Figure 5 shows the relative word error rate of the best system compared to human subjects. When both speakers are around the same loudness, the system exceeds human performance, and in the same-gender condition makes less than half the errors of the humans. Human listeners do better when the two signals are at different levels, even if the target is below the masker (i.e., in -9dB), suggesting that they are better able to make use of differences in amplitude as a cue for separation. Relative Word Error Rate (WER) 200 Same Talker Same Gender Different Gender Human 150 100 50 0 −50 −100 6 dB 3 dB 0 dB −3 dB Signal to Noise Ratio (SNR) −6 dB −9 dB Figure 5: Word error rate of best system relative to human performance. Shaded area is where the system outperforms human listeners. An interesting question is to what extent different grammar constraints affect the results. To test this, we limited the grammar to just the two test utterances, and the error rate on the estimated sources dropped to around 10%. This may be a useful paradigm for separating speech from background noise when the text is known, such as in closed-captioned recordings. At the other extreme, in realistic speech recognition scenarios, there is little knowledge of the background speaker’s grammar. In such cases the beneﬁts of models of low-level acoustic continuity over purely grammar-based systems may be more apparent. It is our hope that further experiments with both human and machine listeners will provide us with a better understanding of the differences in their performance characteristics, and provide insights into how the human auditory system functions, as well as how automatic speech perception in general can be brought to human levels of performance. References [1] T. Kristjansson, J. R. Hershey, P. A. Olsen, S. Rennie, and R. Gopinath, “Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system,” in ICSLP, 2006. [2] Steven Rennie, Pedera A. Olsen, John R. Hershey, and Trausti Kristjansson, “Separating multiple speakers using temporal constraints,” in ISCA Workshop on Statistical And Perceptual Audition, 2006. [3] Martin Cooke and Tee-Won Lee, “Interspeech speech separation http : //www.dcs.shef.ac.uk/ ∼ martin/SpeechSeparationChallenge.htm, 2006. challenge,” [4] T. Kristjansson, J. Hershey, and H. Attias, “Single microphone source separation using high resolution signal reconstruction,” ICASSP, 2004. [5] P. Varga and R.K. Moore, “Hidden Markov model decomposition of speech and noise,” ICASSP, pp. 845–848, 1990. [6] M. Gales and S. Young, “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, September 1996. [7] Y. Ephraim, “A Bayesian estimation approach for speech enhancement using hidden Markov models.,” vol. 40, no. 4, pp. 725–735, 1992. [8] S. Roweis, “Factorial models and reﬁltering for speech separation and denoising,” Eurospeech, pp. 1009–1012, 2003. [9] E. Bocchieri, “Vector quantization for the efﬁcient computation of continuous density likelihoods. proceedings of the international conference on acoustics,” in ICASSP, 1993, vol. II, pp. 692–695. [10] Zoubin Ghahramani and Michael I. Jordan, “Factorial hidden Markov models,” in Advances in Neural Information Processing Systems, vol. 8. [11] Peder Olsen and Satya Dharanipragada, “An efﬁcient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models,” in Eurospeech 2003, 2003, vol. 4, pp. 2509–2512.

4 0.062704623 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

Author: Su-in Lee, Varun Ganapathi, Daphne Koller

Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efﬁcient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efﬁcient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1 -based method achieves considerably higher generalization performance than the more standard L2 -based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.

5 0.058910333 33 nips-2006-Analysis of Representations for Domain Adaptation

Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira

Abstract: Discriminative learning methods for classiﬁcation perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classiﬁer which performs well on a target domain with a different distribution. Under what conditions can we adapt a classiﬁer trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justiﬁcation for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1

6 0.055169851 139 nips-2006-Multi-dynamic Bayesian Networks

7 0.051816244 113 nips-2006-Learning Structural Equation Models for fMRI

8 0.050550256 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments

9 0.046556354 20 nips-2006-Active learning for misspecified generalized linear models

10 0.040274382 69 nips-2006-Distributed Inference in Dynamical Systems

11 0.038580347 23 nips-2006-Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models

12 0.036718588 110 nips-2006-Learning Dense 3D Correspondence

13 0.035919014 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples

14 0.035590861 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

15 0.034894705 115 nips-2006-Learning annotated hierarchies from relational data

16 0.03468534 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

17 0.033114161 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

18 0.032782651 167 nips-2006-Recursive ICA

19 0.032647662 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

20 0.031784877 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.11), (1, 0.008), (2, 0.035), (3, -0.054), (4, 0.023), (5, 0.021), (6, 0.019), (7, -0.019), (8, -0.015), (9, -0.064), (10, -0.061), (11, -0.009), (12, 0.011), (13, 0.055), (14, 0.001), (15, -0.095), (16, 0.054), (17, 0.036), (18, 0.045), (19, -0.017), (20, -0.086), (21, 0.025), (22, -0.082), (23, 0.008), (24, -0.004), (25, -0.074), (26, -0.016), (27, -0.045), (28, -0.038), (29, -0.034), (30, 0.06), (31, -0.037), (32, -0.097), (33, -0.031), (34, -0.185), (35, 0.071), (36, -0.085), (37, 0.123), (38, 0.079), (39, 0.045), (40, 0.081), (41, 0.035), (42, -0.06), (43, 0.039), (44, 0.036), (45, 0.007), (46, -0.045), (47, -0.119), (48, -0.12), (49, -0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94338793 180 nips-2006-Speakers optimize information density through syntactic reduction

Author: T. F. Jaeger, Roger P. Levy

2 0.57458103 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation

Author: Joseph Turian, Benjamin Wellington, I. D. Melamed

3 0.57204354 139 nips-2006-Multi-dynamic Bayesian Networks

Author: Karim Filali, Jeff A. Bilmes

Abstract: We present a generalization of dynamic Bayesian networks to concisely describe complex probability distributions such as in problems with multiple interacting variable-length streams of random variables. Our framework incorporates recent graphical model constructs to account for existence uncertainty, value-speciﬁc independence, aggregation relationships, and local and global constraints, while still retaining a Bayesian network interpretation and efﬁcient inference and learning techniques. We introduce one such general technique, which is an extension of Value Elimination, a backtracking search inference algorithm. Multi-dynamic Bayesian networks are motivated by our work on Statistical Machine Translation (MT). We present results on MT word alignment in support of our claim that MDBNs are a promising framework for the rapid prototyping of new MT systems. 1 INTRODUCTION The description of factorization properties of families of probabilities using graphs (i.e., graphical models, or GMs), has proven very useful in modeling a wide variety of statistical and machine learning domains such as expert systems, medical diagnosis, decision making, speech recognition, and natural language processing. There are many different types of graphical model, each with its own properties and beneﬁts, including Bayesian networks, undirected Markov random ﬁelds, and factor graphs. Moreover, for different types of scientiﬁc modeling, different types of graphs are more or less appropriate. For example, static Bayesian networks are quite useful when the size of set of random variables in the domain does not grow or shrink for all data instances and queries of interest. Hidden Markov models (HMMs), on the other hand, are such that the number of underlying random variables changes depending on the desired length (which can be a random variable), and HMMs are applicable even without knowing this length as they can be extended indeﬁnitely using online inference. HMMs have been generalized to dynamic Bayesian networks (DBNs) and temporal conditional random ﬁelds (CRFs), where an underlying set of variables gets repeated as needed to ﬁll any ﬁnite but unbounded length. Probabilistic relational models (PRMs) [5] allow for a more complex template that can be expanded in multiple dimensions simultaneously. An attribute common to all of the above cases is that the speciﬁcation of rules for expanding any particular instance of a model is ﬁnite. In other words, these forms of GM allow the speciﬁcation of models with an unlimited number of random variables (RVs) using a ﬁnite description. This is achieved using parameter tying, so while the number of RVs increases without bound, the number of parameters does not. In this paper, we introduce a new class of model we call multi-dynamic Bayesian networks. MDBNs are motivated by our research into the application of graphical models to the domain of statistical machine translation (MT) and they have two key attributes from the graphical modeling perspective. First, an MDBN generalizes a DBN in that there are multiple “streams” of variables that can get unrolled, but where each stream may be unrolled by a differing amount. In the most general case, connecting these different streams together would require the speciﬁcation of conditional probabil- ity tables with a varying and potentially unlimited number of parents. To avoid this problem and retain the template’s ﬁnite description length, we utilize a switching parent functionality (also called value-speciﬁc independence). Second, in order to capture the notion of fertility in MT-systems (deﬁned later in the text), we employ a form of existence uncertainty [7] (that we call switching existence), whereby the existence of a given random variable might depend on the value of other random variables in the network. Being fully propositional, MDBNs lie between DBNs and PRMs in terms of expressiveness. While PRMs are capable of describing any MDBN, there are, in general, advantages to restricting ourselves to a more speciﬁc class of model. For example, in the DBN case, it is possible to provide a bound on inference costs just by looking at attributes of the DBN template only (e.g., the left or right interfaces [12, 2]). Restricting the model can also make it simpler to use in practice. MDBNs are still relatively simple, while at the same time making possible the easy expression of MT systems, and opening doors to novel forms of probabilistic inference as we show below. In section 2, we introduce MDBNs, and describe their application to machine translation showing how it is possible to represent even complex MT systems. In section 3, we describe MDBN learning and decoding algorithms. In section 4, we present experimental results in the area of statistical machine translation, and future work is discussed in section 5. 2 MDBNs A standard DBN [4] template consists of a directed acyclic graph G = (V, E) = (V1 ∪ V2 , E1 ∪ → E2 ∪ E2 ) with node set V and edge set E. For t ∈ {1, 2}, the sets Vt are the nodes at slice t, Et → are the intra-slice edges between nodes in Vt , and Et are the inter-slice edges between nodes in V1 and V2 . To unroll a DBN to length T , the nodes V2 along with the edges adjacent to any node in V2 are cloned T − 1 times (where parameters of cloned variables are constrained to be the same as the template) and re-connected at the corresponding places. An MDBN with K streams consists of the union of K DBN templates along with a template structure specifying rules to connect the various streams together. An MDBN template is a directed graph (k) G = (V, E) = ( V (k) , E (k) ∪ E ) k (k) (k) th k (k) where (V , E ) is the k DBN, and the edges E are rules specifying how to connect stream k to the other streams. These rules are general in that they specify the set of edges for all values of Tk . There can be arbitrary nesting of the streams such as, for example, it is possible to specify a model that can grow along several dimensions simultaneously. An MDBN also utilizes “switching existence”, meaning some subset of the variables in V bestow existence onto other variables in the network. We call these variables existence bestowing (or ebnodes). The idea of bestowing existence is well deﬁned over a discrete space, and is not dissimilar to a variable length DBN. For example, we may have a joint distribution over lengths as follows: p(X1 , . . . , XN , N ) = p(X1 , . . . , Xn |N = n)p(N = n) where here N is an eb-node that determines the number of other random variables in the DGM. Our notion of eb-nodes allows us to model certain characteristics found within machine translation systems, such as “fertility” [3], where a given English word is cloned a random number of times in the generative process that explains a translation from French into English. This random cloning might happen simultaneously at all points along a given MDBN stream. This means that even for a given ﬁxed stream length Ti = ti , each stream could have a randomly varying number of random variables. Our graphical notation for eb-nodes consists of the eb-node as a square box containing variables whose existence is determined by the eb-node. We start by providing a simple example of an expanded MDBN for three well known MT systems, namely the IBM models 1 and 2 [3], and the “HMM” model [15].1 We adopt the convention in [3] that our goal is to translate from a string of French words F = f of length M = m into a string of English words E = e of length L = l — of course these can be any two languages. The basic generative (noisy channel) approach when translating from French to English is to represent the joint 1 We will refer to it as M-HMM to avoid confusion with regular HMMs. distribution P (f , e) = P (f |e)P (e). P (e) is a language model specifying the prior over the word string e. The key goal is to produce a ﬁnite-description length representation for P (f |e) where f and e are of arbitrary length. A hidden alignment string, a, speciﬁes how the English words align to the French word, leading to P (f |e) = a P (f , a|e). Figure 1(a) is a 2-stream MDBN expanded representation of the three models, in this case ℓ = 4 and m = 3. As shown, it appears that the fan-in to node fi will be ℓ and thus will grow without bound. However, a switching mechanism whereby P (fi |e, ai ) = P (fi |eai ) limits the number of parameters regardless of L. This means that the alignment variable ai indicates the English word eai that should be aligned to French word fi . The variable e0 is a null word that connects to French words not explained by any of e1 , . . . , eℓ . The graph expresses all three models — the difference is that, in Models 1 and 2, there are no edges between aj and aj+1 . In Model 1, p(aj = ℓ) is uniform on the set {1, . . . , L}; in Model 2, the distribution over aj is a function only of its position j, and on the English and French lengths ℓ and m respectively. In the M-HMM model, the ai variables form a ﬁrst order Markov chain. l e0 ℓ e1 e3 e2 e1 e4 e2 e3 φ1 φ2 φ3 m’ φ0 τ01 a1 f2 a2 f3 a3 m (a) Models 1,2 and M-HMM τ12 τ13 τ21 π02 π11 π12 π13 π21 f2 f3 f4 f5 f6 a1 u v τ11 f1 f1 τ02 a2 a3 a4 a5 a6 π01 w y x m (b) Expanded M3 graph Figure 1: Expanded 2-stream MDBN description of IBM Models 1 and 2, and the M-HMM model for MT; and the expanded MDBN description of IBM Model 3 with fertility assignment φ0 = 2, φ1 = 3, φ2 = 1, φ3 = 0. From the above, we see that it would be difﬁcult to express this model graphically using a standard DBN since L and M are unequal random variables. Indeed, there are two DBNs in operation, one consisting of the English string, and the other consisting of the French string and its alignment. Moreover, the fully connected structure of the graph in the ﬁgure can represent the appropriate family of model, but it also represents models whose parameter space grows without bound — the switching function allows the model template to stay ﬁnite regardless of L and M . With our MDBN descriptive abilities complete, it is now possible to describe the more complex IBM models 3, and 4[3] (an MDBN for Model3 is depicted in ﬁg. 1(b)). The top most random variable, ℓ, is a hidden switching existence variable corresponding to the length of the English string. The box abutting ℓ includes all the nodes whose existence depends on the value of ℓ. In the ﬁgure, ℓ = 3, thus resulting in three English words e1 , e2 , and e3 connected using a second-order Markov chain. To each English word ei corresponds a conditionally dependent fertility eb-node φi , which indicates how many times ei is used by words in the French string. Each φi in turn controls the existence of a set of variables under it. Given the fertilities (the ﬁgure depicts the case φ1 = 3, φ2 = 1, φ3 = 0), for each word ei , φi French word variables are granted existence and are denoted by τi1 , τi2 , . . . , τiφi , what is called the tablet [3] of ei . The values taken by the τ variables need to match the actual observed French sequence f1 , . . . , fm . This is represented as a shared constraint between all the f , π, and τ variables which have incoming edges into the observed variable v. v’s conditional probability table is such that it is one only when the associated constraint is satisﬁed2 . The variable 2 This type of encoding of constraints corresponds to the standard mechanism used by Pearl [14]. A naive implementation, however, would enumerate a number of conﬁgurations exponential in the number of constrained variables, while typically only a small fraction of the conﬁgurations would have positive probability. πi,k ∈ {1, . . . , m} is a switching dependency parent with respect to the constraint variable v and determines which fj participates in an equality constraint with τi,k . The bottom variable m is a switching existence node (observed to be 6 in the ﬁgure) with corresponding French word sequence and alignment variables. The French sequence participates in the v constraint described above, while the alignment variables aj ∈ {1, . . . , ℓ}, j ∈ 1, . . . , m constrain the fertilities to take their unique allowable values (for the given alignment). Alignments also restrict the domain of permutation variables, π, using the constraint variable x. Finally, the domain size of each aj has to lie in the interval [0, ℓ] and that is enforced by the variable u. The dashed edges connecting the alignment a variables represent an extension to implement an M3/M-HMM hybrid. ℓ The null submodel involving the deterministic node m′ (= i=1 φi ) and eb-node φ0 accounts for French words that are not explained by any of the English words e1 , . . . , eℓ . In this submodel, successive permutation variables are ordered and this constraint is implemented using the observed child w of π0i and π0(i+1) . Model 4 [3] is similar to Model 3 except that the former is based on a more elaborate distortion model that uses relative instead of absolute positions both within and between tablets. 3 Inference, Parameter Estimation and MPE Multi-dynamic Bayesian Networks are amenable to any type of inference that is applicable to regular Bayesian networks as long as switching existence relationships are respected and all the constraints (aggregation for example) are satisﬁed. Unfortunately DBN inference procedures that take advantage of the repeatable template and can preprocess it ofﬂine, are not easy to apply to MDBNs. A case in point is the Junction Tree algorithm [11]. Triangulation algorithms exist that create an ofﬂine triangulated version of the input graph and do not re-triangulate it for each different instance of the input data [12, 2]. In MDBNs, due to the ﬂexibility to unroll templates in several dimensions and to specify dependencies and constraints spanning the entire unrolled graph, it is not obvious how we can exploit any repetitive patterns in a Junction Tree-style ofﬂine triangulation of the graph template. In section 4, we discuss sampling inference methods we have used. Here we discuss our extension to a backtracking search algorithm with the same performance guarantees as the JT algorithm, but with the advantage of easily handling determinism, existence uncertainty, and constraints, both learned and explicitly stated. Value Elimination (VE) ([1]), is a backtracking Bayesian network inference technique that caches factors associated with portions of the search tree and uses them to avoid iterating again over the same subtrees. We follow the notation introduced in [1] and refer the reader to that paper for details about VE inference. We have extended the VE inference approach to handle explicitly encoded constraints, existence uncertainty, and to perform approximate local domain pruning (see section 4). We omit these details as well as others in the original paper and brieﬂy describe the main data structure required by VE and sketch the algorithm we refer to as FirstPass (ﬁg. 1) since it constitutes the ﬁrst step of the learning procedure, our main contribution in this section. A VE factor, F , is such that we can write the following marginal of the joint distribution P (X = x, Y = y, Z) = F.val × f (Z) X=x such that (X∪Y)∩Z = ∅, F.val is a constant, and f (Z) a function of Z only. Y is a set of variables previously instantiated in the current branch of search tree to the value vector y. The pair (Y, y) is referred to as a dependency set (F.Dset). X is referred to as a subsumed set (F.Sset). By caching the tuple (F.Dset, F.Sset, F.val), we avoid recomputing the marginal again whenever (1) F.Dset is active, meaning all nodes stored in F.Dset are assigned their cached values in the current branch of the search tree; and (2) none of the variables in F.Sset are assigned yet. FirstPass (alg. 1) visits nodes in the graph in Depth First fashion. In line 7, we get the values of all Newly Single-valued (NSV) CPTs i.e., CPTs that involve the current node, V , and in which all We use a general directed domain pruning constraint. Deterministic relationships then become a special case of our constraint whereby the domain of the child variable is constrained to a single value with probability one. Variable traversal order: A, B, C, and D. Factors are numbered by order of creation. *Fi denotes the activation of factor i. Tau values propagated recursively F7: Dset={} Sset={A,B,C,D} val=P(E=e) F7.tau = 1.0 = P(Evidence)/F7.val A F5: Dset={A=0} Sset={B,C,D} F2 D *F1 *F2 Factor values needed for c(A=0) and c(C=0,B=0) computation: F5.val=P(B=0|A=0)*F3.val+P(B=1|A=0)*F4.val F3.val=P(C=0|B=0)*F1.val+P(C=1|B=0)*F2.val F4.val=P(C=0|B=1)*F1.val+P(C=1|B=1)*F2.val F1.val=P(D=0|C=0)P(E=e|D=0)+P(D=1|C=0)P(E=e|D=1) F2.val=P(D=0|C=1)P(E=e|D=0)+P(D=1|C=1)P(E=e|D=1) First pass C *F3 *F4 Second pass D B F4 C F6.tau = F7.tau * P(A=1) 1 B F3: Dset={B=0} Sset={C,D} F1 F5.tau = F7.tau * P(A=0) F6 0 F3.tau = F5.tau * P(B=0|A=0) + F6.tau * P(B=0|A=1) = P(B=0) F4.tau = F5.tau * P(B=1|A=0) + F6.tau * P(B=1|A=1) = P(B=1) F1.tau = F3.tau * P(C=0|B=0) + F4.tau * P(C=0|B=1) = P(C=0) F2.tau = F3.tau * P(C=1|B=0) + F4.tau * P(C=1|B=1) = P(C=1) c(A=0)=(1/P(e))*(F7.tau*P(A=0)*F5.val)=(1/P(e))(P(A=0)*P(E=e|A=0))=P(A=0|E=e) c(C=0,B=0)=(1/P(e))*F3.tau*P(C=0|B=0)*F1.val =(1/P(e) * (P(A=0,B=0)+P(A=1,B=0)) * P(C=0|B=0) * F1.val =(1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val =(1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val =(1/P(e)) * P(C=0,B=0) * F1.val =P(C=0,B=0,E=e)/P(e)=P(C=0,B=0|E=e) Figure 2: Learning example using the Markov chain A → B → C → D → E, where E is observed. In the ﬁrst pass, factors (Dset, Sset and val) are learned in a bottom up fashion. Also, the normalization constant P (E = e) (probability of evidence) is obtained. In the second pass, tau values are updated in a top-down fashion and used to calculate expected counts c(F.head, pa(F.head)) corresponding to each F.head (the ﬁgure shows the derivations for (A=0) and (C=0,B=0), but all counts are updated in the same pass). other variables are already assigned (these variables and their values are accumulated into Dset). We also check for factors that are active, multiply their values in, and accumulate subsumed vars in Sset (to avoid branching on them). In line 10, we add V to the Sset. In line 11, we cache a new factor F with value F.val = sum. We store V into F.head, a pointer to the last variable to be inserted into F.Sset, and needed for parameter estimation described below. F.Dset consists of all the variables, except V , that appeared in any NSV CPT or the Dset of an activated factor at line 6. Regular Value Elimination is query-based, similar to variable elimination and recursive conditioning—what this means is that to answer a query of the type P (Q|E = e), where Q is query variable and E a set of evidence nodes, we force Q to be at the top of the search tree, run the backtracking algorithm and then read the answers to the queries P (Q = q|E = e), q ∈ Dom[Q], along each of the outgoing edges of Q. Parameter estimation would require running a number of queries on the order of the number of parameters to estimate. We extend VE into an algorithm that allows us to obtain Expectation Maximization sufﬁcient statistics in a single run of Value Elimination plus a second pass, which can never take longer than the ﬁrst one (and in practice is much faster). This two-pass procedure is analogous to the collect-distribute evidence procedure in the Junction Tree algorithm, but here we do this via a search tree. Let θX=x|pa(X)=y be a parameter associated with variable X with value x and parents Y = pa(X) when they have value y. Assuming a maximum likelihood learning scenario3 , to estimate θX=x|pa(X)=y , we need to compute f (X = x, pa(X) = y, E = e) = P (W, X = x, pa(X) = y, E = e) W\{X,pa(X)} which is a sum of joint probabilities of all conﬁgurations that are consistent with the assignment {X = x, pa(X) = y}. If we were to turn off factor caching, we would enumerate all such variable conﬁgurations and could compute the sum. When standard VE factors are used, however, this is no longer possible whenever X or any of its parents becomes subsumed. Fig. 2 illustrates an example of a VE tree and the factors that are learned in the case of a Markov chain with an evidence node at the end. We can readily estimate the parameters associated with variables A and B as they are not subsumed along any branch. C and D become subsumed, however, and we cannot obtain the correct counts along all the branches that would lead to C and D in the full enumeration case. To address this issue, we store a special value, F.tau, in each factor. F.tau holds the sum over all path probabilities from the ﬁrst level of the search tree to the level at which the factor F was 3 For Bayesian networks the likelihood function decomposes such that maximizing the expectation of the complete likelihood is equivalent to maximizing the “local likelihood” of each variable in the network. either created or activated. For example, F 6.tau in ﬁg. 2 is simply P (A = 1). Although we can compute F 3.tau directly, we can also compute it recursively using F 5.tau and F 6.tau as shown in the ﬁgure. This is because both F 5 and F 6 subsume F 3: in the context {F 5.Dset}, there exists a (unique) value dsub of F 5.head4 s.t. F 3 becomes activable. Likewise for F 6. We cannot compute F 1.tau directly, but we can, recursively, from F 3.tau and F 4.tau by taking advantage of a similar subsumption relationship. In general, we can show that the following recursive relationship holds: F pa .tau × N SVF pa .head=dsub × F.tau ← F pa ∈F pa Fact .val F.val Fact ∈Fact (1) where F pa is the set of factors that subsume F , Fact is the set of all factors (including F ) that become active in the context of {F pa .Dset, F pa .head = dsub } and N SVF pa .head=dsub is the product of all newly single valued CPTs under the same context. For top-level factors (not subsumed by any factor), F.tau = Pevidence /F.val, which is 1.0 when there is a unique top-level factor. Alg. 2 is a simple recursive computation of eq. 1 for each factor. We visit learned factors in the reverse order in which they were learned to ensure that, for any factor F ′ , F ′ .tau is incremented (line 13) by any F that might have activated F ′ (line 12). For example, in ﬁg. 2, F 4 uses F 1 and F 2, so F 4.tau needs to be updated before F 1.tau and F 2.tau. In line 11, we can increment the counts for any NSV CPT entries since F.tau will account for the possible ways of reaching the conﬁguration {F.Dset, F.head = d} in an equivalent full enumeration tree. Algorithm 1: FirstPass(level) 1 2 3 4 5 6 7 8 9 10 Input: Graph G Output: A list of learned factors and Pevidence Select var V to branch on if V ==NONE then return Sset={}, Dset={} for d ∈ Dom[V ] do V ←d prod = productOfAllNSVsAndActiveFactors(Dset, Sset) if prod != 0 then FirstPass(level+1) sum += prod Sset = Sset ∪ {V } cacheNewFactor(F.head ← V ,F.val ← sum, F.Sset ← Sset, F.Dset ← Dset); Algorithm 2: SecondPass() 1 2 3 4 5 6 7 8 9 10 11 12 13 Input: F : List of factors in the reverse order learned in the ﬁrst pass and Pevidence . Result: Updated counts foreach F ∈ F do if F.Dset = {} then F.tau ← Pevidence /F.val else F.tau ← 0.0 Assign vars in F.Dset to their values V ← F.head (last node to have been subsumed in this factor) foreach d ∈ Dom[V ] do prod = productOfAllNSVsAndActiveFactors() prod∗ = F.tau foreach newly single-valued CPT C do count(C.child,C.parents)+=prod/Pevidence F ′ =getListOfActiveFactors() for F ′ ∈ F ′ do F ′ .tau+ = prod/F ′ .val Most Probable Explanation We compute MPE using a very similar two-pass algorithm. In the ﬁrst pass, factors are used to store a maximum instead of a summation over variables in the Sset. We also keep track of the value of F.head at which the maximum is achieved. In the second pass, we recursively ﬁnd the optimal variable conﬁguration by following the trail of factors that are activated when we assign each F.head variable to its maximum value starting from the last learned factor. 4 Recall, F.head is the last variable to be added to a newly created factor in line 10 of alg. 1 4 MACHINE TRANSLATION WORD ALIGNMENT EXPERIMENTS A major motivation for pursuing the type of representation and inference described above is to make it possible to solve computationally-intensive real-world problems using large amounts of data, while retaining the full generality and expressiveness afforded by the MDBN modeling language. In the experiments below we compare running times of MDBNs to GIZA++ on IBM Models 1 through 4 and the M-HMM model. GIZA++ is a special-purpose optimized MT word alignment C++ tool that is widely used in current state-of-the-art phrase-based MT systems [10] and at the time of this writing is the only publicly available software that implements all of the IBM Models. We test on French-English 107 hand-aligned sentences5 from a corpus of the European parliament proceedings (Europarl [9]) and train on 10000 sentence pairs from the same corpus and of maximum number of words 40. The Alignment Error Rate (AER) [13] evaluation metric quantiﬁes how well the MPE assignment to the hidden alignment variables matches human-generated alignments. Several pruning and smoothing techniques are used by GIZA and MDBNs. GIZA prunes low lexical (P (f |e)) probability values and uses a default small value for unseen (or pruned) probability table entries. For models 3 and 4, for which there is no known polynomial time algorithm to perform the full E-step or compute MPE, GIZA generates a set of high probability alignments using an MHMM and hill-climbing and collects EM counts over these alignments using M3 or M4. For MDBN models we use the following pruning strategy: at each level of the search tree we prune values which, together, account for the lowest speciﬁed percentage of the total probability mass of the product of all newly active CPTs in line 6 of alg. 1. This is a more effective pruning than simply removing low-probability values of each CPD because it factors in the joint contribution of multiple active variables. Table 1 shows a comparison of timing numbers obtained GIZA++ and MDBNs. The runtime numbers shown are for the combined tasks of training and decoding; however, training time dominates given the difference in size between train and test sets. For models 1 and 2 neither GIZA nor MDBNs perform any pruning. For the M-HMM, we prune 60% of probability mass at each level and use a Dirichlet prior over the alignment variables such that long-range transitions are exponentially less likely than shorter ones.6 This model achieves similar times and AER to GIZA’s. Interestingly, without any pruning, the MDBN M-HMM takes 160 minutes to complete while only marginally improving upon the pruned model. Experimenting with several pruning thresholds, we found that AER would worsen much more slowly than runtime decreases. Models 3 and 4 have treewidth equal to the number of alignment variables (because of the global constraints tying them) and therefore require approximate inference. Using Model 3, and a drastic pruning threshold that only keeps the value with the top probability at each level, we were able to achieve an AER not much higher than GIZA’s. For M4, it achieves a best AER of 31.7% while we do not improve upon Model3, most likely because a too restrictive pruning. Nevertheless, a simple variation on Model3 in the MDBN framework achieves a lower AER than our regular M3 (with pruning still the same). The M3-HMM hybrid model combines the Markov alignment dependencies from the M-HMM model with the fertility model of M3. MCMC Inference Sampling is widely used for inference in high-treewidth models. Although MDBNs support Likelihood Weighing, it is very inefﬁcient when the probability of evidence is very small, as is the case in our MT models. Besides being slow, Markov chain Monte Carlo can be problematic when the joint distribution is not positive everywhere, in particular in the presence of determinism and hard constraints. Techniques such as blocking Gibbs sampling [8] try to address the problem. Often, however, one has to carefully choose a problem-dependent proposal distribution. We used MCMC to improve training of the M3-HMM model. We were able to achieve an AER of 32.8% (down from 39.1%) but using 400 minutes of uniprocessor time. 5 CONCLUSION The existing classes of graphical models are not ideally suited for representing SMT models because “natural” semantics for specifying the latter combine ﬂavors of different GM types on top of standard directed Bayesian network semantics: switching parents found in Bayesian Multinets [6], aggregation relationships such as in Probabilistic Relational Models [5], and existence uncertainty [7]. We 5 Available at http://www.cs.washington.edu/homes/karim French and English have similar word orders. On a different language pair, a different prior might be more appropriate. With a uniform prior, the MDBN M-HMM has 36.0% AER. 6 Model Init M1 M2 M-HMM M3 M4 M3-HMM GIZA++ M1 M-HMM 1m45s (47.7%) N/A 2m02s (41.3%) N/A 4m05s (35.0%) N/A 2m50 (45%) 5m20s (38.5%) 5m20s (34.8%) 7m45s (31.7%) N/A MDBN M1 3m20s (48.0%) 5m30s (41.0%) 4m15s (33.0%) 12m (43.6%) 25m (43.6%) 9m30 (41.0%) M-HMM N/A N/A N/A 9m (42.5%) 23m (42.6%) 9m15s (39.1%) MCMC 400m (32.8%) Table 1: MDBN VE-based learning versus GIZA++ timings and %AER using 5 EM iterations. The columns M1 and M-HMM correspond to the model that is used to initialize the model in the corresponding row. The last row is a hybrid Model3-HMM model that we implemented using MDBNs and is not expressible using GIZA. have introduced a generalization of dynamic Bayesian networks to easily and concisely build models consisting of varying-length parallel asynchronous and interacting data streams. We have shown that our framework is useful for expressing various statistical machine translation models. We have also introduced new parameter estimation and decoding algorithms using exact and approximate searchbased probability computation. While our timing results are not yet as fast as a hand-optimized C++ program on the equivalent model, we have shown that even in this general-purpose framework of MDBNs, our timing numbers are competitive and usable. Our framework can of course do much more than the IBM and HMM models. One of our goals is to use this framework to rapidly prototype novel MT systems and develop methods to statistically induce an interlingua. We also intend to use MDBNs in other domains such as multi-party social interaction analysis. References [1] F. Bacchus, S. Dalmao, and T. Pitassi. Value elimination: Bayesian inference via backtracking search. In UAI-03, pages 20–28, San Francisco, CA, 2003. Morgan Kaufmann. [2] J. Bilmes and C. Bartels. On triangulating dynamic graphical models. In Uncertainty in Artiﬁcial Intelligence: Proceedings of the 19th Conference, pages 47–56. Morgan Kaufmann, 2003. [3] P. F. Brown, J. Cocke, S. A. Della Piettra, V. J. Della Piettra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June 1990. [4] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. AAAI, pages 524–528, 1988. [5] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300–1309, 1999. [6] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell., 82(1-2):45–74, 1996. [7] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3(4-5):697–707, May 2003. [8] C. Jensen, A. Kong, and U. Kjaerulff. Blocking Gibbs sampling in very large probabilistic expert systems. In International Journal of Human Computer Studies. Special Issue on Real-World Applications of Uncertain Reasoning., 1995. [9] P. Koehn. Europarl: A multilingual corpus for evaluation of machine http://www.isi.edu/koehn/publications/europarl, 2002. translation. [10] P. Koehn, F. Och, and D. Marcu. Statistical phrase-based translation. In NAACL/HLT 2003, 2003. [11] S. Lauritzen. Graphical Models. Oxford Science Publications, 1996. [12] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 2002. [13] F. J. Och and H. Ney. Improved statistical alignment models. In ACL, pages 440–447, Oct 2000. [14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988. [15] S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th conference on Computational linguistics, pages 836–841, Morristown, NJ, USA, 1996.

4 0.51380348 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

Author: John R. Hershey, Trausti Kristjansson, Steven Rennie, Peder A. Olsen

5 0.51095861 23 nips-2006-Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models

Author: Mark Johnson, Thomas L. Griffiths, Sharon Goldwater

Abstract: This paper introduces adaptor grammars, a class of probabilistic models of language that generalize probabilistic context-free grammars (PCFGs). Adaptor grammars augment the probabilistic rules of PCFGs with “adaptors” that can induce dependencies among successive uses. With a particular choice of adaptor, based on the Pitman-Yor process, nonparametric Bayesian models of language using Dirichlet processes and hierarchical Dirichlet processes can be written as simple grammars. We present a general-purpose inference algorithm for adaptor grammars, making it easy to deﬁne and use such models, and illustrate how several existing nonparametric Bayesian models can be expressed within this framework. 1

6 0.34791771 69 nips-2006-Distributed Inference in Dynamical Systems

7 0.3397848 131 nips-2006-Mixture Regression for Covariate Shift

8 0.32261008 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

9 0.3090598 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments

10 0.29961732 41 nips-2006-Bayesian Ensemble Learning

11 0.29656234 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression

12 0.29629004 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

13 0.29535675 174 nips-2006-Similarity by Composition

14 0.29323727 110 nips-2006-Learning Dense 3D Correspondence

15 0.29201412 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

16 0.28987604 53 nips-2006-Combining causal and similarity-based reasoning

17 0.27659711 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

18 0.26672769 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

19 0.26573968 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions

20 0.26573405 20 nips-2006-Active learning for misspecified generalized linear models

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.071), (3, 0.014), (7, 0.049), (9, 0.027), (12, 0.028), (20, 0.018), (22, 0.073), (36, 0.399), (44, 0.067), (57, 0.064), (65, 0.04), (69, 0.021), (71, 0.024), (90, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79756045 180 nips-2006-Speakers optimize information density through syntactic reduction

Author: T. F. Jaeger, Roger P. Levy

2 0.3529126 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

Author: Jeremy Lewi, Robert Butera, Liam Paninski

Abstract: Adaptively optimizing experiments can signiﬁcantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efﬁcient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efﬁciently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efﬁciently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efﬁciency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be deﬁned a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in ﬁnding a neuron’s “preferred stimulus,” is the stimulus which maximizes the ﬁring rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to deﬁne the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identiﬁcation approach: we deﬁne the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in † ‡ http://www.prism.gatech.edu/∼gtg120z http://www.stat.columbia.edu/∼liam which the probability p(rt |{xt , xt−1 , ..., xt−tk }, {rt−1 , . . . , rt−ta }) of the neural response rt given the current and past stimuli {xt , xt−1 , ..., xt−tk }, and the observed recent history of the neuron’s activity, {rt−1 , . . . , rt−ta }, can be described by a model p(rt |{xt }, {rt−1 }, θ), speciﬁed by a ﬁnite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. Two inconvenient facts make it difﬁcult to realize this goal in a computationally efﬁcient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt |{xt }, {rt−1 }, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ|{rt }, {xt }) given new stimulus-response data, and 2) ﬁnd the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on ﬁnding solutions which scale well in high dimensions. We solve problem (1) by using efﬁcient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable onedimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efﬁcient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous ﬁring rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; brieﬂy, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-ﬁre cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as: tk λt = E(rt ) = f ta aj rt−j ki,t−l xi,t−l + i l=1 (1) j=1 In the above summation the ﬁlter coefﬁcients ki,t−l capture the dependence of the neuron’s instantaneous ﬁring rate λt on the ith component of the vector stimulus at time t − l, xt−l ; the model therefore allows for spatiotemporal receptive ﬁelds. For convenience, we arrange all the stimulus coefﬁcients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive ﬁeld. The coefﬁcients aj model the dependence on the observed recent activity r at time t − j (these terms may reﬂect e.g. refractory effects, burstiness, ﬁring-rate adaptation, etc., depending on the value of the vector a [9]). For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown ﬁlter coefﬁcients, θ, given knowledge of the stimuli, xt , and the resulting responses rt . We chose the nonlinear stage of the GLM, the link function f (), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our ﬁrst key task is to efﬁciently update the posterior distribution of θ after t trials, p(θt |xt , rt ), as new stimulus-response pairs are trial 100 trial 500 trial 2500 trial 5000 θ true 1 info. max. trial 0 0 random −1 (a) random info. max. 2000 Time(Seconds) Entropy 1500 1000 500 0 −500 0 1000 2000 3000 Iteration (b) 4000 5000 0.1 total time diagonalization posterior update 1d line Search 0.01 0.001 0 200 400 Dimensionality 600 (c) Figure 1: A) Plots of the estimated receptive ﬁeld for a simulated visual neuron. The neuron’s receptive ﬁeld θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt . The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-ﬁt by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std. observed. (We use xt and rt to abbreviate the sequences {xt , . . . , x0 } and {rt , . . . , r0 }.) To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justiﬁed by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt |xt , rt ): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt . In general, computing these terms directly requires O(td2 + d3 ) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simpliﬁes matters, we use Bayes to write out the posterior: 1 −1 log p(θ|rt , xt ) = − (θ − µt−1 )T Ct−1 (θ − µt−1 ) + − exp {xt ; rt−1 }T θ 2 + rt {xt ; rt−1 }T θ + const d log p(θ|rt , xt ) −1 = −(θ − µt−1 )T Ct−1 + (2) − exp({xt ; rt−1 }T θ) + rt {xt ; rt−1 }T dθ d2 log p(θ|rt , xt ) −1 = −Ct−1 − exp({xt ; rt−1 }T θ){xt ; rt−1 }{xt ; rt−1 }T dθi dθj (3) Now, to update µt we only need to ﬁnd the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that that the likelihood only varies along a single direction, {xt ; rt−1 }, as a function of θ. At the peak of the posterior, µt , the ﬁrst term in the gradient must be parallel to {xt ; rt−1 } because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1 }. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1 }; this requires only O(d2 ) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1 , which can be evaluated in O(d2 ) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efﬁciency; our simulations (data not shown) showed that, despite this improved efﬁciency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the ﬁring rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 ) (4) with respect to the stimulus xt+1 . The ﬁrst term does not depend on xt+1 , so maximizing the information requires minimizing the conditional entropy H(θ|xt+1 , rt+1 ) = p(rt+1 |xt+1 ) −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 )dθ = Ert+1 |xt+1 log det[Ct+1 ] + const. rt+1 (5) We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize Ert+1 |xt+1 log det[Ct+1 ] with respect to xt+1 . Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have: −1 Ct+1 = Ct + Jobs (rt+1 , xt+1 ) −1 , (6) Jobs is the observed Fisher information. Jobs (rt+1 , xt+1 ) = −∂ 2 log p(rt+1 |ε = xt θ)/∂ε2 xt+1 xt t+1 t+1 (7) Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt θ. t+1 We can use the Woodbury lemma to evaluate the inverse: Ct+1 = Ct I + D(rt+1 , ε)(1 − D(rt+1 , ε)xt Ct xt+1 )−1 xt+1 xt Ct t+1 t+1 (8) where D(rt+1 , ε) = ∂ 2 log p(rt+1 |ε)/∂ε2 . Using some basic matrix identities, log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε)xt Ct xt+1 ) t+1 = log det[Ct ] + D(rt+1 , ε)xt Ct xt+1 t+1 + o(D(rt+1 , ε)xt Ct xt+1 ) t+1 (9) (10) Ignoring the higher order terms, we need to minimize Ert+1 |xt+1 D(rt+1 , ε)xt Ct xt+1 . In our case, t+1 with f (θt xt+1 ) = exp(θt xt+1 ), we can use the moment-generating function of the multivariate Trial info. max. i.i.d 2 400 −10−4 0 0.05 −10−1 −2 ai 800 2 0 −2 −7 −10 i i.i.d k info. max. 1 1 50 i 1 50 i 1 10 1 i (a) i 100 0 −0.05 10 1 1 (b) i 10 (c) Figure 2: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefﬁcients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefﬁcients, k. B) The estimated spike history coefﬁcients, a. C) The ﬁnal estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a. Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we ﬁnd that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize 1 F (xt+1 ) = exp(xT µt ) exp( xT Ct xt+1 )xT Ct xt+1 . t+1 t+1 2 t+1 (11) Computing the optimal stimulus. For the GLM the most informative stimulus is undeﬁned, since increasing the stimulus power ||xt+1 ||2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a onedimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields: 1 2 2 F (xt+1 ) = exp( u i yi + ci yi ) ci yi (12) 2 i i i = g( 2 ci yi ) ui yi )h( i (13) i where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly deﬁned by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by ﬁxing the value of i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of i ui yi will then ﬁnd the global maximum of F (.). This approach is summarized by the equation: max F (y) = max g(b) · y:||y||2 =e b max y:||y||2 =e,y t u=b 2 ci yi ) h( i Since h() is increasing, to solve the inner problem we only need to solve: 2 ci yi max y:||y||2 =e,y t u=b (14) i This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of 1 true θ random info. max. info. max. no diffusion 1 0.8 0.6 trial 0.4 0.2 400 0 −0.2 −0.4 800 1 100 θi 1 θi 100 1 θi 100 1 θ i 100 −0.6 random info. max. θ true θ i 1 0 −1 Entropy θ i 1 0 −1 random info. max. 250 200 i 1 θ Trial 400 Trial 200 Trial 0 (a) 0 −1 20 40 (b) i 60 80 100 150 0 200 400 600 Iteration 800 (c) Figure 3: Estimating the receptive ﬁeld when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively. equations for the optimal yi as a function of the Lagrange multiplier λ1 . ui e yi (λ1 ) = ||y||2 2(ci − λ1 ) (15) Thus to ﬁnd the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (). It is possible to show (details omitted) that the maximum of F () must occur on the interval λ1 ≥ c0 , where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch signiﬁcantly faster and more stable. To summarize, updating the posterior and ﬁnding the optimal stimulus requires three steps: 1) a rankone matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d3 ), while the other steps, as discussed above, are O(d2 ). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modiﬁcation of Ct−1 , and there exist efﬁcient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modiﬁcation of the eigendecomposition is still O(d3 ), we found the average running time in our case to be O(d2 ) (Fig. 1(c)), due to deﬂation which reduces the cost of matrix multiplications associated with ﬁnding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d2 ) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we ﬁxed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to ﬁnd the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefﬁcients, k, and the spike history coefﬁcients,a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion: θt+1 = θt + wt (16) Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q. To update the posterior and choose the optimal stimulus, we use the same procedure as described above1 . Results Our ﬁrst simulation considered the use of our algorithm for learning the receptive ﬁeld of a visually sensitive neuron. We took the neuron’s receptive ﬁeld to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information maximizing stimuli compared to using random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µt in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are supported by the conclusion of [7] that the information maximization strategy is asymptotically never worse than using random stimuli and is in general more efﬁcient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual core Intel 2.80GHz XEON processor running Matlab. The solid lines indicate ﬁtted polynomials of degree 1 for the 1d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d2 ), as predicted. When θ was less than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments. In Fig. 2 we apply our algorithm to characterize the receptive ﬁeld of a neuron whose response depends on its past spiking. Here, the stimulus coefﬁcients k were chosen to follow a sine-wave; 1 The one difference is that the covariance matrix of p(θt+1 |xt+1 , rt+1 ) is in general no longer just a rankone modiﬁcation of the covariance matrix of p(θt |xt , rt ); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of Ct + Q are those of Ct and the eigenvalues are ci + c where ci is the ith eigenvalue of Ct ; thus in this case, our methods may be applied without modiﬁcation. the spike history coefﬁcients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefﬁcients and the spike history coefﬁcients. The information maximizing strategy outperformed random sampling for estimating both the spike history and stimulus coefﬁcients. Our ﬁnal set of results, Fig. 3, considers a neuron whose receptive ﬁeld drifts non-systematically with time. We take the receptive ﬁeld to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context. Conclusion We have developed an efﬁcient implementation of an algorithm for online optimization of neurophysiology experiments based on information-theoretic criterion. Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations the algorithm consistently provides signiﬁcant improvements over random sampling; indeed, the differences in efﬁciency are large enough that the information-optimization strategy may permit robust system identiﬁcation in cases where it is simply not otherwise feasible to estimate the neuron’s parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique signiﬁcantly extends the reach and power of classical neurophysiology methods. Acknowledgments JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] I. Nelken, et al., Hearing Research 72, 237 (1994). P. Foldiak, Neurocomputing 38–40, 1217 (2001). K. Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004). R. C. deCharms, et al., Science 280, 1439 (1998). C. Machens, et al., Neuron 47, 447 (2005). A. Watson, et al., Perception and Psychophysics 33, 113 (1983). L. Paninski, Neural Computation 17, 1480 (2005). P. McCullagh, et al., Generalized linear models (Chapman and Hall, London, 1989). L. Paninski, Network: Computation in Neural Systems 15, 243 (2004). E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn. P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001). E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001). F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001). L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004). M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994). N. A. Lesica, et al., IEEE Trans. On Neural Systems And Rehabilitation Engineering 13, 194 (2005).

3 0.351706 175 nips-2006-Simplifying Mixture Models through Function Approximation

Author: Kai Zhang, James T. Kwok

Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring signiﬁcant computational beneﬁts in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simpliﬁed models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.

4 0.35064876 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

Author: Andrea Vedaldi, Stefano Soatto

Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off ﬁdelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational beneﬁts of the approach to the some of the problems on which IC has been demonstrated. 1

5 0.35045266 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classiﬁcation problem in feature space is contained up to negligible error in a ﬁnite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efﬁcient implicit representations of the data to perform classiﬁcation. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classiﬁcation. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classiﬁcation results. 1

6 0.35035962 20 nips-2006-Active learning for misspecified generalized linear models

7 0.34989747 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

8 0.34941012 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

9 0.3491444 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

10 0.34782302 33 nips-2006-Analysis of Representations for Domain Adaptation

11 0.34676087 61 nips-2006-Convex Repeated Games and Fenchel Duality

12 0.34457457 79 nips-2006-Fast Iterative Kernel PCA

13 0.34418955 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

14 0.34349084 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

15 0.34317461 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

16 0.34309858 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

17 0.34308615 98 nips-2006-Inferring Network Structure from Co-Occurrences

18 0.3425248 194 nips-2006-Towards a general independent subspace analysis

19 0.34236369 21 nips-2006-AdaBoost is Consistent

20 0.34202105 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning