nips nips2007 nips2007-95 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). [sent-8, score-1.293]
2 The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. [sent-10, score-0.91]
3 1 Introduction Most contemporary SMT systems view parallel data as independent sentence-pairs whether or not they are from the same document-pair. [sent-12, score-0.096]
4 Consequently, translation models are learned only at sentence-pair level, and document contexts – essential factors for translating documents – are generally overlooked. [sent-13, score-0.483]
5 Indeed, translating documents differs considerably from translating a group of unrelated sentences. [sent-14, score-0.194]
6 One should avoid destroying a coherent document by simply translating it into a group of sentences which are indifferent to each other and detached from the context. [sent-16, score-0.217]
7 Developments in statistics, genetics, and machine learning have shown that latent semantic aspects of complex data can often be captured by a model known as the statistical admixture (or mixed membership model [4]). [sent-17, score-0.205]
8 Statistically, an object is said to be derived from an admixture if it consists of a bag of elements, each sampled independently or coupled in a certain way, from a mixture model. [sent-18, score-0.112]
9 In the context of SMT, each parallel document-pair is treated as one such object. [sent-19, score-0.096]
10 Variants of admixture models have appeared in population genetics [6] and text modeling [1, 4]. [sent-24, score-0.122]
11 Recently, a Bilingual Topic-AdMixture (BiTAM) model was proposed to capture the topical aspects of SMT [10]; word-pairs from a parallel document-pair follow the same weighted mixtures of translation lexicons, inferred for the given document-context. [sent-25, score-0.631]
12 However, they do not capture locality constraints of word alignment, i.e., [sent-27, score-0.191]
13 words “close-in-source” are usually aligned to words “close-in-target”, under document-specific topical assignment. [sent-29, score-0.28]
14 To incorporate such constituents, we integrate the strengths of both HMM and BiTAM, and propose a Hidden Markov Bilingual Topic-AdMixture model, or HM-BiTAM, for word alignment to leverage both locality constraints and topical context underlying parallel document-pairs. [sent-30, score-0.683]
15 In the HM-BiTAM framework, one can estimate topic-specific word-to-word translation lexicons (lexical mappings), as well as the monolingual topic-specific word-frequencies for both languages, based on parallel document-pairs. [sent-31, score-0.87]
16 The resulting model offers a principled way of inferring optimal translation from a given source language in a context-dependent fashion. [sent-32, score-0.371]
17 We show our model’s effectiveness on the word-alignment task; we also demonstrate two application aspects which were untouched in [10]: the utility of HM-BiTAM for bilingual topic exploration, and its application for improving translation qualities. [sent-34, score-0.896]
18 2 Revisit HMM for SMT An SMT system can be formulated as a noisy-channel model [2]: $e^* = \arg\max_e P(e|f) = \arg\max_e P(f|e)P(e)$ (1), where a translation corresponds to searching for the target sentence $e^*$ which explains the source sentence $f$ best. [sent-35, score-0.417]
19 The key component is $P(f|e)$, the translation model; $P(e)$ is the monolingual language model. [sent-36, score-0.635]
20 An HMM implements the “proximity-bias” assumption — that words “close-in-source” are aligned to words “close-in-target”, which is effective for improving word alignment accuracies, especially for linguistically close language-pairs [8]. [sent-38, score-0.537]
21 Following [8], to model word-to-word translation, we introduce the mapping $j \rightarrow a_j$, which assigns a French word $f_j$ in position $j$ to an English word $e_i$ in position $i = a_j$, denoted as $e_{a_j}$. [sent-39, score-0.687]
22 Each (ordered) French word $f_j$ is an observation, and it is generated by an HMM state defined as $[e_{a_j}, a_j]$, where the alignment indicator $a_j$ for position $j$ is considered to have a dependency on the previous alignment $a_{j-1}$. [sent-40, score-0.863]
23 Thus a first-order HMM for an alignment between $e \equiv e_{1:I}$ and $f \equiv f_{1:J}$ is defined as: $p(f_{1:J}|e_{1:I}) = \sum_{a_{1:J}} \prod_{j=1}^{J} p(f_j|e_{a_j})\, p(a_j|a_{j-1})$ (2), where $p(a_j|a_{j-1})$ is the state transition probability; $J$ and $I$ are the sentence lengths of the French and English sentences, respectively. [sent-41, score-0.278]
24 An additional pseudo word “NULL” is used at the beginning of English sentences for the HMM to start with. [sent-43, score-0.242]
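To make Eq. (2) concrete, here is a minimal sketch (not the authors' code; the array names trans_lex and jump are hypothetical placeholders) that computes the HMM alignment likelihood by summing over alignments with the standard forward recursion, treating the pseudo word NULL as English position 0.

```python
import numpy as np

def hmm_alignment_likelihood(f_ids, e_ids, trans_lex, jump):
    """p(f_1:J | e_1:I) = sum_a prod_j p(f_j | e_{a_j}) p(a_j | a_{j-1})  -- Eq. (2).

    f_ids     : length-J list of foreign (French) word ids
    e_ids     : length-I list of English word ids; e_ids[0] is the pseudo word NULL
    trans_lex : (V_e, V_f) array, trans_lex[e, f] = p(f | e)
    jump      : (I, I) array, jump[i_prev, i] = p(a_j = i | a_{j-1} = i_prev)
    """
    I, J = len(e_ids), len(f_ids)

    def emit(j):
        # emission probabilities p(f_j | e_i) for every English position i
        return np.array([trans_lex[e, f_ids[j]] for e in e_ids])

    # forward recursion: alpha[i] = p(f_1..f_j, a_j = i); start uniformly over states
    alpha = np.full(I, 1.0 / I) * emit(0)
    for j in range(1, J):
        alpha = (alpha @ jump) * emit(j)   # transition to a_j, then emit f_j
    return float(alpha.sum())
```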
25 3 Hidden Markov Bilingual Topic-AdMixture We assume that in training corpora of bilingual documents, the document-pair boundaries are known, and indeed they serve as the key information for defining document-specific topic weights underlying aligned sentence-pairs or word-pairs. [sent-48, score-0.653]
26 To simplify the outline, the topics here are sampled at the sentence-pair level; topics sampled at the word-pair level can be easily derived following the outlined algorithms, in the same spirit as [10]. [sent-49, score-0.48]
27 Given a document-pair (F, E) containing N parallel sentence-pairs $(e_n, f_n)$, HM-BiTAM implements the following generative scheme. [sent-50, score-0.155]
28 The sentence-pairs $\{f_n, e_n\}$ are drawn independently from a mixture of topics. [sent-54, score-0.071]
29 For each sentence-pair $(f_n, e_n)$: (a) $z_n \sim \mathrm{Multinomial}(\theta)$: sample the topic; (b) $e_{n,1:I_n} \mid z_n \sim P(e_n \mid z_n; \beta)$: sample all English words from a monolingual topic model (e. [sent-58, score-0.89]
30 g., a unigram model); (c) for each position $j_n = 1, \ldots, J_n$: [sent-60, score-0.223]
31 i. $a_{j_n} \sim P(a_{j_n} \mid a_{j_n - 1}; T)$: sample an alignment link $a_{j_n}$ from a first-order Markov process; ii. [sent-64, score-0.495]
32 $f_{j_n} \sim P(f_{j_n} \mid e_n, a_{j_n}, z_n; B)$: sample a foreign word $f_{j_n}$ according to a topic-specific translation lexicon. [sent-65, score-1.221]
33 Under an HM-BiTAM model, each sentence-pair consists of a mixture of latent bilingual topics; each topic is associated with a distribution over bilingual word-pairs. [sent-66, score-0.908]
34 Each word f is generated by two hidden factors: a latent topic z drawn from a document-specific distribution over K topics, and the English word e identified by the hidden alignment variable a. [sent-67, score-0.921]
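For readers who prefer code to bullet points, the following sketch mirrors the generative scheme above, with topics sampled at the sentence-pair level. All parameter arrays (alpha, beta, B, T) and the fixed sentence lengths are illustrative assumptions, not the paper's exact parameterization; rows of beta and B are assumed normalized.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_document_pair(N, alpha, beta, B, T, sent_len=(8, 10)):
    """Generate N sentence-pairs from the HM-BiTAM generative scheme (sketch).

    alpha : (K,) Dirichlet prior over topics
    beta  : (K, V_e) topic-specific English unigram distributions
    B     : (K, V_e, V_f) topic-specific translation lexicons p(f | e, z)
    T     : (I_max, I_max) alignment jump matrix p(a_j | a_{j-1})
    """
    I, J = sent_len                                   # fixed lengths, for simplicity
    theta = rng.dirichlet(alpha)                      # document-level topic weights
    doc = []
    for _ in range(N):
        z = rng.choice(len(alpha), p=theta)           # (a) topic of this sentence-pair
        e = rng.choice(beta.shape[1], size=I, p=beta[z])   # (b) English words
        a_prev, a, f = 0, [], []                      # start jumps from position 0 (NULL)
        for _ in range(J):                            # (c) foreign positions
            p_jump = T[a_prev, :I] / T[a_prev, :I].sum()
            a_j = int(rng.choice(I, p=p_jump))        # (c-i) alignment link
            f_j = int(rng.choice(B.shape[2], p=B[z, e[a_j]]))  # (c-ii) foreign word
            a.append(a_j); f.append(f_j); a_prev = a_j
        doc.append((e, f, a, z))
    return theta, doc
```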
35 3.2 Extracting Bilingual Topics from HM-BiTAM Because of the parallel nature of the data, the topics of English and the foreign language will share similar semantic meanings. [sent-69, score-0.513]
36 As shown in Figure 1(b), both the English and foreign topics are sampled from the same distribution θ, which is a document-specific topic-weight vector. [sent-71, score-0.355]
37 , unigram) of the foreign word $f_w$ under topic $k$ can be computed by $P(f_w|k) = \sum_{e} P(f_w|e, B_k) P(e|\beta_k)$ (3). [sent-75, score-0.562]
38 As a result, HM-BiTAM can actually be used as a bilingual topic explorer in the LDA style and beyond. [sent-76, score-0.552]
39 Given paired documents, it can extract the representations of each topic in both languages in a consistent fashion (which is not guaranteed if topics are extracted separately from each language using, e. [sent-77, score-0.56]
40 g., LDA), as well as the lexical mappings under each topic, based on a maximum-likelihood or Bayesian principle. [sent-79, score-0.089]
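Equation (3) is a simple marginalization over the English vocabulary; below is a hedged sketch (array names are assumptions) of how the foreign-language topic-word distributions could be read off and browsed LDA-style.

```python
import numpy as np

def foreign_topic_unigrams(B, beta):
    """P(f | k) = sum_e P(f | e, B_k) P(e | beta_k)   -- Eq. (3).

    B    : (K, V_e, V_f) topic-specific translation lexicons
    beta : (K, V_e)      topic-specific English unigram distributions
    returns a (K, V_f) array of foreign-language topic-word distributions
    """
    # einsum marginalizes the English word e out of each topic's lexicon
    return np.einsum('kev,ke->kv', B, beta)

def top_words(dist_k, id2word, n=10):
    """Top-n words for one topic, for LDA-style inspection."""
    return [id2word[i] for i in np.argsort(dist_k)[::-1][:n]]
```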
41 4 Learning and Inference We sketch a generalized mean-field approximation scheme for inferring latent variables in HM-BiTAM, and a variational EM algorithm for estimating model parameters. [sent-83, score-0.092]
42 Equation (6) represents the approximate posterior of the topic weights for each sentence-pair $(f_n, e_n)$. [sent-95, score-0.292]
43 The topical information for updating $\phi_n$ is collected from three aspects: aligned word-pairs weighted by the corresponding topic-specific translation lexicon probabilities, the topical distributions of the monolingual English language model, and the smoothing factors from the topic prior. [sent-96, score-1.313]
44 Equation (7) gives the approximate posterior probability for the alignment between the j-th word in $f_n$ and the i-th word in $e_n$, in the form of an exponential model. [sent-97, score-0.743]
45 Inference of optimum word-alignment One of the translation model’s goals is to infer the optimum word alignment: $a^* = \arg\max_a P(a|F, E)$. [sent-99, score-0.481]
46 The variational inference scheme described above leads to an approximate alignment posterior q(a|λ), which is in fact a reparameterized HMM. [sent-100, score-0.326]
47 Thus, extracting the optimum alignment amounts to applying a Viterbi algorithm to q(a|λ). [sent-101, score-0.231]
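Since q(a|λ) is itself an HMM, the optimum alignment can be extracted with a standard Viterbi pass. The sketch below assumes the variational step has already produced per-position log emission scores and a log jump matrix for q(a|λ); both names are hypothetical placeholders rather than the authors' interface.

```python
import numpy as np

def viterbi_alignment(log_emit, log_jump):
    """argmax_a q(a | lambda) for one sentence-pair.

    log_emit : (J, I) log-scores for aligning foreign position j to English position i
    log_jump : (I, I) log transition scores log q(a_j = i | a_{j-1} = i_prev)
    returns the optimum alignment a* as a length-J list of English positions
    """
    J, I = log_emit.shape
    delta = log_emit[0].copy()                 # best log-score ending in state i at j = 0
    back = np.zeros((J, I), dtype=int)
    for j in range(1, J):
        scores = delta[:, None] + log_jump     # indexed (i_prev, i)
        back[j] = scores.argmax(axis=0)        # best predecessor for each state i
        delta = scores.max(axis=0) + log_emit[j]
    # backtrace from the best final state
    a = [int(delta.argmax())]
    for j in range(J - 1, 0, -1):
        a.append(int(back[j, a[-1]]))
    return a[::-1]
```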
48 5 Experiments In this section, we investigate three main aspects of the HM-BiTAM model, including word alignment, bilingual topic exploration, and machine translation. [sent-107, score-0.797]
49 The training data is a collection of parallel document-pairs, with document boundaries explicitly given. [sent-114, score-0.164]
50 As shown in Table 1, our training corpora are general newswire, covering topics mainly about economics, politics, education, and sports. [sent-115, score-0.277]
51 This test set contains relatively long sentence-pairs, with an average sentence length of 40. [sent-118, score-0.047]
52 The long sentences introduce more ambiguities for alignment tasks. [sent-120, score-0.282]
53 For testing translation quality, TIDES’02 MT evaluation data is used as development data, and ten documents from TIDES’04 MT-evaluation are used as the unseen test data. [sent-121, score-0.424]
54 BLEU scores are reported to evaluate translation quality with HM-BiTAM models. [sent-122, score-0.29]
55 5.1 Empirical Validation Word Alignment Accuracy We trained HM-BiTAMs with ten topics using parallel corpora of sizes ranging from 6M to 22. [sent-124, score-0.405]
56 Following the same logic as for all BiTAMs in [10], we choose the HM-BiTAM in which topics are sampled at the word-pair level over the sentence-pair level. [sent-126, score-0.24]
57 Figure 2 shows the alignment accuracies of HM-BiTAM, in comparison with those of the baseline HMM, the baseline BiTAM, and IBM Model-4. [sent-129, score-0.258]
58 …different models trained on corpora of different sizes (Figure 2 caption fragment). [sent-140, score-0.066]
59 In HM-BiTAM, two factors contribute to narrowing down the word-alignment decisions: the position and the lexical mapping. [sent-143, score-0.086]
60 The emission lexical probability, however, is different: each state is a mixture of topic-specific translation lexicons, whose weights are inferred using document contexts. [sent-145, score-0.445]
61 The topic-specific translation lexicons are sharper and smaller than the global one used in HMM. [sent-146, score-0.477]
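As a hedged illustration of this point, the effective emission probability under HM-BiTAM can be viewed as a topic-weighted mixture of the topic-specific lexicons; the array names below are assumptions for illustration only.

```python
import numpy as np

def mixed_emission(B, phi_doc, e_id, f_id):
    """Effective p(f | e, document) as a mixture of topic-specific lexicons.

    B       : (K, V_e, V_f) topic-specific translation lexicons p(f | e, k)
    phi_doc : (K,) inferred document-level topic weights
    """
    # weight each topic's lexicon entry by the document's inferred topic posterior
    return float(phi_doc @ B[:, e_id, f_id])
```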
62 Not surprisingly, HM-BiTAM also outperforms the baseline-BiTAM significantly, because BiTAM captures only the topical aspects and ignores the proximity bias. [sent-148, score-0.219]
63 However, IBM Model-4 does not have a scheme to adjust its lexicon probabilities specific to the document's topical context, as HM-BiTAM does. [sent-159, score-0.16]
64 In a way, HM-BiTAM wins over IBM-4 by leveraging topic models that capture the document context. [sent-160, score-0.289]
65 Overall the likelihoods under HM-BiTAM are significantly better than those under HMM and IBM Model-4, revealing the better modeling power of HM-BiTAM. [sent-163, score-0.066]
66 As shown in Table 2, the likelihoods of HM-BiTAM on these unseen data dominate significantly over those of HMM, BiTAM, and the IBM Models in every case, confirming that HM-BiTAM indeed offers a better fit and generalizability for the bilingual document-pairs. [sent-166, score-0.417]
67 Table 2 (flattened; the likelihood and perplexity values are not recoverable here) lists, per document, the publisher and genre with columns IBM-1, HMM, IBM-4, BiTAM, HM-BiTAM: AgenceFrance (AFP) ×3 (news), ForeignMinistryPRC (speech), HongKongNews (speech), People’s Daily (editorial), United Nation (speech), XinHua News ×2 (news), ZaoBao News (editorial). [sent-167, score-0.423]
68 Table 2: Likelihoods of unseen documents under HM-BiTAMs, in comparison with competing models. [sent-222, score-0.102]
69 5.2 Application 1: Bilingual Topic Extraction Monolingual topics: HM-BiTAM facilitates inference of the latent LDA-style representations of topics [1] in both English and the foreign language (i.e., Chinese). [sent-224, score-0.46]
70 The English topics (represented by the topic-specific word frequencies) can be directly read off from the HM-BiTAM parameters β. [sent-227, score-0.402]
71 Even though the topic-specific distributions of words in the Chinese corpora are not directly encoded in HM-BiTAM, one can marginalize over alignments of the parallel data to synthesize them based on the monolingual English topics and the topic-specific lexical mapping from English to Chinese. [sent-229, score-0.802]
72 The top-ranked frequent words in each topic exhibit coherent semantic meanings, and the word semantics under the same topic index are consistent across languages. [sent-231, score-0.745]
73 Under HM-BiTAM, the two respective monolingual word-distributions for the same topic are statistically coupled due to sharing of the same topic for each sentence-pair in the two languages. [sent-232, score-0.739]
74 In contrast, if one merely applies LDA to the corpora in each language separately, such coupling cannot be exploited. [sent-233, score-0.114]
75 This coupling enforces consistency between the topics across languages. [sent-234, score-0.211]
76 However, as with general clustering algorithms, the topics in HM-BiTAM do not necessarily present obvious semantic labels. [sent-235, score-0.254]
77 Figure 4: Monolingual topics of both languages learned from parallel data (only the English glosses of the Chinese topic words survive the extraction here: reporters, relations, Russian, France, ChongQing, countries, Factory, TianJin, Government, project, national, Shenzhen, take over, buy). [sent-238, score-0.354]
78 It appears that the English topics (on the left panel) are highly parallel to the Chinese ones (annotated with English gloss, on the right panel). [sent-239, score-0.307]
79 Topic-Specific Lexicon Mapping: Table 3 shows two examples of topic-specific lexicon mapping learned by HM-BiTAM. [sent-240, score-0.123]
80 Given a topic assignment, a word usually has far fewer translation candidates, and the topic-specific translation lexicons are generally much smaller and sharper. [sent-241, score-1.179]
81 Different topic-specific lexicons emphasize different aspects of translating the same source words, which cannot be captured by the IBM models or HMM. [sent-242, score-0.343]
82 Table 3 (flattened; the Chinese characters and probability values did not survive extraction) shows, for the English word “meet”, topic-specific top candidates whose meanings include sports meeting, to satisfy, to adapt, to adjust, and to see someone; some topics have no significant mapping. [sent-244, score-0.402]
83 The baselines (IBM Model-1, HMM, IBM Model-4) each list sports meeting as the meaning of their top candidate for “meet”. [sent-252, score-0.516]
84 For the English word “power”, topic-specific top candidates carry meanings such as electric power, electricity, factory, to be relevant, strength, watt, and to generate. [sent-255, score-0.16]
85 Table 3: Topic-specific translation lexicons learned by HM-BiTAM. [sent-267, score-0.477]
86 We show the top candidate (TopCand) lexicon mappings of “meet” and “power” under ten topics. [sent-268, score-0.152]
87 (The symbol “-” means there is no significant lexicon mapping under that topic. [sent-269, score-0.123]
88 ) Also shown are the semantic meanings of the mapped Chinese words, and the mapping probability p(f |e, k). [sent-270, score-0.1]
89 5.3 Application 2: Machine Translation The parallelism of topic assignment between languages modeled by HM-BiTAM, as shown in § 3. [sent-272, score-0.047]
90 4, enables a natural way of improving translation by exploiting semantic consistency and contextual coherency more explicitly and aggressively. [sent-274, score-0.333]
91 We used $p(e|f, D_F)$, computed as in Eq. (11), to score the bilingual phrase-pairs in a state-of-the-art GALE translation system trained with 250M words. [sent-277, score-0.621]
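A hedged sketch of how such topic-conditioned scoring could be wired up: mix the K topic-specific lexicon scores with the topic weights inferred from the source document D_F. The word-level factorization below (taking, per topic, the best-scoring source word for each target word), the array shapes, and the function names are illustrative assumptions, not the GALE system's actual interface.

```python
import numpy as np

def score_phrase_pair(e_words, f_words, lexicons, topic_weights):
    """p(e | f, D_F) ~= sum_k p(e | f, k) p(k | D_F), factored over word pairs.

    e_words       : list of target-word ids in the phrase pair
    f_words       : list of source-word ids in the phrase pair
    lexicons      : (K, V_f, V_e) topic-specific lexical scores p(e | f, k)
    topic_weights : (K,) document-level topic posterior inferred by HM-BiTAM
    """
    K = len(topic_weights)
    score = 1.0
    for e in e_words:
        # per topic, explain each target word by its best-scoring source word
        per_topic = np.array([max(lexicons[k, f, e] for f in f_words) for k in range(K)])
        score *= float(per_topic @ topic_weights)   # topic-weighted mixture
    return score
```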
92 Then decoding of the unseen ten MT04 documents in Table 2 was carried out. [sent-279, score-0.134]
93 Experiments use the topic assignments inferred from the ground truth and the ones inferred via HM-BiTAM; n-gram precisions together with final BLEUr4n4 scores are evaluated. [sent-303, score-0.306]
94 If we use the ground-truth translation to infer the topic weights, the improvement is from 32. [sent-305, score-0.29]
95 With topical inference from HM-BiTAM using the monolingual source document, improved n-gram precisions in the translation were observed from 1-gram to 4-gram. [sent-308, score-0.846]
96 6 Discussion and Conclusion We presented a novel framework, HM-BiTAM, for exploring bilingual topics and generalizing over the traditional HMM for improved word-alignment accuracies and translation quality. [sent-317, score-0.648]
97 A variational inference and learning procedure was developed for efficient training and application in translation. [sent-318, score-0.095]
98 We demonstrated significant improvement of word-alignment accuracy over a number of existing systems, and the interesting capability of HM-BiTAM to simultaneously extract coherent monolingual topics from both languages. [sent-319, score-0.537]
99 We also report encouraging improvement of translation quality over current benchmarks; although the margin is modest, it is noteworthy that the current version of HM-BiTAM remains a purely autonomously trained system. [sent-320, score-0.29]
100 A generalized mean field algorithm for variational inference in exponential families. [sent-375, score-0.095]
wordName wordTfidf (topN-words)
[('bilingual', 0.331), ('monolingual', 0.297), ('translation', 0.29), ('alignment', 0.231), ('hmm', 0.221), ('topic', 0.221), ('bitam', 0.215), ('topics', 0.211), ('word', 0.191), ('lexicons', 0.187), ('english', 0.169), ('topical', 0.165), ('ibm', 0.16), ('ajn', 0.132), ('jn', 0.132), ('fjn', 0.116), ('smt', 0.116), ('foreign', 0.115), ('sports', 0.098), ('parallel', 0.096), ('lexicon', 0.092), ('df', 0.088), ('admixture', 0.083), ('aj', 0.076), ('meeting', 0.074), ('news', 0.073), ('en', 0.071), ('translating', 0.069), ('document', 0.068), ('variational', 0.067), ('xinhua', 0.066), ('corpora', 0.066), ('unigram', 0.066), ('lexical', 0.061), ('chinese', 0.061), ('fn', 0.059), ('bleu', 0.058), ('documents', 0.056), ('aspects', 0.054), ('sentences', 0.051), ('dirichlet', 0.05), ('afp', 0.05), ('agencefrance', 0.05), ('ein', 0.05), ('tides', 0.05), ('topcand', 0.05), ('language', 0.048), ('sentence', 0.047), ('languages', 0.047), ('unseen', 0.046), ('semantic', 0.043), ('linguistics', 0.04), ('likelihoods', 0.04), ('zn', 0.04), ('words', 0.04), ('eaj', 0.039), ('genetics', 0.039), ('em', 0.038), ('bk', 0.037), ('aligned', 0.035), ('fw', 0.035), ('source', 0.033), ('representations', 0.033), ('bitams', 0.033), ('chongqing', 0.033), ('countries', 0.033), ('doc', 0.033), ('factory', 0.033), ('gale', 0.033), ('hiero', 0.033), ('politics', 0.033), ('precisions', 0.033), ('shenzhen', 0.033), ('twv', 0.033), ('fj', 0.033), ('ten', 0.032), ('french', 0.032), ('mapping', 0.031), ('hidden', 0.031), ('hm', 0.03), ('coherent', 0.029), ('editorial', 0.029), ('hermann', 0.029), ('bf', 0.029), ('bing', 0.029), ('someone', 0.029), ('sampled', 0.029), ('mappings', 0.028), ('eric', 0.028), ('inference', 0.028), ('accuracies', 0.027), ('meanings', 0.026), ('della', 0.026), ('power', 0.026), ('inferred', 0.026), ('position', 0.025), ('lda', 0.025), ('latent', 0.025), ('strength', 0.025), ('pietra', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
2 0.2841149 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
3 0.16954759 189 nips-2007-Supervised Topic Models
Author: Jon D. Mcauliffe, David M. Blei
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression. 1
4 0.16393445 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
5 0.15765879 183 nips-2007-Spatial Latent Dirichlet Allocation
Author: Xiaogang Wang, Eric Grimson
Abstract: In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a “bag-of-words”. It is also critical to properly design “words” and “documents” when using a language model to solve vision problems. In this paper, we propose a topic model Spatial Latent Dirichlet Allocation (SLDA), which better encodes spatial structures among visual words that are essential for solving many vision problems. The spatial information is not encoded in the values of visual words but in the design of documents. Instead of knowing the partition of words into documents a priori, the word-document assignment becomes a random hidden variable in SLDA. There is a generative procedure, where knowledge of spatial structure can be flexibly added as a prior, grouping visual words which are close in space into the same document. We use SLDA to discover objects from a collection of images, and show it achieves better performance than LDA. 1
6 0.14470981 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
7 0.14225958 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
8 0.13914798 129 nips-2007-Mining Internet-Scale Software Repositories
9 0.12544791 47 nips-2007-Collapsed Variational Inference for HDP
10 0.11979784 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
11 0.1181394 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
12 0.10526638 197 nips-2007-The Infinite Markov Model
13 0.10251486 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
14 0.094546556 9 nips-2007-A Probabilistic Approach to Language Change
15 0.07521192 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
16 0.067823917 200 nips-2007-The Tradeoffs of Large Scale Learning
17 0.046433147 169 nips-2007-Retrieved context and the discovery of semantic structure
18 0.045355413 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
19 0.044273175 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
20 0.042826578 170 nips-2007-Robust Regression with Twinned Gaussian Processes
topicId topicWeight
[(0, -0.159), (1, 0.083), (2, -0.078), (3, -0.352), (4, 0.104), (5, -0.108), (6, 0.057), (7, -0.201), (8, -0.099), (9, 0.039), (10, 0.033), (11, -0.033), (12, -0.065), (13, -0.132), (14, 0.01), (15, -0.161), (16, -0.117), (17, 0.045), (18, -0.088), (19, 0.116), (20, 0.006), (21, -0.144), (22, -0.041), (23, -0.032), (24, 0.017), (25, -0.057), (26, 0.035), (27, 0.044), (28, -0.008), (29, 0.017), (30, -0.011), (31, -0.016), (32, -0.026), (33, 0.0), (34, 0.028), (35, 0.021), (36, -0.093), (37, 0.047), (38, 0.05), (39, 0.112), (40, 0.118), (41, 0.014), (42, 0.002), (43, -0.071), (44, -0.057), (45, 0.084), (46, -0.027), (47, -0.02), (48, -0.016), (49, -0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.9733001 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
2 0.7586025 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
3 0.6483658 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
4 0.63233262 9 nips-2007-A Probabilistic Approach to Language Change
Author: Alexandre Bouchard-côté, Percy Liang, Dan Klein, Thomas L. Griffiths
Abstract: We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. This framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies. 1
5 0.56163555 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
Author: Noah Goodman, Joshua B. Tenenbaum, Michael J. Black
Abstract: For infants, early word learning is a chicken-and-egg problem. One way to learn a word is to observe that it co-occurs with a particular referent across different situations. Another way is to use the social context of an utterance to infer the intended referent of a word. Here we present a Bayesian model of cross-situational word learning, and an extension of this model that also learns which social cues are relevant to determining reference. We test our model on a small corpus of mother-infant interaction and find it performs better than competing models. Finally, we show that our model accounts for experimental phenomena including mutual exclusivity, fast-mapping, and generalization from social cues. To understand the difficulty of an infant word-learner, imagine walking down the street with a friend who suddenly says “dax blicket philbin na fivy!” while at the same time wagging her elbow. If you knew any of these words you might infer from the syntax of her sentence that blicket is a novel noun, and hence the name of a novel object. At the same time, if you knew that this friend indicated her attention by wagging her elbow at objects, you might infer that she intends to refer to an object in a nearby show window. On the other hand if you already knew that “blicket” meant the object in the window, you might be able to infer these elements of syntax and social cues. Thus, the problem of early word-learning is a classic chicken-and-egg puzzle: in order to learn word meanings, learners must use their knowledge of the rest of language (including rules of syntax, parts of speech, and other word meanings) as well as their knowledge of social situations. But in order to learn about the facts of their language they must first learn some words, and in order to determine which cues matter for establishing reference (for instance, pointing and looking at an object but normally not waggling your elbow) they must first have a way to know the intended referent in some situations. For theories of language acquisition, there are two common ways out of this dilemma. The first involves positing a wide range of innate structures which determine the syntax and categories of a language and which social cues are informative. (Though even when all of these elements are innately determined using them to learn a language from evidence may not be trivial [1].) The other alternative involves bootstrapping: learning some words, then using those words to learn how to learn more. This paper gives a proposal for the second alternative. We first present a Bayesian model of how learners could use a statistical strategy—cross-situational word-learning—to learn how words map to objects, independent of syntactic and social cues. We then extend this model to a true bootstrapping situation: using social cues to learn words while using words to learn social cues. Finally, we examine several important phenomena in word learning: mutual exclusivity (the tendency to assign novel words to novel referents), fast-mapping (the ability to assign a novel word in a linguistic context to a novel referent after only a single use), and social generalization (the ability to use social context to learn the referent of a novel word). Without adding additional specialized machinery, we show how these can be explained within our model as the result of domain-general probabilistic inference mechanisms operating over the linguistic domain. 
6 0.52147758 129 nips-2007-Mining Internet-Scale Software Repositories
7 0.50587213 189 nips-2007-Supervised Topic Models
8 0.48287341 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
9 0.48111936 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
10 0.4679786 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
11 0.45705435 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
12 0.45655543 47 nips-2007-Collapsed Variational Inference for HDP
13 0.43354046 183 nips-2007-Spatial Latent Dirichlet Allocation
14 0.38925415 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
15 0.37609702 197 nips-2007-The Infinite Markov Model
16 0.32650146 63 nips-2007-Convex Relaxations of Latent Variable Training
17 0.25354975 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
18 0.24848144 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
19 0.24666741 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
20 0.22165333 200 nips-2007-The Tradeoffs of Large Scale Learning
topicId topicWeight
[(5, 0.051), (13, 0.148), (16, 0.022), (18, 0.017), (21, 0.049), (31, 0.023), (34, 0.012), (35, 0.02), (45, 0.017), (47, 0.055), (49, 0.024), (82, 0.02), (83, 0.069), (85, 0.027), (87, 0.082), (90, 0.039), (96, 0.239)]
simIndex simValue paperId paperTitle
same-paper 1 0.82073295 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
2 0.60821122 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
Author: Fred Richardson, William M. Campbell
Abstract: Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive sequences, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.
3 0.5946275 191 nips-2007-Temporal Difference Updating without a Learning Rate
Author: Marcus Hutter, Shane Legg
Abstract: We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. 1
4 0.58132094 14 nips-2007-A configurable analog VLSI neural network with spiking neurons and self-regulating plastic synapses
Author: Massimiliano Giulioni, Mario Pannunzi, Davide Badoni, Vittorio Dante, Paolo D. Giudice
Abstract: We summarize the implementation of an analog VLSI chip hosting a network of 32 integrate-and-fire (IF) neurons with spike-frequency adaptation and 2,048 Hebbian plastic bistable spike-driven stochastic synapses endowed with a selfregulating mechanism which stops unnecessary synaptic changes. The synaptic matrix can be flexibly configured and provides both recurrent and AER-based connectivity with external, AER compliant devices. We demonstrate the ability of the network to efficiently classify overlapping patterns, thanks to the self-regulating mechanism.
5 0.57838476 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
6 0.5724768 62 nips-2007-Convex Learning with Invariances
7 0.52269346 84 nips-2007-Expectation Maximization and Posterior Constraints
8 0.50221729 59 nips-2007-Continuous Time Particle Filtering for fMRI
9 0.49422422 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
10 0.48927921 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
11 0.48635662 117 nips-2007-Learning to classify complex patterns using a VLSI network of spiking neurons
12 0.48503554 9 nips-2007-A Probabilistic Approach to Language Change
13 0.47989452 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation
14 0.47276306 86 nips-2007-Exponential Family Predictive Representations of State
15 0.47034186 63 nips-2007-Convex Relaxations of Latent Variable Training
16 0.46988499 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
17 0.46932575 189 nips-2007-Supervised Topic Models
18 0.46903229 113 nips-2007-Learning Visual Attributes
19 0.46642771 102 nips-2007-Incremental Natural Actor-Critic Algorithms
20 0.46625191 129 nips-2007-Mining Internet-Scale Software Repositories