nips nips2007 nips2007-9 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexandre Bouchard-Côté, Percy Liang, Dan Klein, Thomas L. Griffiths
Abstract: We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. This framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies.
Reference: text
sentIndex sentText sentNum sentScore
1 We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. [sent-4, score-0.912]
2 The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies. [sent-5, score-0.644]
3 A classical example is the hypothetical Proto-Indo-European language, the reconstructed common ancestor of the modern Indo-European languages. [sent-8, score-0.216]
4 The study of how languages change over time is known as diachronic (or historical) linguistics. [sent-10, score-0.561]
5 Most of what we know about language change comes from the comparative method, in which words from different languages are compared in order to identify their relationships. [sent-13, score-0.455]
6 The goal is to identify regular sound correspondences between languages and use these correspondences to infer the forms of proto-languages and the phylogenetic relationships between languages. [sent-14, score-0.571]
7 The motivation for basing the analysis on sounds is that phonological changes are generally more systematic than syntactic or morphological changes. [sent-15, score-0.369]
8 Comparisons of words from different languages are traditionally carried out by hand, introducing an element of subjectivity into diachronic linguistics. [sent-16, score-0.482]
9 Early attempts to quantify the similarity between languages (e.g., [15]) made drastic simplifying assumptions that drew strong criticism from diachronic linguists. [sent-17, score-0.216] [sent-19, score-0.216]
11 In particular, many of these approaches simply represent the appearance of a word in two languages with a single bit, rather than allowing for gradations based on correspondences between sequences of phonemes. [sent-20, score-0.502]
12 We take a quantitative approach to diachronic linguistics that alleviates this problem by operating at the phoneme level. [sent-21, score-0.495]
13 Following [3], we use this information to estimate a contextualized model of phonological change expressed as a probability distribution over rules applied to individual phonemes. [sent-26, score-0.582]
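For concreteness, one possible in-memory representation of such a context-specific rule (our own sketch, not code from the paper) pairs a single phoneme with its context and a rewrite outcome:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    left: str     # context to the left ('#' marks a word boundary)
    phoneme: str  # the phoneme being rewritten
    right: str    # context to the right
    outcome: str  # '' encodes deletion; a multi-symbol string encodes insertion

# e.g., the word-final s-deletion discussed later in the paper: s -> '' / _ #
final_s_deletion = Rule(left="*", phoneme="s", right="#", outcome="")
```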
14 For example, we can reconstruct ancestral word forms or inspect the rules learned along each branch of a phylogeny to identify sound laws. [sent-28, score-0.881]
15 Alternatively, we can observe a word in one or more modern languages, say French and Spanish, and query the corresponding word form in another language, say Italian. [sent-29, score-0.592]
16 Finally, models of this kind can potentially be used as a building block in a system for inferring the topology of phylogenetic trees [3]. [sent-30, score-0.244]
17 This approach makes it difficult to capture rules that apply at different levels of granularity. [sent-33, score-0.262]
18 Inspired by the prevalence of multi-scale rules in diachronic phonology and modern phonological theory, we develop a new scheme in which rules possess a set of features, and a distribution over rules is defined using a log-linear model. [sent-34, score-1.432]
19 We evaluate both schemes in reconstructing ancient word forms, showing that the new linguistically-motivated scheme can improve performance significantly. [sent-35, score-0.458]
20 2 Background and previous work Most previous computational approaches to diachronic linguistics have focused on the reconstruction of phylogenetic trees from a Boolean matrix indicating the properties of words in different languages [10, 6, 14, 13]. [sent-36, score-0.726]
21 These approaches descend from glottochronology [15], which measures the similarity between languages (and the time since they diverged) using the number of words in those languages that belong to the same cognate set. [sent-37, score-0.781]
22 This information is obtained from manually curated cognate lists such as the data of [5]. [sent-38, score-0.302]
23 The modern instantiations of this approach rely on sophisticated techniques for inferring phylogenies borrowed from evolutionary biology (e. [sent-39, score-0.233]
24 However, they still generally use cognate sets as the basic data for evaluating the similarity between languages (although some approaches incorporate additional manually constructed features [14]). [sent-42, score-0.512]
25 As an example of a cognate set encoding, consider the meaning “eat”. [sent-43, score-0.27]
26 There would be one column for the cognate set which appears in French as manger and Italian as mangiare since both descend from the Latin mandere (to chew). [sent-44, score-0.299]
27 There would be another column for the cognate set which appears in both Spanish and Portuguese as comer, descending from the Latin comedere (to consume). [sent-45, score-0.27]
28 In contrast, each word in our work is tracked using an automatically obtained cognate list. [sent-49, score-0.518]
29 While these cognates may be noisier, we compensate for this by modeling phonological changes rather than Boolean mutations in cognate sets. [sent-50, score-0.711]
30 Another line of computational work has explored using phonological models as a way to capture the differences between languages. [sent-51, score-0.336]
31 They use a probabilistic edit model, but do not consider the reconstruction of ancient word forms, nor do they present a learning algorithm for such models. [sent-53, score-0.764]
32 There have also been several approaches to the problem of cognate prediction in machine translation (essentially transliteration), e. [sent-54, score-0.27]
33 [12] presents a model for learning “sound laws,” general phonological changes governing two completely observed aligned cognate lists. [sent-58, score-0.639]
34 3 A generative model of phonological change In this section, we outline the framework for modeling phonological change that we will use throughout the paper. [sent-60, score-0.719]
35 Assume we have a fixed set of word types (cognate sets) in our vocabulary V and a set of languages L. [sent-61, score-0.464]
36 Each word type i has a word form w_il in each language l ∈ L, which is represented as a sequence of phonemes which might or might not be observed. [sent-62, score-0.769]
37 The languages are arranged according to some tree topology T (see Figure 2(a) for examples). [sent-63, score-0.329]
38 It is possible to also induce the topology or cognate set assignments, but in this paper we assume that the topology is fixed and cognates have already been identified. [sent-64, score-0.568]
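A minimal sketch of the data this setup assumes (our own choice of representation, not the authors' code): a fixed tree over languages and, per cognate set, one phoneme sequence per language, with unobserved ancestral forms left empty.

```python
from dataclasses import dataclass, field

@dataclass
class LanguageNode:
    name: str
    children: list = field(default_factory=list)

# One plausible topology over the languages discussed here
# (la = Latin, vl = Vulgar Latin, ib = Ibero-Romance, es = Spanish,
#  pt = Portuguese, it = Italian); the paper's exact trees are in Figure 2.
ib = LanguageNode("ib", [LanguageNode("es"), LanguageNode("pt")])
vl = LanguageNode("vl", [ib, LanguageNode("it")])
la = LanguageNode("la", [vl])

# A cognate set: one phoneme sequence per language, None when unobserved.
focus = {"la": list("fokus"), "it": list("fwOko"), "vl": None, "ib": None}
```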
39 Figure 1: (a) A description of the generative model. [sent-68, score-0.213]
40 (b) An example of edits that were used to transform the Latin word focus (/fokus/) into the Italian word fuoco (/fwOko/) (fire) along with the context-specific rules that were applied. [sent-69, score-0.948]
41 (c) The graphical model representation of our model: θ are the parameters specifying the stochastic edits e, which govern how the words w evolve. [sent-70, score-0.23]
42 The probabilistic model specifies a distribution over the word forms {w_il} for each word type i ∈ V and each language l ∈ L via a simple generative process (Figure 1(a)). [sent-71, score-0.8]
43 The generative process starts at the root language and generates all the word forms in each language in a top-down manner. [sent-72, score-0.678]
44 A root word form w consisting of n phonemes x1 · · · xn is generated with probability p_lm(x1) ∏_{j=2}^{n} p_lm(xj | xj−1), where p_lm is the distribution of the language model. [sent-74, score-0.623]
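As a sketch of this root-generation step (ours, not the authors' code; the '#' end-of-word symbol is an added assumption, since the quoted formula leaves word termination implicit):

```python
import random

def sample_from(dist, rng):
    """Draw one item from a dict mapping items to probabilities."""
    r, acc = rng.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point rounding

def sample_root_word(p_first, p_next, rng=None, max_len=20):
    """Sample x1..xn with probability p_lm(x1) * prod_j p_lm(x_j | x_{j-1});
    the '#' end symbol terminating the word is our own assumption."""
    rng = rng or random.Random(0)
    word = [sample_from(p_first, rng)]
    while word[-1] != "#" and len(word) < max_len:
        word.append(sample_from(p_next[word[-1]], rng))
    return [x for x in word if x != "#"]
```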
45 The stochastic edit model w′ ∼ Edit(w, θ) describes how a single old word form w = x1 · · · xn changes along one branch of the phylogeny with parameters θ to produce a new word form w′. [sent-75, score-1.108]
46 The generative process used in the edit model is as follows: for each phoneme xi in the old word form, walking from left to right, choose a rule to apply. [sent-77, score-0.821]
47 There are three types of rules: (1) deletion of the phoneme, (2) substitution with some phoneme (possibly the same one), or (3) insertion of another phoneme, either before or after the existing one. [sent-78, score-0.266]
48 Context-dependent rules are often used to characterize phonological changes in diachronic linguistics [4]. [sent-80, score-0.923]
49 Figure 1(b) shows an example of the rules being applied. [sent-81, score-0.239]
50 The context-dependent form of these rules allows us to represent phenomena such as the likely deletion of s in word-final positions. [sent-82, score-0.36]
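A minimal sketch of this left-to-right edit process (ours, not the authors' implementation; for brevity, insertion is folded into a multi-symbol outcome string):

```python
import random

def apply_edit_model(word, rule_dist, rng=None):
    """Walk the old word form left to right; for each phoneme in its context,
    sample one context-dependent rule. rule_dist(prev, x, nxt) is assumed to
    return a dict outcome -> probability, e.g. {'o': 0.8, 'O': 0.1, '': 0.1},
    where '' is deletion and a two-symbol outcome like 'wO' encodes insertion."""
    rng = rng or random.Random(0)
    padded = ["#"] + list(word) + ["#"]   # '#' marks the word boundaries
    new_word = []
    for i in range(1, len(padded) - 1):
        dist = rule_dist(padded[i - 1], padded[i], padded[i + 1])
        outcomes, probs = zip(*dist.items())
        new_word.extend(rng.choices(outcomes, probs)[0])
    return new_word
```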
51 4 Defining distributions over rules In the model defined in the previous section, each branch (k → l) ∈ T has a collection of context-dependent rule probabilities θ_{k→l}. [sent-83, score-0.425]
52 These correspond to looking at the place of articulation (denoted A2 (c)), testing whether c is a vowel, consonant, or boundary symbol (A1 (c)), and the trivial wildcard partition (A0 (c)), which allows rules to be insensitive to c. [sent-105, score-0.239]
53 It also provides a connection with contemporary phonological theory. [sent-108, score-0.313]
54 Recent work in computational linguistics on probabilistic forms of optimality theory has begun to use a similar approach, characterizing the distribution over word forms within a language using a log-linear model applied to features of the words [17, 9]. [sent-109, score-0.801]
55 Using similar features to define a distribution over phonological changes thus provides a connection between synchronic and diachronic linguistics in addition to a linguistically-motivated method for improving reconstruction. [sent-110, score-0.746]
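A toy version of such a log-linear rule distribution (our own simplification: a wildcard feature, a vowel/consonant/boundary class feature, and an exact-context feature, loosely standing in for the A0/A1/A2 partitions above; place of articulation and the real phoneme inventory are omitted):

```python
import math
from collections import defaultdict

def klass(c):
    # Toy vowel inventory, an assumption for this sketch.
    return "#" if c == "#" else ("V" if c in "aeiouO" else "C")

def features(prev, x, outcome, nxt):
    return [("wild", x, outcome),                            # ~ A0 (wildcard)
            ("class", klass(prev), x, outcome, klass(nxt)),  # ~ A1 (classes)
            ("exact", prev, x, outcome, nxt)]                # finest granularity

def rule_distribution(prev, x, nxt, weights, outcomes):
    """p(x -> outcome | context) ∝ exp(sum of active feature weights)."""
    score = {o: sum(weights[f] for f in features(prev, x, o, nxt))
             for o in outcomes}
    z = sum(math.exp(s) for s in score.values())
    return {o: math.exp(s) / z for o, s in score.items()}

w = defaultdict(float)            # all-zero weights give a uniform choice
print(rule_distribution("u", "s", "#", w, ["s", "z", ""]))
```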
56 The algorithm iterates between a stochastic E-step, which computes reconstructions based on the current edit parameters, and an M-step, which updates the edit parameters based on the reconstructions. [sent-112, score-0.692]
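In outline, the training loop looks as follows (a generic skeleton of ours, with the two steps passed in as functions; plausible implementations of each step are sketched after the next paragraphs):

```python
def train_em(corpus, tree, theta0, e_step, m_step, n_iters=10):
    """Alternate a Monte Carlo E-step, which returns expected edit counts
    under the current parameters, with an M-step that refits them."""
    theta = theta0
    for _ in range(n_iters):
        counts = e_step(corpus, tree, theta)   # sample reconstructions
        theta = m_step(counts)                 # update rule probabilities
    return theta
```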
57 1 Monte Carlo E-step: sampling the edits The E-step computes the expected sufficient statistics required for the M-step, which in our case is the expected number of times each edit (such as o → O) was used in each context. [sent-114, score-0.503]
58 An exact E-step would require summing over all possible edits involving all languages in the phylogeny (all unobserved {e}, {w} variables in Figure 1(c)), which does not permit a tractable dynamic program. [sent-116, score-0.459]
59 Therefore, we resort to a Monte Carlo E-step, where many samples of the edit variables are collected, and counts are computed based on these samples. [sent-117, score-0.323]
60 Samples are drawn using Gibbs sampling [8]: for each word form of a particular language w_il, we fix all other variables in the model and sample w_il along with its corresponding edits. [sent-118, score-0.591]
61 Suppose that the words in languages A, C and D are fixed, and we wish to sample the word at language B along with the three corresponding sets of edits (remember that the edits fully determine the words). [sent-120, score-1.037]
62 While there are an exponential number of possible words/edits, we can exploit the Markov structure in the edit model to consider all such words/edits using dynamic programming, in a way broadly similar to the forward-backward algorithm for HMMs. [sent-121, score-0.323]
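A heavily simplified sketch of one such Gibbs step (ours): where the real model considers all words/edits with a forward-backward-style dynamic program, this version resamples the node's word from a small candidate set, scored by the product of edit likelihoods on its incident branches.

```python
import math
import random

def gibbs_resample(candidates, parent, children, edit_logprob, rng=None):
    """Resample an internal node's word form given its fixed neighbors.
    edit_logprob(w_old, w_new) scores one branch under the current theta."""
    rng = rng or random.Random(0)
    logps = [edit_logprob(parent, w) +
             sum(edit_logprob(w, c) for c in children)
             for w in candidates]
    m = max(logps)                               # stabilize the softmax
    weights = [math.exp(lp - m) for lp in logps]
    return rng.choices(candidates, weights)[0]
```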
63 Figure 2: the tree topologies used in the experiments, over the languages la, vl, ib, es, pt, and it, together with a table describing each experiment (e.g., Latin reconstruction). [sent-123, score-0.358]
64 The heldout column indicates how many words, if any, were held out for edit distance evaluation, and from which language. [sent-127, score-0.429]
65 2 M-step: updating the parameters In the M-step, we estimate the distribution over rules for each branch (k → l). [sent-130, score-0.352]
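For the Dirichlet/multinomial parametrization, the M-step reduces to smoothed count normalization, sketched below (our simplification: pseudocounts are added only for observed outcomes; the log-linear scheme would instead fit feature weights, e.g., by gradient ascent on the expected complete log-likelihood):

```python
from collections import defaultdict

def fit_multinomials(expected_counts, alpha=0.1):
    """expected_counts: (branch, context, outcome) -> expected count from the
    E-step. Returns smoothed rule probabilities, normalized per (branch, context)."""
    totals = defaultdict(float)
    for (branch, ctx, _outcome), c in expected_counts.items():
        totals[(branch, ctx)] += c + alpha
    return {key: (c + alpha) / totals[key[:2]]
            for key, c in expected_counts.items()}
```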
66 6 Experiments In this section, we summarize the results of the experiments testing our different probabilistic models of phonological change. [sent-139, score-0.347]
67 1 Reconstruction of ancient word forms We ran the two models using Topology 1 in Figure 2 to assess the relative performance of Dirichlet-parametrized versus log-linear-parametrized models. [sent-143, score-0.463]
68 Half of the Latin words at the root of the tree were held out, and the (uniform cost) Levenshtein edit distance from the predicted reconstruction to the truth was computed. [sent-144, score-0.513]
69 While the uniform-cost edit distance misses important aspects of phonology (not all phoneme substitutions are equal, for instance), it is parameter-free and still seems to correlate to a large extent with the linguistic quality of reconstruction. [sent-145, score-0.607]
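For completeness, a uniform-cost implementation of the metric (standard; not taken from the paper's code):

```python
def levenshtein(a, b):
    """Uniform-cost edit distance: phoneme insertions, deletions, and
    substitutions all cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x -> y
        prev = cur
    return prev[-1]

assert levenshtein("focus", "fokus") == 1
```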
70 Table 1: Results of the edit distance experiment; the Improvement row reads 7%, 11%, 12%, and 14% across the four reported conditions. [sent-155, score-0.349]
71 The language column corresponds to the language held out for evaluation. [sent-156, score-0.293]
72 We show the mean edit distance across the evaluation examples. [sent-157, score-0.349]
73 Figure 3: An example of the proper Latin reconstruction (/dEntis/) given the Spanish (/djEntes/) and Italian (/dEnti/) word forms; the recovered rules include i → E, E → jE, and word-final s-deletion. [sent-161, score-0.299]
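To make the Figure 3 derivations concrete, a toy context-free rewrite (our simplification; the paper's rules are context-dependent) roughly reproduces the daughter forms:

```python
def apply_rules(word, rules):
    """Apply single-phoneme rewrites in one left-to-right pass;
    '' as a target encodes deletion."""
    return "".join(rules.get(ch, ch) for ch in word)

print(apply_rules("dEntis", {"E": "jE", "i": "E"}))  # djEntEs, cf. /djEntes/
print(apply_rules("dEntis", {"s": ""}))              # dEnti, the Italian form
```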
74 Our baseline for comparison was picking randomly, for each heldout node in the tree, an observed neighboring word. [sent-164, score-0.319]
75 Both models outperformed this baseline (see Figure 3), and the log-linear model outperformed the Dirichlet model, suggesting that the featurized system better captures the phonological changes. [sent-167, score-0.373]
76 Moreover, adding more features further improved the performance, indicating that being able to express rules at multiple levels of granularity allows the model to capture the underlying phonological changes more accurately. [sent-168, score-0.707]
77 2 Inference of phonological changes Another use of this model is to automatically recover the phonological drift processes between known or partially-known languages. [sent-174, score-0.682]
78 One of the nodes characterizes the least common ancestor of modern Spanish and Portuguese; the other, the least common ancestor of all three modern languages. [sent-177, score-0.268]
79 Nonetheless, the major reconstructed rules still correspond to well-known phenomena and the learned model generally places them on reasonable branches. [sent-180, score-0.316]
80 Figure 4 shows the top four general rules for each of the evolutionary branches recovered by the log-linear model. [sent-181, score-0.425]
81 The rules are ranked by the number of times they were used in the derivations during the last iteration of EM. [sent-182, score-0.239]
82 The la, es, pt, and it forms are fully observed while the vl and ib forms are automatically reconstructed. [sent-183, score-0.351]
83 Figure 4 also shows a specific example of the evolution of the Latin VERBUM (word), along with the specific edits employed by the model. [sent-184, score-0.213]
84 The non-trivial rules (i.e., rules that are not identities) used at each stage are shown along the corresponding edge. [sent-191, score-0.272]
85 The boxes display the top four nontrivial rules corresponding to each of these evolutionary branches, ordered by the number of times they were applied during the last E step. [sent-192, score-0.339]
86 Under the Dirichlet parametrization, rules must be redundantly discovered, which tends to flood the top of the rule lists with duplicates. [sent-195, score-0.38]
87 In contrast, the log-linear model groups rules with features of the appropriate degree of generality. [sent-196, score-0.265]
88 While quantitative evaluation such as measuring edit distance is helpful for comparing results, it is also illuminating to consider the plausibility of the learned parameters in a historical light, which we do here briefly. [sent-197, score-0.433]
89 In particular, we consider rules on the branch between la and vl, for which we have historical evidence. [sent-198, score-0.492]
90 The Appendix Probi lists common misspellings of Latin words, from which phonological changes can be inferred. [sent-200, score-0.401]
91 On the la to vl branch, rules for word-final deletion of classical case markers dominate the list. [sent-201, score-0.53]
92 Similarly, major canonical rules were discovered in other branches as well, for example, /v/ to /b/ fortition in Spanish, palatalization along several branches, and so on. [sent-209, score-0.358]
93 Of course, the recovered words and rules are not perfect. [sent-210, score-0.289]
94 For example, reconstructed Ibero /trinta/ to Spanish /treinta/ (thirty) is generated in an odd fashion using rules /e/ to /i/ and /n/ to /in/. [sent-211, score-0.289]
95 7 Conclusion Probabilistic models have the potential to replace traditional methods used for comparing languages in diachronic linguistics with quantitative methods for reconstructing word forms and inferring phylogenies. [sent-214, score-0.994]
96 In this paper, we presented a novel probabilistic model of phonological change, in which the rules governing changes in the sound of words are parametrized using the features of the phonemes involved. [sent-215, score-0.849]
97 This model goes beyond previous work in this area, providing more accurate reconstructions of ancient word forms and connections to current work on phonology in synchronic linguistics. [sent-216, score-0.635]
98 Using a log-linear model to define the probability of a rule being applied results in a straightforward inference procedure which can be used both to produce accurate reconstructions as measured by edit distance and to identify linguistically plausible rules that account for phonological changes. [sent-217, score-1.02]
99 We believe that this probabilistic approach has the potential to support quantitative analysis of the history of languages in a way that can scale to large datasets while remaining sensitive to the concerns that have traditionally motivated diachronic linguistics. [sent-218, score-0.495]
100 Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. [sent-299, score-0.236]
wordName wordTfidf (topN-words)
[('edit', 0.323), ('phonological', 0.313), ('latin', 0.297), ('cognate', 0.27), ('word', 0.248), ('rules', 0.239), ('diachronic', 0.216), ('languages', 0.216), ('edits', 0.18), ('spanish', 0.151), ('language', 0.13), ('phoneme', 0.12), ('branch', 0.113), ('topology', 0.113), ('ancient', 0.108), ('forms', 0.107), ('evolutionary', 0.1), ('linguistics', 0.099), ('modern', 0.096), ('deletion', 0.094), ('phylogenetic', 0.094), ('portuguese', 0.094), ('phonology', 0.09), ('vulgar', 0.09), ('wil', 0.09), ('branches', 0.086), ('la', 0.085), ('vl', 0.08), ('sound', 0.078), ('dirichlet', 0.078), ('rule', 0.073), ('cognates', 0.072), ('italian', 0.071), ('phylogeny', 0.063), ('fj', 0.062), ('french', 0.057), ('ib', 0.057), ('changes', 0.056), ('historical', 0.055), ('centuries', 0.054), ('plm', 0.054), ('probi', 0.054), ('ringe', 0.054), ('phonemes', 0.053), ('reconstruction', 0.051), ('cr', 0.05), ('granularity', 0.05), ('reconstructed', 0.05), ('words', 0.05), ('linguistic', 0.048), ('heldout', 0.047), ('reconstructions', 0.046), ('romance', 0.043), ('reconstructing', 0.042), ('cl', 0.041), ('ancestor', 0.038), ('correspondences', 0.038), ('inferring', 0.037), ('alv', 0.036), ('eib', 0.036), ('featurized', 0.036), ('languagemodel', 0.036), ('oe', 0.036), ('redundantly', 0.036), ('synchronic', 0.036), ('verbum', 0.036), ('pt', 0.036), ('appendix', 0.035), ('probabilistic', 0.034), ('generative', 0.033), ('multinomial', 0.033), ('along', 0.033), ('fellowship', 0.033), ('held', 0.033), ('classical', 0.032), ('lists', 0.032), ('alleviates', 0.031), ('contexts', 0.031), ('root', 0.03), ('schemes', 0.03), ('change', 0.03), ('comparative', 0.029), ('quantitative', 0.029), ('descend', 0.029), ('phenomena', 0.027), ('insertion', 0.027), ('features', 0.026), ('distance', 0.026), ('laws', 0.025), ('substitution', 0.025), ('author', 0.025), ('non', 0.024), ('nal', 0.024), ('liang', 0.024), ('old', 0.024), ('baseline', 0.024), ('capture', 0.023), ('ar', 0.023), ('boolean', 0.023), ('overly', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 9 nips-2007-A Probabilistic Approach to Language Change
Author: Alexandre Bouchard-Côté, Percy Liang, Dan Klein, Thomas L. Griffiths
Abstract: We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. This framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies. 1
2 0.14542753 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
Author: Kristina Toutanova, Mark Johnson
Abstract: We present a novel Bayesian model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation model and incorporates the intuition that words’ distributions over tags, p(t|w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset. 1
3 0.10112533 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
4 0.096562736 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
5 0.094546556 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
6 0.091694266 197 nips-2007-The Infinite Markov Model
7 0.085745245 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
8 0.073684998 183 nips-2007-Spatial Latent Dirichlet Allocation
9 0.069284149 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
10 0.068067856 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
11 0.066663615 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
12 0.063034877 114 nips-2007-Learning and using relational theories
13 0.059525806 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
14 0.056343682 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
15 0.051895946 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
16 0.049922507 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
17 0.047246788 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers
18 0.04578938 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
19 0.043672159 189 nips-2007-Supervised Topic Models
20 0.042185556 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
topicId topicWeight
[(0, -0.135), (1, 0.047), (2, -0.028), (3, -0.179), (4, 0.052), (5, -0.054), (6, 0.03), (7, -0.065), (8, -0.028), (9, -0.023), (10, -0.0), (11, 0.026), (12, -0.029), (13, -0.134), (14, -0.02), (15, -0.164), (16, -0.144), (17, 0.042), (18, -0.029), (19, 0.06), (20, -0.011), (21, -0.015), (22, 0.039), (23, -0.041), (24, 0.011), (25, -0.035), (26, -0.015), (27, -0.016), (28, -0.116), (29, 0.1), (30, 0.054), (31, 0.002), (32, -0.044), (33, 0.024), (34, 0.064), (35, -0.058), (36, -0.065), (37, -0.075), (38, -0.03), (39, -0.003), (40, -0.021), (41, -0.007), (42, -0.038), (43, 0.062), (44, -0.041), (45, -0.003), (46, 0.022), (47, 0.088), (48, 0.062), (49, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.94829583 9 nips-2007-A Probabilistic Approach to Language Change
Author: Alexandre Bouchard-Côté, Percy Liang, Dan Klein, Thomas L. Griffiths
Abstract: We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. This framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies. 1
2 0.8012554 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
Author: Noah Goodman, Joshua B. Tenenbaum, Michael J. Black
Abstract: For infants, early word learning is a chicken-and-egg problem. One way to learn a word is to observe that it co-occurs with a particular referent across different situations. Another way is to use the social context of an utterance to infer the intended referent of a word. Here we present a Bayesian model of cross-situational word learning, and an extension of this model that also learns which social cues are relevant to determining reference. We test our model on a small corpus of mother-infant interaction and find it performs better than competing models. Finally, we show that our model accounts for experimental phenomena including mutual exclusivity, fast-mapping, and generalization from social cues.
3 0.67579067 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
4 0.67404872 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
Author: Kristina Toutanova, Mark Johnson
Abstract: We present a novel Bayesian model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation model and incorporates the intuition that words’ distributions over tags, p(t|w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset. 1
5 0.66547072 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
6 0.64745885 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
7 0.56281799 197 nips-2007-The Infinite Markov Model
8 0.52412313 22 nips-2007-Agreement-Based Learning
9 0.47662142 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
10 0.38781282 114 nips-2007-Learning and using relational theories
11 0.3822875 29 nips-2007-Automatic Generation of Social Tags for Music Recommendation
12 0.34833723 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
13 0.34815648 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
14 0.33770242 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
15 0.32387224 63 nips-2007-Convex Relaxations of Latent Variable Training
16 0.30716509 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
17 0.30290362 26 nips-2007-An online Hebbian learning rule that performs Independent Component Analysis
18 0.30248463 150 nips-2007-Optimal models of sound localization by barn owls
19 0.30116823 133 nips-2007-Modelling motion primitives and their timing in biologically executed movements
20 0.29648426 183 nips-2007-Spatial Latent Dirichlet Allocation
topicId topicWeight
[(5, 0.041), (13, 0.103), (16, 0.015), (18, 0.013), (21, 0.045), (34, 0.022), (35, 0.038), (40, 0.315), (47, 0.075), (49, 0.013), (83, 0.112), (85, 0.041), (87, 0.041), (90, 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.75380194 9 nips-2007-A Probabilistic Approach to Language Change
Author: Alexandre Bouchard-Côté, Percy Liang, Dan Klein, Thomas L. Griffiths
Abstract: We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. This framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies. 1
2 0.66206747 200 nips-2007-The Tradeoffs of Large Scale Learning
Author: Olivier Bousquet, Léon Bottou
Abstract: This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways. 1 Motivation The computational complexity of learning algorithms has seldom been taken into account by the learning theory. Valiant [1] states that a problem is “learnable” when there exists a probably approximatively correct learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., [2, 3, 4]), very little has been told about the complexity side of this proposal (e.g., [5].) Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two important examples come to mind: • Data mining exists because competitive advantages can be achieved by analyzing the masses of data that describe the life of our computerized society. Since virtually every computer generates data, the data volume is proportional to the available computing power. Therefore one needs learning algorithms that scale roughly linearly with the total volume of data. • Artificial intelligence attempts to emulate the cognitive capabilities of human beings. Our biological brains can learn quite efficiently from the continuous streams of perceptual data generated by our six senses, using limited amounts of sugar as a source of power. This observation suggests that there are learning algorithms whose computing time requirements scale roughly linearly with the total volume of data. This contribution finds its source in the idea that approximate optimization algorithms might be sufficient for learning purposes. The first part proposes new decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well known tradeoff between approximation error and estimation error. In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. The second part explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate for the statistical estimation error. 2 2.1 Approximate Optimization Setup Following [6, 2], we consider a space of input-output pairs (x, y) ∈ X × Y endowed with a probability distribution P (x, y). The conditional distribution P (y|x) represents the unknown relationship between inputs and outputs. The discrepancy between the predicted output y and the real output ˆ y is measured with a loss function ℓ(ˆ, y). Our benchmark is the function f ∗ that minimizes the y expected risk E(f ) = that is, ℓ(f (x), y) dP (x, y) = E [ℓ(f (x), y)], f ∗ (x) = arg min E [ ℓ(ˆ, y)| x]. 
y y ˆ Although the distribution P (x, y) is unknown, we are given a sample S of n independently drawn training examples (xi , yi ), i = 1 . . . n. We define the empirical risk En (f ) = 1 n n ℓ(f (xi ), yi ) = En [ℓ(f (x), y)]. i=1 Our first learning principle consists in choosing a family F of candidate prediction functions and finding the function fn = arg minf ∈F En (f ) that minimizes the empirical risk. Well known combinatorial results (e.g., [2]) support this approach provided that the chosen family F is sufficiently restrictive. Since the optimal function f ∗ is unlikely to belong to the family F, we also define ∗ ∗ fF = arg minf ∈F E(f ). For simplicity, we assume that f ∗ , fF and fn are well defined and unique. We can then decompose the excess error as ∗ ∗ E [E(fn ) − E(f ∗ )] = E [E(fF ) − E(f ∗ )] + E [E(fn ) − E(fF )] = Eapp + Eest , (1) where the expectation is taken with respect to the random choice of training set. The approximation error Eapp measures how closely functions in F can approximate the optimal solution f ∗ . The estimation error Eest measures the effect of minimizing the empirical risk En (f ) instead of the expected risk E(f ). The estimation error is determined by the number of training examples and by the capacity of the family of functions [2]. Large families1 of functions have smaller approximation errors but lead to higher estimation errors. This tradeoff has been extensively discussed in the literature [2, 3] and lead to excess error that scale between the inverse and the inverse square root of the number of examples [7, 8]. 2.2 Optimization Error Finding fn by minimizing the empirical risk En (f ) is often a computationally expensive operation. Since the empirical risk En (f ) is already an approximation of the expected risk E(f ), it should not be necessary to carry out this minimization with great accuracy. For instance, we could stop an iterative optimization algorithm long before its convergence. ˜ Let us assume that our minimization algorithm returns an approximate solution fn such that ˜ En (fn ) < En (fn ) + ρ ˜ where ρ ≥ 0 is a predefined tolerance. An additional term Eopt = E E(fn ) − E(fn ) then appears ˜ ) − E(f ∗ ) : in the decomposition of the excess error E = E E(fn E ∗ ∗ ˜ = E [E(fF ) − E(f ∗ )] + E [E(fn ) − E(fF )] + E E(fn ) − E(fn ) = Eapp + Eest + Eopt . (2) We call this additional term optimization error. It reflects the impact of the approximate optimization on the generalization performance. Its magnitude is comparable to ρ (see section 3.1.) 1 We often consider nested families of functions of the form Fc = {f ∈ H, Ω(f ) ≤ c}. Then, for each value of c, function fn is obtained by minimizing the regularized empirical risk En (f ) + λΩ(f ) for a suitable choice of the Lagrange coefficient λ. We can then control the estimation-approximation tradeoff by choosing λ instead of c. 2.3 The Approximation–Estimation–Optimization Tradeoff This decomposition leads to a more complicated compromise. It involves three variables and two constraints. The constraints are the maximal number of available training example and the maximal computation time. The variables are the size of the family of functions F, the optimization accuracy ρ, and the number of examples n. This is formalized by the following optimization problem. 
2.3 The Approximation–Estimation–Optimization Tradeoff

This decomposition leads to a more complicated compromise. It involves three variables and two constraints. The constraints are the maximal number of available training examples and the maximal computation time. The variables are the size of the family of functions F, the optimization accuracy ρ, and the number of examples n. This is formalized by the following optimization problem:

    \min_{F, ρ, n} \; E = E_app + E_est + E_opt \quad \text{subject to} \quad n \le n_{max} \;\text{ and }\; T(F, ρ, n) \le T_{max}.    (3)

The number n of training examples is a variable because we could choose to use only a subset of the available training examples in order to complete the optimization within the allotted time. This happens often in practice. Table 1 summarizes the typical evolution of the quantities of interest as the three variables F, n, and ρ increase.

Table 1: Typical variations when F, n, and ρ increase.

                                    F     n     ρ
    E_app (approximation error)     ↘
    E_est (estimation error)        ↗     ↘
    E_opt (optimization error)      ···   ···   ↗
    T     (computation time)        ↗     ↗     ↘

The solution of the optimization program (3) depends critically on which budget constraint is active: the constraint n < n_max on the number of examples, or the constraint T < T_max on the training time.

• We speak of a small-scale learning problem when (3) is constrained by the maximal number of examples n_max. Since the computing time is not limited, we can reduce the optimization error E_opt to insignificant levels by choosing ρ arbitrarily small. The excess error is then dominated by the approximation and estimation errors, E_app and E_est. Taking n = n_max, we recover the approximation–estimation tradeoff that is the object of an abundant literature.

• We speak of a large-scale learning problem when (3) is constrained by the maximal computing time T_max. Approximate optimization, that is, choosing ρ > 0, can possibly achieve better generalization because more training examples can be processed during the allowed time. The specifics depend on the computational properties of the chosen optimization algorithm through the expression of the computing time T(F, ρ, n). A small numerical illustration of this budget choice follows.
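To see how the active constraint flips the regime, here is a toy grid search over the program (3) with F fixed. The error and time models (a √(d/n) estimation term, an additive ρ, and a GD-like time c·n·log(1/ρ)) and all constants are stylized assumptions of ours, not results from the paper.

    import numpy as np

    # Stylized models, purely for illustration; d and c are hypothetical.
    def excess_error(n, rho, d=10.0):
        return np.sqrt(d / n) + rho            # estimation + optimization terms

    def time_cost(n, rho, c=1e-6):
        return c * n * np.log(1.0 / rho)       # GD-like time to accuracy rho on n examples

    def solve_budget(T_max, n_max):
        """Grid search over (n, rho) for a fixed family F, as in program (3)."""
        best_err, best_n, best_rho = np.inf, None, None
        for n in np.logspace(2, np.log10(n_max), 60):
            for rho in np.logspace(-8, -1, 60):
                if time_cost(n, rho) <= T_max:
                    err = excess_error(n, rho)
                    if err < best_err:
                        best_err, best_n, best_rho = err, int(n), rho
        return best_err, best_n, best_rho

    print(solve_budget(T_max=0.2, n_max=10**7))    # tight time budget
    print(solve_budget(T_max=1e3, n_max=10**7))    # loose time budget

Under the tight budget the search returns ρ > 0 and n well below n_max (the large-scale regime); under the loose budget it returns n = n_max and a negligible ρ (the small-scale regime).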
3 The Asymptotics of Large-scale Learning

In the previous section, we extended the classical approximation–estimation tradeoff by taking into account the optimization error. We gave an objective criterion to distinguish small-scale and large-scale learning problems. In the small-scale case, we recover the classical tradeoff between approximation and estimation. The large-scale case is substantially different because it involves the computational complexity of the learning algorithm. In order to clarify the large-scale learning tradeoff with sufficient generality, this section makes several simplifications:

• We study upper bounds of the approximation, estimation, and optimization errors (2). It is often accepted that these upper bounds give a realistic idea of the actual convergence rates [9, 10, 11, 12]. Another way to find comfort in this approach is to say that we study guaranteed convergence rates instead of possibly pathological special cases.

• We study the asymptotic properties of the tradeoff when the problem size increases. Instead of carefully balancing the three terms, we write E = O(E_app) + O(E_est) + O(E_opt) and only need to ensure that the three terms decrease at the same asymptotic rate.

• We consider a fixed family of functions F and therefore avoid taking into account the approximation error E_app. This part of the tradeoff covers a wide spectrum of practical realities, such as choosing models and choosing features. In the context of this work, we do not believe we can meaningfully address it without discussing, for instance, the thorny issue of feature selection. Instead we focus on the choice of optimization algorithm.

• Finally, in order to keep this paper short, we consider that the family of functions F is linearly parametrized by a vector w ∈ R^d. We also assume that x, y and w are bounded, ensuring that there is a constant B such that 0 ≤ ℓ(f_w(x), y) ≤ B and ℓ(·, y) is Lipschitz.

We first explain how the uniform convergence bounds provide convergence rates that take the optimization error into account. Then we discuss and compare the asymptotic learning properties of several optimization algorithms.

3.1 Convergence of the Estimation and Optimization Errors

The optimization error E_opt depends directly on the optimization accuracy ρ. However, the accuracy ρ involves the empirical quantity E_n(\tilde{f}_n) - E_n(f_n), whereas the optimization error E_opt involves its expected counterpart E(\tilde{f}_n) - E(f_n). This section discusses the impact of the optimization error E_opt and of the optimization accuracy ρ on generalization bounds that leverage the uniform convergence concepts pioneered by Vapnik and Chervonenkis (e.g., [2]). In this discussion, we use the letter c to refer to any positive constant. Multiple occurrences of the letter c do not necessarily imply that the constants have identical values.

3.1.1 Simple Uniform Convergence Bounds

Recall that we assume that F is linearly parametrized by w ∈ R^d. Elementary uniform convergence results then state that

    \mathbb{E}\Big[\, \sup_{f \in F} |E(f) - E_n(f)| \,\Big] \le c\, \sqrt{\frac{d}{n}},

where the expectation is taken with respect to the random choice of the training set.² This result immediately provides a bound on the estimation error:

    E_est = \mathbb{E}\big[ (E(f_n) - E_n(f_n)) + (E_n(f_n) - E_n(f^*_F)) + (E_n(f^*_F) - E(f^*_F)) \big]
          \le 2\, \mathbb{E}\Big[\, \sup_{f \in F} |E(f) - E_n(f)| \,\Big] \le c\, \sqrt{\frac{d}{n}}.

The same result also provides a combined bound for the estimation and optimization errors:

    E_est + E_opt = \mathbb{E}[E(\tilde{f}_n) - E_n(\tilde{f}_n)] + \mathbb{E}[E_n(\tilde{f}_n) - E_n(f_n)]
                  + \mathbb{E}[E_n(f_n) - E_n(f^*_F)] + \mathbb{E}[E_n(f^*_F) - E(f^*_F)]
          \le c\,\sqrt{\frac{d}{n}} + ρ + 0 + c\,\sqrt{\frac{d}{n}} = c\,\Big(ρ + \sqrt{\frac{d}{n}}\Big).

Unfortunately, this convergence rate is known to be pessimistic in many important cases. More sophisticated bounds are required.

² Although the original Vapnik–Chervonenkis bounds have the form c\,\sqrt{(d/n)\log(n/d)}, the logarithmic term can be eliminated using the "chaining" technique (e.g., [10]).

3.1.2 Faster Rates in the Realizable Case

When the loss function ℓ(ŷ, y) is positive, with probability 1 − e^{−τ} for any τ > 0, relative uniform convergence bounds state that

    \sup_{f \in F} \frac{E(f) - E_n(f)}{\sqrt{E(f)}} \le c\, \sqrt{\frac{d}{n}\log\frac{n}{d} + \frac{τ}{n}}.

This result is very useful because it provides faster convergence rates O(log n / n) in the realizable case, that is, when ℓ(f_n(x_i), y_i) = 0 for all training examples (x_i, y_i). We then have E_n(f_n) = 0 and E_n(\tilde{f}_n) ≤ ρ, and we can write

    E(\tilde{f}_n) - ρ \le c\, \sqrt{E(\tilde{f}_n)}\, \sqrt{\frac{d}{n}\log\frac{n}{d} + \frac{τ}{n}}.

Viewing this as a second-degree polynomial inequality in the variable \sqrt{E(\tilde{f}_n)} (solving for the positive root and squaring), we obtain

    E(\tilde{f}_n) \le c\,\Big( ρ + \frac{d}{n}\log\frac{n}{d} + \frac{τ}{n} \Big).

Integrating this inequality using a standard technique (see, e.g., [13]), we obtain a better convergence rate for the combined estimation and optimization error:

    E_est + E_opt = \mathbb{E}\big[ E(\tilde{f}_n) - E(f^*_F) \big] \le \mathbb{E}\big[ E(\tilde{f}_n) \big] = c\,\Big( ρ + \frac{d}{n}\log\frac{n}{d} \Big).

3.1.3 Fast Rate Bounds

Many authors (e.g., [10, 4, 12]) obtain fast statistical estimation rates in more general conditions. These bounds have the general form

    E_app + E_est \le c\,\Big( E_app + \Big(\frac{d}{n}\log\frac{n}{d}\Big)^{α} \Big) \quad \text{for } \tfrac{1}{2} \le α \le 1.    (4)

This result holds when one can establish the following variance condition:

    \forall f \in F, \quad \mathbb{E}\Big[ \big(\ell(f(X), Y) - \ell(f^*_F(X), Y)\big)^2 \Big] \le c\, \big( E(f) - E(f^*_F) \big)^{2 - \frac{1}{α}}.    (5)

The convergence rate of (4) is described by the exponent α, which is determined by the quality of the variance bound (5).
Works on fast statistical estimation identify two main ways to establish such a variance condition:

• Exploiting the strict convexity of certain loss functions [12, theorem 12]. For instance, Lee et al. [14] establish a O(log n / n) rate using the squared loss ℓ(ŷ, y) = (ŷ − y)².

• Making assumptions on the data distribution. In the case of pattern recognition problems, for instance, the "Tsybakov condition" indicates how cleanly the posterior distributions P(y|x) cross near the optimal decision boundary [11, 12]. The realizable case discussed in section 3.1.2 can be viewed as an extreme case of this.

Despite their much greater complexity, fast rate estimation results can accommodate the optimization accuracy ρ using essentially the methods illustrated in sections 3.1.1 and 3.1.2. We then obtain a bound of the form

    E = E_app + E_est + E_opt = \mathbb{E}\big[ E(\tilde{f}_n) - E(f^*) \big] \le c\,\Big( E_app + \Big(\frac{d}{n}\log\frac{n}{d}\Big)^{α} + ρ \Big).    (6)

For instance, a general result with α = 1 is provided by Massart [13, theorem 4.2]. Combining this result with standard bounds on the complexity of classes of linear functions (e.g., [10]) yields the following result:

    E = E_app + E_est + E_opt = \mathbb{E}\big[ E(\tilde{f}_n) - E(f^*) \big] \le c\,\Big( E_app + \frac{d}{n}\log\frac{n}{d} + ρ \Big).    (7)

See also [15, 4] for more bounds taking into account the optimization accuracy.

3.2 Gradient Optimization Algorithms

We now discuss and compare the asymptotic learning properties of four gradient optimization algorithms. Recall that the family of functions F is linearly parametrized by w ∈ R^d. Let w^*_F and w_n correspond to the functions f^*_F and f_n defined in section 2.1. In this section, we assume that the functions w ↦ ℓ(f_w(x), y) are convex and twice differentiable with continuous second derivatives. Convexity ensures that the empirical cost function C(w) = E_n(f_w) has a single minimum.

Two matrices play an important role in the analysis: the Hessian matrix H and the gradient covariance matrix G, both measured at the empirical optimum w_n:

    H = \frac{\partial^2 C}{\partial w^2}(w_n) = \mathbb{E}_n\Big[ \frac{\partial^2 \ell(f_{w_n}(x), y)}{\partial w^2} \Big],    (8)

    G = \mathbb{E}_n\Big[ \Big(\frac{\partial \ell(f_{w_n}(x), y)}{\partial w}\Big) \Big(\frac{\partial \ell(f_{w_n}(x), y)}{\partial w}\Big)' \Big].    (9)

The relation between these two matrices depends on the chosen loss function. In order to summarize them, we assume that there are constants λ_max ≥ λ_min > 0 and ν > 0 such that, for any η > 0, we can choose the number of examples n large enough to ensure that the following assertion is true with probability greater than 1 − η:

    \mathrm{EigenSpectrum}(H) \subset [\,λ_{min},\, λ_{max}\,] \quad\text{and}\quad \mathrm{tr}(G H^{-1}) \le ν.    (10)

The condition number κ = λ_max/λ_min is a good indicator of the difficulty of the optimization [16]. The condition λ_min > 0 avoids complications with stochastic gradient algorithms. Note that this condition only implies strict convexity around the optimum. For instance, consider a loss function ℓ obtained by smoothing the well-known hinge loss ℓ(z, y) = max{0, 1 − yz} in a small neighborhood of its non-differentiable points. The function C(w) is then piecewise linear with smoothed edges and vertices; it is not strictly convex. However, its minimum is likely to be on a smoothed vertex with a non-singular Hessian. When we have strict convexity, the argument of [12, theorem 12] yields fast estimation rates α ≈ 1 in (4) and (6). This is not necessarily the case here. A small numerical sketch of these two matrices follows.
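For concreteness, here is a brief sketch (our illustration, not the paper's) that computes H, G, κ and tr(GH⁻¹) at the empirical optimum for the squared loss on synthetic data; the data-generating choices are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 5000, 5
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    # Empirical optimum w_n of C(w) = E_n[(w.x - y)^2].
    w_n, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = X @ w_n - y                                   # residuals at w_n

    H = 2.0 * X.T @ X / n                             # Hessian (8) at w_n
    grads = 2.0 * r[:, None] * X                      # per-example gradients at w_n
    G = grads.T @ grads / n                           # gradient covariance (9)

    spectrum = np.linalg.eigvalsh(H)                  # lambda_min ... lambda_max
    kappa = spectrum[-1] / spectrum[0]                # condition number in (10)
    nu = np.trace(G @ np.linalg.inv(H))               # tr(G H^{-1}) in (10)
    print(f"kappa = {kappa:.2f}, nu = {nu:.4f}")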
The four algorithms considered in this paper use information about the gradient of the cost function to iteratively update their current estimate w(t) of the parameter vector.

• Gradient Descent (GD) iterates

    w(t+1) = w(t) - η\, \frac{\partial C}{\partial w}(w(t)) = w(t) - \frac{η}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_i), y_i\big),

where η > 0 is a small enough gain. GD is an algorithm with linear convergence [16]. When η = 1/λ_max, this algorithm requires O(κ log(1/ρ)) iterations to reach accuracy ρ. The exact number of iterations depends on the choice of the initial parameter vector.

• Second Order Gradient Descent (2GD) iterates

    w(t+1) = w(t) - H^{-1}\, \frac{\partial C}{\partial w}(w(t)) = w(t) - \frac{1}{n} H^{-1} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_i), y_i\big),

where the matrix H^{-1} is the inverse of the Hessian matrix (8). This is more favorable than Newton's algorithm because we do not evaluate the local Hessian at each iteration but simply assume that we know in advance the Hessian at the optimum. 2GD is a superlinear optimization algorithm with quadratic convergence [16]. When the cost is quadratic, a single iteration is sufficient. In the general case, O(log log(1/ρ)) iterations are required to reach accuracy ρ.

• Stochastic Gradient Descent (SGD) picks a random training example (x_t, y_t) at each iteration and updates the parameter w on the basis of this example only:

    w(t+1) = w(t) - \frac{η}{t}\, \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_t), y_t\big).

Murata [17, section 2.2] characterizes the mean E_S[w(t)] and variance Var_S[w(t)] with respect to the distribution implied by the random examples drawn from the training set S at each iteration. Applying this result to the discrete training set distribution for η = 1/λ_min, we have δw(t)² = O(1/t), where δw(t) is a shorthand notation for w(t) − w_n. We can then write

    \mathbb{E}_S[\, C(w(t)) - \inf C \,] = \mathbb{E}_S\big[ \mathrm{tr}\big( H\, δw(t)\, δw(t)' \big) \big] + o\big(\tfrac{1}{t}\big)
      = \mathrm{tr}\big( H\, \mathbb{E}_S[δw(t)]\, \mathbb{E}_S[δw(t)]' + H\, \mathrm{Var}_S[w(t)] \big) + o\big(\tfrac{1}{t}\big)
      \le \frac{\mathrm{tr}(G H)}{t} + o\big(\tfrac{1}{t}\big) \le \frac{ν κ^2}{t} + o\big(\tfrac{1}{t}\big).    (11)

Therefore the SGD algorithm reaches accuracy ρ after less than νκ²/ρ + o(1/ρ) iterations on average. The SGD convergence is essentially limited by the stochastic noise induced by the random choice of one example at each iteration. Neither the initial value of the parameter vector w nor the total number of examples n appears in the dominant term of this bound! When the training set is large, one could reach the desired accuracy ρ measured on the whole training set without even visiting all the training examples. This is in fact a kind of generalization bound.

• Second Order Stochastic Gradient Descent (2SGD) replaces the gain η/t by the scaled inverse Hessian (1/t) H^{-1}:

    w(t+1) = w(t) - \frac{1}{t}\, H^{-1}\, \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_t), y_t\big).

Unlike standard gradient algorithms, using the second order information does not change the influence of ρ on the convergence rate but improves the constants. Using again [17, theorem 4], accuracy ρ is reached after ν/ρ + o(1/ρ) iterations.

Table 2: Asymptotic results for gradient algorithms (with probability 1). Compare the second-to-last column (time to optimize) with the last column (time to reach the excess test error ε). Legend: n is the number of examples; d the parameter dimension; κ, ν as in equation (10).

    Algorithm   Cost of one iteration   Iterations to reach ρ   Time to reach accuracy ρ    Time to reach E ≤ c(E_app + ε)
    GD          O(nd)                   O(κ log(1/ρ))           O(ndκ log(1/ρ))             O((d²κ/ε^{1/α}) log²(1/ε))
    2GD         O(d² + nd)              O(log log(1/ρ))         O((d² + nd) log log(1/ρ))   O((d²/ε^{1/α}) log(1/ε) log log(1/ε))
    SGD         O(d)                    νκ²/ρ + o(1/ρ)          O(dνκ²/ρ)                   O(dνκ²/ε)
    2SGD        O(d²)                   ν/ρ + o(1/ρ)            O(d²ν/ρ)                    O(d²ν/ε)

For each of the four gradient algorithms, the first three columns of Table 2 report the time for a single iteration, the number of iterations needed to reach a predefined accuracy ρ, and their product, the time needed to reach accuracy ρ. These asymptotic results are valid with probability 1, since the probability of their complement is smaller than η for any η > 0. The sketch below contrasts GD and SGD on exactly this iteration accounting.
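The contrast between per-iteration cost and iteration count can be observed directly. The sketch below is ours, with synthetic data and the squared loss; the gain offset t0 is a standard practical safeguard for the early iterations and is not part of the idealized analysis. It runs GD with η = 1/λ_max and SGD with gain ≈ 1/(λ_min t) until the empirical cost is within ρ of its minimum.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, rho = 10000, 10, 1e-3
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def cost(w):                                      # empirical cost C(w)
        return np.mean((X @ w - y) ** 2)

    w_n, *_ = np.linalg.lstsq(X, y, rcond=None)       # empirical optimum
    target = cost(w_n) + rho
    spectrum = np.linalg.eigvalsh(2.0 * X.T @ X / n)  # eigenvalues of H

    # GD: gain 1/lambda_max; each iteration touches all n examples, O(nd).
    w, gd_iters = np.zeros(d), 0
    while cost(w) >= target:
        w -= (1.0 / spectrum[-1]) * (2.0 / n) * (X.T @ (X @ w - y))
        gd_iters += 1

    # SGD: gain (1/lambda_min)/(t + t0); each iteration touches one example, O(d).
    # cost(w) is evaluated here only to measure progress, not by the algorithm.
    t0 = 50                                           # stabilizes the first steps
    w, sgd_iters = np.zeros(d), 0
    while cost(w) >= target:
        i = rng.integers(n)
        sgd_iters += 1
        w -= (1.0 / (spectrum[0] * (sgd_iters + t0))) * 2.0 * (X[i] @ w - y[i]) * X[i]

    print(f"GD: {gd_iters} iterations of O(nd); SGD: {sgd_iters} iterations of O(d)")

Even when SGD needs many more iterations, each one costs O(d) instead of O(nd), so on large training sets its total time is typically far smaller, which is the message of Table 2.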
The fourth column bounds the time necessary to reduce the excess error E below c(E_app + ε), where c is the constant from (6). This is computed by observing that choosing ρ ∼ (\frac{d}{n}\log\frac{n}{d})^{α} in (6) achieves the fastest rate for ε, with minimal computation time. We can then use the asymptotic equivalences ρ ∼ ε and n ∼ \frac{d}{ε^{1/α}}\log\frac{1}{ε}. Setting the fourth-column expressions to T_max and solving for ε yields the best excess error achieved by each algorithm within the limited time T_max. This provides the asymptotic solution of the estimation–optimization tradeoff (3) for large-scale problems satisfying our assumptions.

These results clearly show that the generalization performance of large-scale learning systems depends on both the statistical properties of the estimation procedure and the computational properties of the chosen optimization algorithm. Their combination leads to surprising consequences:

• The SGD and 2SGD results do not depend on the estimation rate α. When the estimation rate is poor, there is less need to optimize accurately. That leaves time to process more examples. A potentially more useful interpretation leverages the fact that (11) is already a kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation error.

• Second order algorithms bring little asymptotic improvement in ε. Although the superlinear 2GD algorithm improves the logarithmic term, all four algorithms are dominated by the polynomial term in (1/ε). However, there are important variations in the influence of the constants d, κ and ν. These constants are very important in practice.

• Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite being the worst optimization algorithms. This had been described before [18] and observed in experiments.

In contrast, since the optimization error E_opt of small-scale learning systems can be reduced to insignificant levels, their generalization performance is solely determined by the statistical properties of their estimation procedure.

4 Conclusion

Taking into account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems. The generalization properties of large-scale learning systems depend on both the statistical properties of the estimation procedure and the computational properties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient algorithms.

Considerable refinements of this framework can be expected. Extending the analysis to regularized risk formulations would make results on the complexity of primal and dual optimization algorithms [19, 20] directly exploitable. The choice of surrogate loss function [7, 12] could also have a non-trivial impact in the large-scale case.

Acknowledgments

Part of this work was funded by NSF grant CCR-0325463.
References

[1] Leslie G. Valiant. A theory of the learnable. Proc. of the 1984 STOC, pages 436–445, 1984.
[2] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, 1982.
[3] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.
[5] J. Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4(3):177–192, 1988.
[6] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley and Sons, 1973.
[7] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.
[8] Clint Scovel and Ingo Steinwart. Fast rates for support vector machines. In Peter Auer and Ron Meir, editors, Proceedings of the 18th Conference on Learning Theory (COLT 2005), volume 3559 of Lecture Notes in Computer Science, pages 279–294, Bertinoro, Italy, June 2005. Springer-Verlag.
[9] Vladimir N. Vapnik, Esther Levin, and Yann LeCun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5):851–876, 1994.
[10] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, École Polytechnique, 2002.
[11] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1), 2004.
[12] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006.
[13] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, series 6, 9(2):245–303, 2000.
[14] Wee S. Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998.
[15] Shahar Mendelson. A few notes on statistical learning theory. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures in Machine Learning, volume 2600 of Lecture Notes in Computer Science, pages 1–40. Springer-Verlag, Berlin, 2003.
[16] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.
[17] Noboru Murata. A statistical study of on-line learning. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.
[18] Léon Bottou and Yann Le Cun. Large scale online learning. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[19] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of KDD'06, Philadelphia, PA, USA, August 20–23, 2006. ACM.
[20] Don Hush, Patrick Kelly, Clint Scovel, and Ingo Steinwart. QP algorithms with guaranteed accuracy and run time for support vector machines. Journal of Machine Learning Research, 7:733–769, 2006.
3 0.49913543 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary invariances. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
4 0.49805456 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
Author: Fred Richardson, William M. Campbell
Abstract: Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive sequences, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.
5 0.49532828 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
6 0.49410978 84 nips-2007-Expectation Maximization and Posterior Constraints
7 0.49329585 22 nips-2007-Agreement-Based Learning
8 0.48382843 63 nips-2007-Convex Relaxations of Latent Variable Training
9 0.48203954 191 nips-2007-Temporal Difference Updating without a Learning Rate
10 0.47622088 86 nips-2007-Exponential Family Predictive Representations of State
11 0.47510374 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
12 0.4725984 102 nips-2007-Incremental Natural Actor-Critic Algorithms
13 0.47072694 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
14 0.46980232 134 nips-2007-Multi-Task Learning via Conic Programming
15 0.46948406 115 nips-2007-Learning the 2-D Topology of Images
16 0.46859562 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
17 0.46727049 24 nips-2007-An Analysis of Inference with the Universum
18 0.46709648 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
19 0.46707404 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
20 0.46538508 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine