nips nips2010 nips2010-264 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mark Johnson, Katherine Demuth, Bevan Jones, Michael J. Black
Abstract: This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. [sent-14, score-0.503]
2 These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. [sent-16, score-0.444]
3 We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these. [sent-18, score-0.327]
4 1 Introduction Conventional views of language acquisition often assume that human language learners initially use a single source of information to acquire one component of language, which they then use to leverage the acquisition of other linguistic components. [sent-19, score-0.5]
5 For example, Kuhl [1] presents a standard “bootstrapping” view of early language acquisition in which successively more difficult tasks are addressed by learners, beginning with phoneme inventory and progressing to word segmentation and word learning. [sent-20, score-1.273]
6 For example, Graf Estes et al [2] showed that infants were more successful in mapping novel objects to novel words after those words had been successfully segmented from the speech stream. [sent-23, score-0.469]
7 Computationally speaking, an interactive account views language acquisition as a joint inference problem for all components of language simultaneously, rather than a discrete sequence of inference problems for individual language components. [sent-25, score-0.499]
8 (We are thus using “interactive” to refer to the way that language acquisition is formulated as an inference problem, rather than a specific mechanism or architecture as in [3]). [sent-26, score-0.296]
9 , where improvements in the acquisition of component A also improve the acquisition of... (Figure 1: The photograph indicates the non-linguistic context containing the (toy) pig and dog for the utterance Is that the pig? Below it, the prefix PIG|DOG is followed by the unsegmented input i z D & t D e p I g, with the word pig mapped to the object PIG.) [sent-31, score-0.588]
10 The objects in the non-linguistic context are indicated by the prefix “PIG|DOG”, which is followed by the unsegmented phonemicised input. [sent-34, score-0.18]
11 The possible word segmentation points are indicated by separators between the phonemes. [sent-35, score-0.604]
12 The correct word segmentation is indicated by the filled blue word separators, and the mapping between words and non-linguistic objects is indicated by the underbrace subscript. [sent-37, score-1.17]
13 ...component B, and improvements in the acquisition of component B also improve the acquisition of component A. [sent-38, score-0.302]
14 In this paper we focus on the acquisition of two of the simpler aspects of language: (i) segmenting sentences into words (thereby identifying their pronunciations), and (ii) the relationship between words and the objects they refer to. [sent-40, score-0.504]
15 The acquisition of word pronunciations is viewed as a segmentation problem as follows. [sent-43, score-0.762]
16 Following Elman [4] and Brent [5, 6], a corpus of child-directed speech is “phonemicised” by looking each word up in a pronouncing dictionary and concatenating those pronunciations. [sent-44, score-0.452]
17 For example, the mother’s utterance Is that the pig is mapped to the broad phonemic representation Iz D&t D6 pIg (in an ASCII-based broad phonemic encoding), and the word pronunciations are then concatenated to form IzD&tD6pIg. [sent-45, score-0.348]
18 The word segmentation task is to segment a corpus of such unsegmented utterance representations into words, thus identifying the pronunciations of the words in the corpus. [sent-46, score-0.94]
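To make the segmentation setup concrete, here is a minimal Python sketch (ours, not from the paper) of phonemicising an utterance by dictionary lookup and representing a candidate segmentation as a set of boundary positions; the toy pronouncing dictionary and function names are illustrative assumptions.

```python
# Minimal sketch (not from the paper): phonemicise an utterance by dictionary
# lookup, then represent a candidate segmentation as boundary indices.
PRON = {"is": "Iz", "that": "D&t", "the": "D6", "pig": "pIg"}  # toy pronouncing dictionary

def phonemicise(utterance):
    """Concatenate the dictionary pronunciations of the words in the utterance."""
    return "".join(PRON[w] for w in utterance.lower().split())

def segment(phonemes, boundaries):
    """Split the unsegmented phoneme string at the given boundary positions."""
    cuts = [0] + sorted(boundaries) + [len(phonemes)]
    return [phonemes[i:j] for i, j in zip(cuts, cuts[1:])]

unsegmented = phonemicise("Is that the pig")   # 'IzD&tD6pIg'
print(segment(unsegmented, {2, 5, 7}))         # ['Iz', 'D&t', 'D6', 'pIg']
```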
19 We study the acquisition of the relationship between words and the objects they refer to using the framework proposed by Frank et al [7]. [sent-47, score-0.55]
20 Here each utterance in the corpus is labelled with the contextually-relevant objects that the speaker might be referring to. [sent-48, score-0.34]
21 For example, in the context of Figure 1, the utterance would be labelled with the two contextually-relevant objects PIG and DOG. [sent-50, score-0.29]
22 Jones et al [8] combined the word segmentation and word reference tasks into a single inference task, where the goal is to simultaneously segment the utterance into words, and to map a subset of the words of each utterance to the utterance’s contextually-relevant objects. [sent-52, score-1.564]
23 The next section summarises previous work on word segmentation and learning the relationship between words and their referents. [sent-55, score-0.71]
24 Section 3 introduces Adaptor Grammars, explains how they can be used for word segmentation and topic modelling, and presents the Adaptor Grammars that will be used in this paper. [sent-56, score-0.621]
25 Section 4 presents experimental results showing synergistic interactions between word segmentation and learning the relationship between words and the objects they refer to, while section 5 summarises and concludes the paper. [sent-57, score-0.841]
26 2 Previous work Word segmentation has been studied using a wide variety of computational perspectives. [sent-58, score-0.196]
27 Elman [4] and Brent [5, 6] introduced the basic word segmentation paradigm investigated here. [sent-59, score-0.578]
28 Johnson et al [11] introduced a generalisation of Probabilistic Context-Free Grammars (PCFGs) called Adaptor Grammars (AGs) as a framework for specifying HDPs for linguistic applications (because this paper relies heavily on AGs we describe them in more detail in section 3 below). [sent-68, score-0.2]
29 Johnson [12] investigated AGs for word segmentation that capture a range of different kinds of generalisations. [sent-69, score-0.578]
30 The unigram AG replicates the unigram segmentation model of Goldwater et al, and suffers from the same undersegmentation problems. [sent-70, score-0.496]
31 The acquisition of the mapping between words and the objects they refer to was studied by Frank et al [7]. [sent-72, score-0.514]
32 They used a modified version of the LDA topic model [13] where the “topics” are contextually-relevant objects that words in the utterance can refer to, so the mapping from “topics” to words effectively specifies which words refer to these contextually-salient objects. [sent-73, score-0.59]
33 Jones et al [8] integrated the Frank et al “topic” model of the word-object relationship with the unigram model of Goldwater et al to obtain a joint model that both performs word segmentation and also learns which words refer to which contextually-salient objects. [sent-74, score-1.378]
34 We use this reduction to express the Frank et al [7] models of the word-object relationship as AGs which also incorporate Johnson’s [12] models of word segmentation. [sent-76, score-1.01]
35 The resulting AGs can express a wide range of joint HDP models of word segmentation and the word-object relationship, including the model proposed by Jones et al [8], as well as several generalisations. [sent-77, score-0.744]
36 Informally, θA→α is the probability of a node labelled A expanding to a sequence of nodes labelled α, and the probability of a tree is the product of the probabilities of the rules used to construct each non-leaf node in it. [sent-95, score-0.228]
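As a worked illustration of this definition, the following sketch computes the probability of a tree as the product of the probabilities of the rules at its non-leaf nodes; the nested-tuple tree encoding and rule-table format are assumptions of this example, not the paper's.

```python
# Minimal sketch (not from the paper): P(tree) under a PCFG is the product of
# the rule probabilities used at each non-leaf node.
def tree_prob(tree, theta):
    """tree: (label, [children]) for non-leaves, a plain string for leaves.
    theta: dict mapping (parent_label, (child_label, ...)) -> rule probability."""
    if isinstance(tree, str):            # leaf (terminal): contributes no rule
        return 1.0
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = theta[(label, child_labels)]     # probability of the rule used at this node
    for child in children:
        p *= tree_prob(child, theta)
    return p

theta = {("S", ("NP", "VP")): 1.0, ("NP", ("pig",)): 0.4, ("VP", ("oinks",)): 0.7}
print(tree_prob(("S", [("NP", ["pig"]), ("VP", ["oinks"])]), theta))  # 0.28
```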
37 Kurihara et al [17] describe a Variational Bayes algorithm for inferring PCFGs using a mean-field approximation, while Johnson et al [18] describe a Markov Chain Monte Carlo algorithm based on Gibbs sampling. [sent-126, score-0.332]
38 3.2 Modelling word-object reference using PCFGs. This section presents a novel encoding of a Frank et al [7] model for identifying word-object relationships as a PCFG. [sent-128, score-0.211]
39 Because the Frank et al [7] model of the word-object relationship is very similar to an LDA topic model, we can use the same techniques to design Bayesian PCFGs that infer word-object relationships. [sent-131, score-0.307]
40 The models we investigate in this paper assume that the words in a single sentence refer to at most one non-linguistic object (although it would be easy to relax this restriction). [sent-132, score-0.376]
41 Let O′ = O ∪ {∅}, where ∅ is a distinguished "null object" not in O, and let the nonterminals N = {S} ∪ {Ao, Bo : o ∈ O′}, where Ao and Bo are nonterminals indexed by the objects o ∈ O′. [sent-136, score-0.198]
42 Informally, a nonterminal Bo expanding to word w ∈ V indicates that w refers to object o, while a B∅ expanding to w indicates that w is non-referential. [sent-137, score-0.61]
43 The set of objects in the non-linguistic context of an utterance is indicated by prefixing the utterance with a context identifier associated with those objects, such as “PIG|DOG” in Figure 1. [sent-138, score-0.394]
44 Then the terminals of the grammar are the context identifiers together with the words of the vocabulary. (Figure 2: A tree generated by the reference PCFG encoding a Frank et al [7] model of the word-object relationship; the tree yields the context identifier PIG|DOG followed by the words is that the pig, with pig generated under Bpig and the other words under B∅.) [sent-143, score-0.475]
45 The yield of this tree corresponds to the sentence Is that the pig, and the context identifier is “PIG|DOG”. [sent-144, score-0.266]
46 This grammar generates sentences consisting of a context identifier followed by a sequence of words; e. [sent-147, score-0.223]
47 Informally, the rule expanding S picks an object o that the words in the sentence can refer to (if o = ∅ then all words in the sentence are non-referential). [sent-150, score-0.592]
48 The first rule expanding Ao ensures that o is a member of that sentence’s non-linguistic context, the second rule generates a Bo that will ultimately generate a word w (which we take to indicate that w refers to o), while the third rule generates a word associated with the null object ∅. [sent-151, score-1.09]
49 A slightly more complicated PCFG, which we call the reference1 grammar, can enforce the requirement that there is at most one referential word in each sentence. [sent-152, score-0.494]
50 The reference1 grammar (3) is: S → S B∅ ; S → c (c ∈ C) ; S → Ao Bo (o ∈ O) ; Ao → c (c ∈ C, o ∈ c) ; Ao → Ao B∅ (o ∈ O) ; Bo → w (o ∈ O′, w ∈ V). In this grammar the nonterminal labels function as states that record not just which object a referential word refers to, but also whether that referential word has been generated or not. [sent-157, score-1.237]
51 Viewed top-down, the switch from S to Ao indicates that a word from Bo has just been generated (i.e., that the sentence’s single referential word has already been produced). [sent-158, score-0.382]
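A small sketch of how the rule schema of the reference1 grammar (3) could be instantiated for a toy object set, context set and vocabulary; the nonterminal naming scheme, the context encoding and the use of None for ∅ are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not from the paper's code): enumerate the rules of the
# reference1 grammar (3) for a toy setting. Contexts are sets of objects;
# the null object ∅ is written None.
def reference1_rules(objects, contexts, vocab):
    O_prime = list(objects) + [None]                          # O' = O ∪ {∅}
    rules = [("S", ("S", "B_None"))]                          # S -> S B∅
    for c in contexts:
        rules.append(("S", (f"ctx:{'|'.join(sorted(c))}",)))  # S -> c
    for o in objects:
        rules.append(("S", (f"A_{o}", f"B_{o}")))             # S -> Ao Bo
        rules.append((f"A_{o}", (f"A_{o}", "B_None")))        # Ao -> Ao B∅
        for c in contexts:
            if o in c:                                        # Ao -> c only if o ∈ c
                rules.append((f"A_{o}", (f"ctx:{'|'.join(sorted(c))}",)))
    for o in O_prime:
        for w in vocab:
            rules.append((f"B_{o}", (w,)))                    # Bo -> w
    return rules

rules = reference1_rules({"PIG", "DOG"}, [{"PIG", "DOG"}], ["Iz", "D&t", "D6", "pIg"])
print(len(rules), rules[:4])
```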
52 3.3 Adaptor grammars. This subsection briefly reviews adaptor grammars; for more detail see [11]. [sent-163, score-0.474]
53 An Adaptor Grammar (AG) is a septuple (N, W, R, S, θ, A, C) consisting of a PCFG (N, W, R, S, θ) in which a subset A ⊆ N of the nonterminals are identified as adapted, and where each adapted nonterminal X ∈ A has an associated adaptor CX . [sent-164, score-0.514]
54 An adaptor CX for X is a function that maps a distribution over trees TX to a distribution over distributions over TX . [sent-165, score-0.322]
55 Generating a tree associated with an adapted nonterminal involves either reusing an already generated tree from the cache, or else generating a “fresh” tree as in a PCFG. [sent-186, score-0.234]
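The caching behaviour of an adaptor can be sketched with a Chinese-restaurant-process style sampler that either reuses a cached tree or draws a fresh one from the base distribution. This is a simplified illustration (a single concentration parameter, no discount), not the paper's implementation; the class and parameter names are ours.

```python
# Minimal sketch (not from the paper's code): a CRP-style adaptor that reuses a
# cached tree with probability proportional to its count, or draws a fresh tree
# from the base distribution with probability proportional to alpha.
import random
from collections import Counter

class CRPAdaptor:
    def __init__(self, base_sampler, alpha=1.0):
        self.base_sampler = base_sampler   # draws a fresh tree from the PCFG base
        self.alpha = alpha                 # concentration parameter
        self.cache = Counter()             # counts of previously generated trees
        self.total = 0

    def sample(self):
        r = random.uniform(0, self.total + self.alpha)
        for tree, count in self.cache.items():
            if r < count:                  # reuse a cached tree
                self._record(tree)
                return tree
            r -= count
        tree = self.base_sampler()         # otherwise generate a fresh tree
        self._record(tree)
        return tree

    def _record(self, tree):
        self.cache[tree] += 1
        self.total += 1

# Example: base distribution over Word subtrees as random 3-phoneme strings.
adaptor = CRPAdaptor(lambda: "".join(random.choice("pIgd6") for _ in range(3)))
print([adaptor.sample() for _ in range(5)])  # repeats become increasingly likely
```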
56 3.4 Word segmentation with adaptor grammars. AGs can be used as models of word segmentation, which we briefly review here; see Johnson [12] for more details. [sent-188, score-1.052]
57 The unsegmented input for this example utterance (with its correct segmentation indicated in blue) is as follows: i z D & t D e p I g. We can represent any possible segmentation of any possible sentence as a tree generated by the following unigram AG. [sent-191, score-0.781]
58 The trees generated by this adaptor grammar are the same as the trees generated by the CFG rules. [sent-195, score-0.465]
59 For example, the following skeletal parse in which all but the Word nonterminals are suppressed (the others are deterministically inferrable) shows the parse that corresponds to the correct segmentation of the string above. [sent-197, score-0.404]
60 (Word i z) (Word D & t) (Word D e) (Word p I g) Because the Word nonterminal in the AG is adapted (indicated here by underlining) the adaptor grammar learns the probability of the entire Word subtrees (e.g., the probability that Word expands to the phonemes p I g). [sent-198, score-0.522]
61 This AG implements the unigram segmentation model of Goldwater et al [9], and as explained in section 2, it has the same tendency to undersegment as the original unigram model. [sent-201, score-0.624]
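The undersegmentation tendency can be seen in a sketch of the unigram CRP predictive probability: once a frequent collocation such as D6pIg ("the pig") has been cached as a single unit, analyses that reuse it outscore the correctly segmented alternative. The counts, phoneme inventory size and base distribution below are illustrative assumptions, not estimates from the corpus.

```python
# Minimal sketch (not from the paper's code): score a candidate segmentation
# under a unigram CRP lexicon. Each segment is either reused from the cache
# (probability proportional to its count) or generated fresh from a
# phoneme-level base distribution.
def base_prob(word, n_phonemes=10, p_stop=0.5):
    # Brent-style base distribution: uniform phonemes, geometric word length.
    m = len(word)
    return p_stop * (1 - p_stop) ** (m - 1) * (1.0 / n_phonemes) ** m

def segmentation_prob(segments, cache, total, alpha=1.0):
    p = 1.0
    cache = dict(cache)
    for w in segments:
        count = cache.get(w, 0)
        p *= (count + alpha * base_prob(w)) / (total + alpha)
        cache[w] = count + 1       # the CRP cache grows as words are generated
        total += 1
    return p

cache = {"D6pIg": 10, "Iz": 8, "D&t": 5}   # "the pig" already cached as one unit
print(segmentation_prob(["Iz", "D&t", "D6pIg"], cache, total=23))        # higher
print(segmentation_prob(["Iz", "D&t", "D6", "pIg"], cache, total=23))    # much lower
```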
62 The collocation AG (5) produces a more accurate segmentation because it models (and therefore "explains away") some of the inter-word dependencies. [sent-202, score-0.395]
63 The collocation AG is a hierarchical process, where the base distribution for the Colloc (collocation) nonterminal adaptor is generated from the Word distribution. [sent-206, score-0.583]
64 The collocation AG generates a sentence as a sequence of Colloc (collocation) nonterminals, each of which is a sequence of Word nonterminals. [sent-207, score-0.454]
65 3.5 Adaptor grammars for joint segmentation and word-object acquisition. This section explains how to combine the word-object reference PCFGs presented in section 3.2 [sent-211, score-0.58]
66 with the word segmentation AGs presented in section 3.4. [sent-212, score-0.578]
67 Combining the word-object reference PCFGs (2) or (3) with the unigram AG (4) is relatively straightforward; all we need to do is replace the last rule Bo → w in these grammars with Bo → Phoneme+. [sent-214, score-0.397]
68 That is, the Bo nonterminals expand to an arbitrary sequence of phonemes, and the Bo nonterminals are adapted, so these subtrees are cached and reused as appropriate. [sent-216, score-0.198]
69 This grammar generates skeletal parses such as the following: (B∅ i z) (B∅ D & t) (B∅ D e) (BPIG p I g) The unigram-reference1 AG is similar to the unigram-reference AG, except that it stipulates that at most one word per sentence is associated with a (non-null) object. [sent-218, score-0.869]
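A sketch of the rule replacement described above: the lexical rules Bo → w are dropped in favour of Bo → Phoneme+, and the Bo nonterminals are marked as adapted so that whole phoneme subsequences are cached. The rule representation, nonterminal names and phoneme inventory are illustrative assumptions.

```python
# Minimal sketch (not from the paper's code): turn a reference grammar into a
# joint segmentation grammar by replacing Bo -> w with Bo -> Phoneme+, and
# marking the Bo nonterminals as adapted.
PHONEMES = list("IzD&t6pg")

def add_segmentation_rules(rules, objects):
    O_prime = list(objects) + [None]
    # Drop lexical rules Bo -> w; Bo now expands to one or more phonemes.
    rules = [r for r in rules if not r[0].startswith("B_")]
    for o in O_prime:
        rules.append((f"B_{o}", ("Phonemes",)))          # Bo -> Phoneme+
    rules.append(("Phonemes", ("Phoneme",)))
    rules.append(("Phonemes", ("Phoneme", "Phonemes")))
    rules += [("Phoneme", (ph,)) for ph in PHONEMES]
    adapted = {f"B_{o}" for o in O_prime}                # cached, reusable subtrees
    return rules, adapted

toy_rules = [("S", ("A_PIG", "B_PIG")), ("B_PIG", ("pIg",))]
new_rules, adapted = add_segmentation_rules(toy_rules, {"PIG"})
print(adapted, new_rules[-3:])
```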
70 It is also possible to combine the word-object reference PCFGs with the collocation AG. [sent-219, score-0.244]
71 The collocation-reference AG is a combination of the collocation AG for word segmentation and the reference PCFG for modelling the word-object relationship. [sent-221, score-0.822]
72 It permits an arbitrary number of words in a sentence to be referential. [sent-222, score-0.293]
73 Interestingly, there are two different reasonable ways of combining the collocation AG with the reference1 PCFG. [sent-223, score-0.199]
74 The collocation-reference1 AG requires that at most one word in a sentence is referential, just like the reference1 PCFG (3). [sent-224, score-0.579]
75 The collocation-referenceC1 AG is similar to the collocation-reference1 AG, except that it requires that at most one word in a collocation is referential. [sent-225, score-0.643]
76 This means that the collocation-referenceC1 AG permits multiple referential words in a sentence (but they must all refer to the same object). [sent-226, score-0.444]
77 This AG is linguistically plausible because a collocation often consists of a content word, which may be referential, surrounded by function words, which are generally not referential. [sent-227, score-0.199]
78 4 Experimental results We used the same training corpus as Jones et al [8], which was based on the corpus collected by Fernald et al [19] annotated with the objects in the non-linguistic context by Frank et al [7]. [sent-228, score-0.686]
79 For each sentence in each sample we extracted the word segmentation and the word-object relationships the parse implies, so we obtained 2,000 sample analyses for each sentence in the corpus. [sent-234, score-1.01]
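Word segmentation accuracy of the kind reported below is normally summarised as a token f-score over word spans; the following sketch (ours, not the paper's evaluation code) shows one standard way to compute it.

```python
# Minimal sketch (not from the paper's code): word-token f-score computed by
# comparing the (start, end) spans of predicted and gold words.
def spans(words):
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def token_f_score(predicted, gold):
    p, g = spans(predicted), spans(gold)
    if not p or not g:
        return 0.0
    precision = len(p & g) / len(p)
    recall = len(p & g) / len(g)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(token_f_score(["Iz", "D&t", "D6pIg"], ["Iz", "D&t", "D6", "pIg"]))  # ~0.571
```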
80 Perhaps the most basic question is: does non-linguistic context help word segmentation? [sent-238, score-0.416]
81 Jones et al [8] investigated this question by comparing analyses from what we are calling the unigram and unigram-reference models, and failed to find any overall effect of the non-linguistic context (although they did show that it improves the segmentation accuracy of referential words). [sent-240, score-0.639]
82 However, as the following table shows, we do see a marked improvement in word segmentation f-score when we combine non-linguistic context with the more accurate collocation models. [sent-241, score-0.811]
83 (Table: word segmentation f-score for the unigram, unigram-reference, unigram-reference1, collocation, collocation-reference, collocation-reference1 and collocation-referenceC1 models; only the final value, 0.750, is recoverable from this extraction.) [sent-242, score-0.908]
84 We can also ask the converse question: does better word segmentation improve sentence referent identification? [sent-249, score-0.837]
85 Here we measure how well the models identify which object, if any, each sentence refers to; this measure does not directly evaluate word segmentation accuracy. [sent-250, score-0.775]
86 The baseline model here assigns each sentence the “null” ∅ object, achieving an accuracy of 0. [sent-251, score-0.197]
87 We can also measure the f-score with which the models identify non-∅ sentence referents; now the trivial baseline model achieves 0 f-score. [sent-254, score-0.197]
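A sketch (ours, not the paper's evaluation code) of the two referent measures just described: accuracy over all sentences, and f-score restricted to non-null referents, so that the trivial all-null baseline gets the null-rate as its accuracy but 0 f-score.

```python
# Minimal sketch (not from the paper's code): sentence-referent accuracy and
# f-score over non-null referents. The null referent is written None.
def referent_scores(predicted, gold):
    accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
    pred_nonnull = sum(p is not None for p in predicted)
    gold_nonnull = sum(g is not None for g in gold)
    correct_nonnull = sum(p == g and g is not None for p, g in zip(predicted, gold))
    precision = correct_nonnull / pred_nonnull if pred_nonnull else 0.0
    recall = correct_nonnull / gold_nonnull if gold_nonnull else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f

gold = ["PIG", None, "DOG", None]
print(referent_scores(["PIG", None, None, None], gold))   # (0.75, ~0.667)
```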
88 (Table: sentence referent accuracy for the unigram, unigram-reference, unigram-reference1, collocation, collocation-reference, collocation-reference1 and collocation-referenceC1 models; only the final value, 0.747, is recoverable from this extraction.) [sent-255, score-0.589]
89 We see a marked improvement in sentence referent accuracy and sentence referent f-score with the collocation-referenceC1 AG. [sent-267, score-0.518]
90 This is a single number that indicates how good the models are at identifying referring words and the objects that they refer to. [sent-270, score-0.269]
91 (Table: topical word f-score for the unigram, unigram-reference, unigram-reference1, colloc, collocation-reference, collocation-reference1 and collocation-referenceC1 models; the unigram model scores 0, and only the final value, 0.636, is otherwise recoverable from this extraction.) [sent-271, score-0.612]
92 Again, we find that the collocation-referenceC1 AG identifies referring words and the objects they refer to more accurately than the other models. [sent-276, score-0.235]
93 5 Conclusion This paper has used Adaptor Grammars (AGs) to formulate a variety of models that jointly segment utterances into words and identify the objects in the non-linguistic context that some of these words refer to. [sent-277, score-0.362]
94 The AGs differ in the kinds of generalisations they are capable of learning, and in the relationship between word segmentation and word reference that they assume. [sent-278, score-1.041]
95 An efficient, probabilistically sound algorithm for segmentation and word discovery. [sent-307, score-0.578]
96 Using speakers’ referential intentions to model early cross-situational word learning. [sent-311, score-0.494]
97 A Bayesian framework for word segmentation: Exploring the effects of context. [sent-321, score-0.382]
98 Using adaptor grammars to identify synergies in the unsupervised acquisition of linguistic structure. [sent-340, score-0.699]
99 PCFGs, topic models, adaptor grammars and learning topical collocations and the structure of proper names. [sent-350, score-0.561]
100 Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. [sent-354, score-0.864]
wordName wordTfidf (topN-words)
[('word', 0.382), ('ag', 0.31), ('adaptor', 0.286), ('ao', 0.24), ('pig', 0.229), ('pcfg', 0.218), ('collocation', 0.199), ('sentence', 0.197), ('segmentation', 0.196), ('grammars', 0.188), ('ags', 0.153), ('acquisition', 0.151), ('unigram', 0.131), ('pcfgs', 0.131), ('al', 0.128), ('utterance', 0.119), ('referential', 0.112), ('bo', 0.11), ('grammar', 0.107), ('colloc', 0.099), ('nonterminals', 0.099), ('nonterminal', 0.098), ('words', 0.096), ('language', 0.082), ('phoneme', 0.08), ('gx', 0.08), ('labelled', 0.075), ('jones', 0.073), ('johnson', 0.073), ('frank', 0.062), ('objects', 0.062), ('apig', 0.062), ('referent', 0.062), ('similiar', 0.062), ('goldwater', 0.06), ('generates', 0.058), ('dog', 0.055), ('cfg', 0.055), ('interactive', 0.054), ('referents', 0.05), ('tx', 0.049), ('corpus', 0.046), ('mark', 0.045), ('reference', 0.045), ('linguistics', 0.044), ('collocations', 0.044), ('object', 0.044), ('expanding', 0.043), ('topic', 0.043), ('synergies', 0.04), ('refer', 0.039), ('association', 0.039), ('et', 0.038), ('referring', 0.038), ('parse', 0.038), ('sharon', 0.038), ('hx', 0.038), ('lda', 0.037), ('trees', 0.036), ('relationship', 0.036), ('segment', 0.035), ('tree', 0.035), ('context', 0.034), ('linguistic', 0.034), ('gs', 0.034), ('rule', 0.033), ('skeletal', 0.033), ('unsegmented', 0.033), ('brent', 0.033), ('pronunciations', 0.033), ('cx', 0.033), ('adapted', 0.031), ('synergistic', 0.03), ('parses', 0.03), ('informally', 0.03), ('north', 0.029), ('technologies', 0.029), ('dirichlet', 0.029), ('phonemes', 0.028), ('noun', 0.027), ('strings', 0.027), ('indicated', 0.026), ('xx', 0.026), ('bigram', 0.025), ('bevan', 0.025), ('bpig', 0.025), ('elman', 0.025), ('estes', 0.025), ('fernald', 0.025), ('infants', 0.025), ('macquarie', 0.025), ('nsw', 0.025), ('phonemicised', 0.025), ('tda', 0.025), ('tdx', 0.025), ('identi', 0.025), ('speech', 0.024), ('sentences', 0.024), ('null', 0.024), ('inference', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 264 nips-2010-Synergies in learning words and their referents
Author: Mark Johnson, Katherine Demuth, Bevan Jones, Michael J. Black
Abstract: This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these. 1
2 0.2353427 285 nips-2010-Why are some word orders more common than others? A uniform information density account
Author: Luke Maurits, Dan Navarro, Amy Perfors
Abstract: Languages vary widely in many ways, including their canonical word order. A basic aspect of the observed variation is the fact that some word orders are much more common than others. Although this regularity has been recognized for some time, it has not been well-explained. In this paper we offer an informationtheoretic explanation for the observed word-order distribution across languages, based on the concept of Uniform Information Density (UID). We suggest that object-first languages are particularly disfavored because they are highly nonoptimal if the goal is to distribute information content approximately evenly throughout a sentence, and that the rest of the observed word-order distribution is at least partially explainable in terms of UID. We support our theoretical analysis with data from child-directed speech and experimental work. 1
3 0.15789838 286 nips-2010-Word Features for Latent Dirichlet Allocation
Author: James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, Alex J. Smola
Abstract: We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model. 1
4 0.1556434 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
Author: Yariv Maron, Elie Bienenstock, Michael James
Abstract: Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. 1 Introduction The embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. Most methods work on objects of a single type, endowed with a measure of similarity. Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging, problem in natural language processing [1, 4, 5, 6, 7]. The problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). The induction task is fully unsupervised, i.e., it uses no annotations. This task has been addressed in the past using a variety of methods. Some approaches, such as [1], combine a Markovian assumption with clustering. Many recent works use HMMs, perhaps due to their excellent performance on the supervised version of the task [7, 2, 5]. Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. Using a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects
5 0.15346174 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
6 0.089927785 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
7 0.087344207 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures
8 0.083299972 150 nips-2010-Learning concept graphs from text with stick-breaking priors
9 0.073825419 194 nips-2010-Online Learning for Latent Dirichlet Allocation
10 0.064638264 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
11 0.06231967 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
12 0.054656353 234 nips-2010-Segmentation as Maximum-Weight Independent Set
13 0.049099516 61 nips-2010-Direct Loss Minimization for Structured Prediction
14 0.048431139 125 nips-2010-Inference and communication in the game of Password
15 0.043949295 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
16 0.043753672 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
17 0.042934548 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
18 0.041813035 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
19 0.041495871 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
20 0.041293714 1 nips-2010-(RF)^2 -- Random Forest Random Field
topicId topicWeight
[(0, 0.103), (1, 0.04), (2, -0.033), (3, -0.042), (4, -0.201), (5, 0.05), (6, 0.075), (7, -0.018), (8, 0.005), (9, 0.051), (10, 0.094), (11, 0.033), (12, -0.084), (13, 0.05), (14, -0.015), (15, 0.017), (16, 0.114), (17, -0.011), (18, -0.086), (19, -0.125), (20, -0.03), (21, 0.041), (22, 0.121), (23, 0.06), (24, 0.28), (25, -0.207), (26, 0.096), (27, -0.072), (28, -0.012), (29, 0.034), (30, 0.075), (31, -0.016), (32, -0.133), (33, 0.062), (34, -0.079), (35, -0.125), (36, 0.091), (37, -0.06), (38, -0.075), (39, -0.105), (40, 0.098), (41, 0.063), (42, 0.05), (43, -0.06), (44, -0.059), (45, 0.082), (46, -0.003), (47, -0.003), (48, 0.095), (49, -0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.97989696 264 nips-2010-Synergies in learning words and their referents
Author: Mark Johnson, Katherine Demuth, Bevan Jones, Michael J. Black
Abstract: This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these. 1
2 0.89706635 285 nips-2010-Why are some word orders more common than others? A uniform information density account
Author: Luke Maurits, Dan Navarro, Amy Perfors
Abstract: Languages vary widely in many ways, including their canonical word order. A basic aspect of the observed variation is the fact that some word orders are much more common than others. Although this regularity has been recognized for some time, it has not been well-explained. In this paper we offer an informationtheoretic explanation for the observed word-order distribution across languages, based on the concept of Uniform Information Density (UID). We suggest that object-first languages are particularly disfavored because they are highly nonoptimal if the goal is to distribute information content approximately evenly throughout a sentence, and that the rest of the observed word-order distribution is at least partially explainable in terms of UID. We support our theoretical analysis with data from child-directed speech and experimental work. 1
3 0.70069432 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
Author: Yariv Maron, Elie Bienenstock, Michael James
Abstract: Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. 1 Introduction The embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. Most methods work on objects of a single type, endowed with a measure of similarity. Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging, problem in natural language processing [1, 4, 5, 6, 7]. The problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). The induction task is fully unsupervised, i.e., it uses no annotations. This task has been addressed in the past using a variety of methods. Some approaches, such as [1], combine a Markovian assumption with clustering. Many recent works use HMMs, perhaps due to their excellent performance on the supervised version of the task [7, 2, 5]. Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. Using a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects
4 0.62580448 286 nips-2010-Word Features for Latent Dirichlet Allocation
Author: James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, Alex J. Smola
Abstract: We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model. 1
5 0.56533545 125 nips-2010-Inference and communication in the game of Password
Author: Yang Xu, Charles Kemp
Abstract: Communication between a speaker and hearer will be most efficient when both parties make accurate inferences about the other. We study inference and communication in a television game called Password, where speakers must convey secret words to hearers by providing one-word clues. Our working hypothesis is that human communication is relatively efficient, and we use game show data to examine three predictions. First, we predict that speakers and hearers are both considerate, and that both take the other’s perspective into account. Second, we predict that speakers and hearers are calibrated, and that both make accurate assumptions about the strategy used by the other. Finally, we predict that speakers and hearers are collaborative, and that they tend to share the cognitive burden of communication equally. We find evidence in support of all three predictions, and demonstrate in addition that efficient communication tends to break down when speakers and hearers are placed under time pressure.
6 0.38671499 107 nips-2010-Global seismic monitoring as probabilistic inference
7 0.37376001 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
8 0.34913352 150 nips-2010-Learning concept graphs from text with stick-breaking priors
9 0.32838079 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
10 0.29883757 61 nips-2010-Direct Loss Minimization for Structured Prediction
11 0.28502989 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
12 0.27993914 234 nips-2010-Segmentation as Maximum-Weight Independent Set
13 0.26026571 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
14 0.25850236 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
15 0.24485971 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
16 0.2371489 1 nips-2010-(RF)^2 -- Random Forest Random Field
17 0.23648058 281 nips-2010-Using body-anchored priors for identifying actions in single images
18 0.22931384 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures
19 0.22591037 153 nips-2010-Learning invariant features using the Transformed Indian Buffet Process
20 0.21492784 256 nips-2010-Structural epitome: a way to summarize one’s visual experience
topicId topicWeight
[(13, 0.017), (22, 0.012), (27, 0.057), (30, 0.608), (45, 0.105), (50, 0.029), (52, 0.014), (77, 0.013), (78, 0.011), (90, 0.018)]
simIndex simValue paperId paperTitle
1 0.93948609 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
Author: Hariharan Narayanan, Sanjoy Mitter
Abstract: The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable. Whether the known lower bound of O((k + log(1/δ))/ε²) for the sample complexity of Empirical Risk minimization on k−means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997 [3]. Here ε is the desired bound on the error and δ is a bound on the probability of failure. We improve the best currently known upper bound [14] of O((k² + log(1/δ))/ε²) to O((k min(k, log⁴(k)/ε²) + log(1/δ))/ε²). Based on these results, we devise a simple algorithm for k−means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension. 1
same-paper 2 0.9178403 264 nips-2010-Synergies in learning words and their referents
Author: Mark Johnson, Katherine Demuth, Bevan Jones, Michael J. Black
Abstract: This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these. 1
3 0.86293292 283 nips-2010-Variational Inference over Combinatorial Spaces
Author: Alexandre Bouchard-côté, Michael I. Jordan
Abstract: Since the discovery of sophisticated fully polynomial randomized algorithms for a range of #P problems [1, 2, 3], theoretical work on approximate inference in combinatorial spaces has focused on Markov chain Monte Carlo methods. Despite their strong theoretical guarantees, the slow running time of many of these randomized algorithms and the restrictive assumptions on the potentials have hindered the applicability of these algorithms to machine learning. Because of this, in applications to combinatorial spaces simple exact models are often preferred to more complex models that require approximate inference [4]. Variational inference would appear to provide an appealing alternative, given the success of variational methods for graphical models [5]; unfortunately, however, it is not obvious how to develop variational approximations for combinatorial objects such as matchings, partial orders, plane partitions and sequence alignments. We propose a new framework that extends variational inference to a wide range of combinatorial spaces. Our method is based on a simple assumption: the existence of a tractable measure factorization, which we show holds in many examples. Simulations on a range of matching models show that the algorithm is more general and empirically faster than a popular fully polynomial randomized algorithm. We also apply the framework to the problem of multiple alignment of protein sequences, obtaining state-of-the-art results on the BAliBASE dataset [6]. 1
4 0.75469714 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
Author: Ronny Luss, Saharon Rosset, Moni Shahar
Abstract: A new algorithm for isotonic regression is presented based on recursively partitioning the solution space. We develop efficient methods for each partitioning subproblem through an equivalent representation as a network flow problem, and prove that this sequence of partitions converges to the global solution. These network flow problems can further be decomposed in order to solve very large problems. Success of isotonic regression in prediction and our algorithm’s favorable computational properties are demonstrated through simulated examples as large as 2 × 10^5 variables and 10^7 constraints.
5 0.7235108 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
Author: Tian Lan, Yang Wang, Weilong Yang, Greg Mori
Abstract: We propose a discriminative model for recognizing group activities. Our model jointly captures the group activity, the individual person actions, and the interactions among them. Two new types of contextual information, group-person interaction and person-person interaction, are explored in a latent variable framework. Different from most of the previous latent structured models which assume a predefined structure for the hidden layer, e.g. a tree structure, we treat the structure of the hidden layer as a latent variable and implicitly infer it during learning and inference. Our experimental results demonstrate that by inferring this contextual information together with adaptive structures, the proposed model can significantly improve activity recognition performance. 1
6 0.68009168 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
7 0.63596141 220 nips-2010-Random Projection Trees Revisited
8 0.55582619 222 nips-2010-Random Walk Approach to Regret Minimization
9 0.5424704 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
10 0.51999831 285 nips-2010-Why are some word orders more common than others? A uniform information density account
11 0.51977384 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
12 0.51618081 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
13 0.50894952 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
14 0.49463207 221 nips-2010-Random Projections for $k$-means Clustering
15 0.4880898 155 nips-2010-Learning the context of a category
16 0.48329553 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
17 0.47262725 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
18 0.47204566 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
19 0.47146735 233 nips-2010-Scrambled Objects for Least-Squares Regression
20 0.47068417 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points