nips nips2007 nips2007-210 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, Santiago Fernández
Abstract: In online handwriting recognition the trajectory of the pen is recorded during writing. Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. However these techniques require considerable human effort, and are specific to particular languages and alphabets. This paper describes a system capable of directly transcribing raw online handwriting data. The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In online handwriting recognition the trajectory of the pen is recorded during writing. [sent-8, score-0.724]
2 Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. [sent-9, score-0.22]
3 Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. [sent-10, score-0.186]
4 This paper describes a system capable of directly transcribing raw online handwriting data. [sent-12, score-0.596]
5 The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. [sent-13, score-0.528]
6 In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases. [sent-14, score-0.486]
7 In online handwriting the location of the pen-tip on a surface is recorded at regular intervals, and the task is to map from the sequence of pen positions to the sequence of words. [sent-17, score-0.786]
8 At first sight, it would seem straightforward to label raw online inputs directly. [sent-18, score-0.387]
9 However, the fact that each letter or word is distributed over many pen positions poses a problem for conventional sequence labelling algorithms, which have difficulty processing data with long-range interdependencies. [sent-19, score-0.612]
10 The issue of classifying preprocessed versus raw data has broad relevance to machine learning, and merits further discussion. [sent-25, score-0.322]
11 However, there are three points to consider in favour of raw data. [sent-27, score-0.183]
12 For example, features designed for English handwriting could not be applied to languages with substantially different alphabets, such as Arabic or Chinese. [sent-30, score-0.337]
13 In contrast, a system trained directly on pen movements could be applied to any alphabet. [sent-31, score-0.279]
14 Thirdly, using raw data allows feature extraction to be built into the classifier, and the whole system to be trained together. [sent-32, score-0.242]
15 For example, convolutional neural networks [10], in which a globally trained hierarchy of network layers is used to extract progressively higher level features, have proved effective at classifying raw images, such as objects in cluttered scenes or isolated handwritten characters [15, 11]. [sent-33, score-0.392]
16 In this paper, we apply a recurrent neural network (RNN) to online handwriting recognition. [sent-35, score-0.548]
17 The RNN uses the recently introduced connectionist temporal classification output layer [2], which was specifically designed for labelling unsegmented sequence data. [sent-37, score-0.52]
18 An algorithm is introduced for applying grammatical constraints to the network outputs, thereby providing word level transcriptions. [sent-38, score-0.255]
19 The performance of the RNN system using both raw and preprocessed input data is compared to that of an HMM based system using preprocessed data only [13]. [sent-40, score-0.581]
20 To the best of our knowledge, this is the first time whole sentences of unconstrained handwriting have been directly transcribed from raw online data. [sent-41, score-0.623]
21 Section 2 describes the network architecture, the output layer and the algorithm for applying grammatical constraints. [sent-42, score-0.267]
22 The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the recurrent connections. [sent-47, score-0.226]
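As a toy illustration of this effect (my own, not from the paper), take a single linear recurrent unit with weight w: the influence of an input after t further recurrent steps scales as w^t, so it vanishes for |w| < 1 and explodes for |w| > 1.

```python
# Toy numerical illustration (assumption: one linear recurrent unit, not the paper's network).
# The sensitivity of the output to an input presented t steps earlier scales as w**t.
for w in (0.5, 0.9, 1.1):
    print(w, [round(w ** t, 6) for t in (10, 50, 100)])
```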
23 2.2 Connectionist Temporal Classification Connectionist temporal classification (CTC) [2] is an objective function designed for sequence labelling with RNNs. [sent-57, score-0.286]
24 Instead, it trains the network to map directly from input sequences to the conditional probabilities of the possible labellings. [sent-59, score-0.242]
25 The combined output sequence estimates the joint probability of all possible alignments of the input sequence with all possible labellings. [sent-64, score-0.288]
26 The probability of a particular labelling can then be estimated by summing over the probabilities of all the alignments that correspond to it. [sent-65, score-0.216]
27 More precisely, for an input sequence x of length T, choosing a label (or blank) at every time step according to the probabilities implied by the network outputs defines a probability distribution over the set L′^T of length-T sequences of labels and blanks, where L′ is the label alphabet extended with the blank. [sent-66, score-0.517]
28 Assuming that the label probabilities at each time step are conditionally independent given x, the conditional probability of a path π ∈ L′^T is given by p(π|x) = ∏_{t=1}^{T} y^t_{π_t}, (1) where y^t_k is the activation of output unit k at time t. [sent-69, score-0.222]
29 For example, under the map B, which removes first the repeated symbols and then the blanks from a path, both B(a, −, a, b, −) and B(−, a, a, −, −, a, b, b) yield the labelling (a,a,b). [sent-72, score-0.181]
30 Since the paths are mutually exclusive, the conditional probability of a given labelling l ∈ L^{≤T} is the sum of the probabilities of all paths corresponding to it: p(l|x) = Σ_{π ∈ B^{-1}(l)} p(π|x). (2) [sent-73, score-0.324]
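To make equations (1) and (2) concrete, the brute-force sketch below (my illustration, not code from the paper) enumerates every length-T path for a toy output matrix; collapse implements B by removing repeated symbols and then blanks, as in the example above. It is exponential in T and only intended for tiny inputs.

```python
import itertools

def collapse(path, blank=0):
    """The map B: remove repeated symbols, then remove blanks."""
    squeezed = [k for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
    return tuple(k for k in squeezed if k != blank)

def labelling_prob(y, labelling, blank=0):
    """Brute-force p(l|x): sum of prod_t y[t][pi_t] over all paths pi with B(pi) = l."""
    T, K = len(y), len(y[0])
    total = 0.0
    for pi in itertools.product(range(K), repeat=T):   # every length-T path
        if collapse(pi, blank) == tuple(labelling):
            p = 1.0
            for t, k in enumerate(pi):
                p *= y[t][k]
            total += p
    return total

# Example with 3 time steps and outputs over {blank, a, b} coded as {0, 1, 2}
y = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
print(labelling_prob(y, [1, 2]))   # probability of the labelling (a, b)
```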
31 To allow for blanks in the output paths, for each label sequence l ∈ L^{≤T} consider a modified label sequence l′, with blanks added to the beginning and the end and inserted between every pair of labels. [sent-75, score-0.478]
32 For a labelling l, define the forward variable α_t(s) as the summed probability of all paths whose length-t prefixes are mapped by B onto the length ⌊s/2⌋ prefix of l. [sent-77, score-0.309]
33 The label sequence probability is given by the sum of the products of the forward and backward variables at any time step: p(l|x) = Σ_{s=1}^{|l′|} α_t(s) β_t(s). (5) [sent-84, score-0.181]
34 The objective function for CTC is the negative log probability of the network correctly labelling the entire training set. [sent-85, score-0.266]
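The forward and backward recursions themselves, equations (3) and (4), are not reproduced in this extract; the sketch below follows the standard CTC formulation and computes p(l|x) from the forward pass alone, using p(l|x) = α_T(|l′|) + α_T(|l′|−1). For small inputs it should agree with the brute-force version above; in practice the recursion is rescaled or run in log space to avoid underflow.

```python
import numpy as np

def ctc_forward(y, labels, blank=0):
    """p(l|x) via the standard CTC forward recursion.

    y:      T x K array of output activations y^t_k (each row sums to 1).
    labels: the target labelling l as a list of label indices (no blanks).
    """
    lp = [blank]                        # modified sequence l' = (b, l1, b, ..., ln, b)
    for k in labels:
        lp += [k, blank]
    T, S = len(y), len(lp)
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0][blank]           # a path may start with the leading blank ...
    if S > 1:
        alpha[0, 1] = y[0][lp[1]]       # ... or directly with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s > 0 else 0.0)
            if s > 1 and lp[s] != blank and lp[s] != lp[s - 2]:
                a += alpha[t - 1, s - 2]    # skip the blank between two different labels
            alpha[t, s] = a * y[t][lp[s]]
    tail = alpha[T - 1, S - 2] if S > 1 else 0.0
    return alpha[T - 1, S - 1] + tail
```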
35 Let S be a training set, consisting of pairs of input and target sequences (x, z), where target sequence z is at most as long as input sequence x. The objective function is then O^CTC = −Σ_{(x,z)∈S} ln p(z|x). (6) [sent-86, score-0.363]
36 The network can be trained with gradient descent by differentiating O^CTC with respect to the outputs, then using backpropagation through time to differentiate with respect to the network weights. [sent-88, score-0.2]
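A quick way to sanity-check such derivatives (my addition, reusing the ctc_forward sketch above on a single training pair) is a central-difference estimate of the gradient of O^CTC with respect to each output activation:

```python
import numpy as np

def ctc_loss(y, labels, blank=0):
    """O^CTC for a single pair (x, z): -ln p(z|x), via the forward sketch above."""
    return -np.log(ctc_forward(y, labels, blank))

def numerical_output_grad(y, labels, blank=0, eps=1e-6):
    """Central-difference estimate of d O^CTC / d y^t_k."""
    y = np.asarray(y, dtype=float)
    grad = np.zeros_like(y)
    for t in range(y.shape[0]):
        for k in range(y.shape[1]):
            plus, minus = y.copy(), y.copy()
            plus[t, k] += eps
            minus[t, k] -= eps
            grad[t, k] = (ctc_loss(plus, labels, blank) -
                          ctc_loss(minus, labels, blank)) / (2 * eps)
    return grad
```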
37 Noting that the same label (or blank) may be repeated several times for a single labelling l, we define the set of positions where label k occurs as lab(l, k) = {s : l′_s = k}, which may be empty. The derivative of O^CTC with respect to the output y^t_k can then be written in terms of a sum over the positions s ∈ lab(z, k) of the products α_t(s)β_t(s) (equation (7)). [sent-89, score-0.362]
38 Once the network is trained, we would ideally label some unknown input sequence x by choosing the most probable labelling l∗: l∗ = arg max_l p(l|x). (8) [sent-91, score-0.514]
39 Using the terminology of HMMs, we refer to the task of finding this labelling as decoding. [sent-92, score-0.181]
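In general there is no known tractable algorithm for the exact maximiser of (8); a common approximation, best-path decoding, simply takes the most active output at every time step and applies B (reusing the collapse sketch above). This is not the paper's decoder, which is the grammar-constrained algorithm described next.

```python
def best_path_decode(y, blank=0):
    """Approximate arg max_l p(l|x): take the most active output at each time
    step and collapse the resulting path with B (best-path decoding)."""
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in y]
    return collapse(path, blank)
```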
40 For example, in speech and handwriting recognition, the final transcriptions are usually required to form sequences of dictionary words. [sent-99, score-0.472]
41 We can express these constraints by altering the probabilities in (8) to be conditioned on some probabilistic grammar G, as well as the input sequence x: l∗ = arg max p(l|x, G). [sent-101, score-0.244]
42 Since the network attempts to model the probability of the whole labelling at once, there is nothing to stop it from learning inter-label transitions direct from the data, which would then be skewed by the external grammar. [sent-105, score-0.266]
43 Therefore as long as G focuses on long range label interactions (such as the probability of one word following another when the outputs are letters) it doesn’t interfere with the dependencies modelled by CTC. [sent-107, score-0.364]
44 That assumption is in general false, since both the input sequences and the grammar depend on the underlying generator of the data, for example the language being spoken. [sent-110, score-0.276]
45 Finally, if we assume that all label sequences are equally probable prior to any knowledge about the input or the grammar, we can drop the p(l) term in the denominator to get l∗ = arg max_l p(l|x) p(l|G). (11) [sent-112, score-0.235]
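As a toy rendering of (11) (mine, not the paper's method), a short candidate list can be rescored by adding the grammar log-probability to the CTC log-probability of each labelling; Algorithm 1 below performs the same maximisation properly over an entire dictionary.

```python
import math

def rescore(y, candidates, grammar_logp, blank=0):
    """arg max over a candidate list of ln p(l|x) + ln p(l|G), as in equation (11)."""
    def score(l):
        p = labelling_prob(y, l, blank)      # or ctc_forward(y, l, blank)
        return (math.log(p) if p > 0 else float("-inf")) + grammar_logp(l)
    return max(candidates, key=score)
```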
46 We now describe an algorithm, based on the token passing algorithm for HMMs [16], that allows us to find an approximate solution to (11) for a simple grammar. [sent-114, score-0.278]
47 Let G consist of a dictionary D containing W words, and a set of W² bigrams p(ŵ|w) that define the probability of making a transition from word w to word ŵ. [sent-115, score-0.376]
48 The probability of any labelling that does not form a sequence of dictionary words is 0. [sent-116, score-0.313]
49 For each word w, define the modified word w′ as w with blanks added at the beginning and end and between each pair of labels. [sent-117, score-0.331]
50 Define a token tok = (score, history) to be a pair consisting of a real valued score and a history of previously visited words. [sent-119, score-0.797]
51 In fact, each token corresponds to a particular path through the network outputs, and its score is the log probability of that path. [sent-120, score-0.384]
52 The basic idea of the token passing algorithm is to pass along the highest scoring tokens at every word state, then maximise over these to find the highest scoring tokens at the next state. [sent-121, score-0.782]
53 The transition probabilities are used when a token is passed from the last state in one word to the first state in another. [sent-122, score-0.418]
54 The output word sequence is given by the history of the highest scoring end-of-word token at the final time step. [sent-123, score-0.672]
55 At every time step t of the length T output sequence, each segment s of each modified word w′ holds a single token tok(w, s, t). [sent-124, score-0.5]
56 This is the highest scoring token reaching that segment at that time. [sent-125, score-0.381]
57 In addition we define the input token tok(w, 0, t) to be the highest scoring token arriving at word w at time t, and the output token tok(w, −1, t) to be the highest scoring token leaving word w at time t. [sent-126, score-1.67]
58 tok(w, 0, t).history += w
for segment s = 1 to |w′| do
    P = {tok(w, s, t − 1), tok(w, s − 1, t − 1)}
    if w′_s ≠ blank and s > 2 and w′_{s−2} ≠ w′_s then add tok(w, s − 2, t − 1) to P
    tok(w, s, t) = token in P with highest score
    tok(w, s, t).score += ln(y^t_{w′_s}) [sent-132, score-0.522]
59 tok(w, −1, t) = highest scoring of {tok(w, |w′|, t), tok(w, |w′| − 1, t)}
Termination: find the output token tok∗(w, −1, T) with the highest score at time T and output tok∗(w, −1, T).history [sent-133, score-1.706]
60 However, because the output tokens tok(w, −1, T ) are sorted in order of score, the search can be terminated when a token is reached whose score is less than the current best score with the transition included. [sent-135, score-0.48]
61 If no bigrams are used, lines 14-16 can be replaced by a simple search for the highest scoring output token, and the complexity reduces to O(T W ). [sent-137, score-0.257]
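The sketch below is my reading of Algorithm 1 rather than a transcription of it: each dictionary word is expanded into its blank-padded form w′, tokens are propagated through the segments as in the loop above, and bigram log-probabilities are added when a token leaves one word and enters another. Two simplifications are assumptions of mine: a word may be entered either at its leading blank or at its first label, and the corner case of identical labels meeting at a word boundary is ignored.

```python
NEG_INF = float("-inf")

def expand(labels, blank=0):
    """Blank-padded word model w' = (b, w1, b, w2, ..., wn, b)."""
    out = [blank]
    for k in labels:
        out += [k, blank]
    return out

def ctc_token_passing(logp, lexicon, bigram_logp, blank=0):
    """Best word sequence under frame log-probabilities plus bigram log-probabilities.

    logp:        T frames, each a sequence of K log activations ln y^t_k.
    lexicon:     dict mapping word -> list of label indices.
    bigram_logp: dict mapping (previous word, word) -> log transition probability.
    """
    words = list(lexicon)
    states = {w: expand(lexicon[w], blank) for w in words}
    empty = (NEG_INF, ())

    tok = {w: [empty] * len(states[w]) for w in words}   # tok[w][s] = (score, history)
    for w in words:              # at t = 0 a word starts at its leading blank or first label
        tok[w][0] = (logp[0][blank], (w,))
        tok[w][1] = (logp[0][states[w][1]], (w,))

    for t in range(1, len(logp)):
        # token leaving each word at t-1: the better of its last two segments
        leave = {w: max(tok[w][-1], tok[w][-2], key=lambda x: x[0]) for w in words}
        new = {w: [empty] * len(states[w]) for w in words}
        for w in words:
            st = states[w]
            # token entering w at time t: best predecessor word plus its bigram score
            entering = empty
            for score, hist in leave.values():
                if not hist:
                    continue
                cand = score + bigram_logp.get((hist[-1], w), NEG_INF)
                if cand > entering[0]:
                    entering = (cand, hist + (w,))
            for s in range(len(st)):
                cands = [tok[w][s]]                                   # stay in the segment
                cands.append(tok[w][s - 1] if s >= 1 else entering)   # advance, or enter at the blank
                if s == 1:
                    cands.append(entering)                            # or enter at the first label
                if s >= 2 and st[s] != blank and st[s] != st[s - 2]:
                    cands.append(tok[w][s - 2])                       # skip a blank between different labels
                best = max(cands, key=lambda x: x[0])
                new[w][s] = (best[0] + logp[t][st[s]], best[1])
        tok = new

    final = max((max(tok[w][-1], tok[w][-2], key=lambda x: x[0]) for w in words),
                key=lambda x: x[0])
    return list(final[1]), final[0]
```

Recomputing the best predecessor for every word costs O(W²) per time step; the sorted early termination described above, or dropping the bigrams entirely (the O(TW) case mentioned in the text), avoids most of that work.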
62 Any HMM decoding algorithm could be applied to CTC outputs in the same way as token passing. [sent-144, score-0.349]
63 3 Experiments The experimental task was online handwriting recognition, using the IAM-OnDB handwriting database [12], which is available for public download from http://www. [sent-146, score-0.719]
64 ch/ fki/iamondb/ For CTC, we record both the character error rate, and the word error rate using Algorithm 1 with a language model and a dictionary. [sent-149, score-0.285]
65 Both the character and word error rate are defined as the total number of insertions, deletions and substitutions in the algorithm’s transcription of the test set, divided by the combined length of the target transcriptions in the test set. [sent-151, score-0.29]
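A self-contained way to compute these error rates (an illustration, not the paper's evaluation code) is the usual edit-distance ratio; the same function gives the character error rate on character sequences and the word error rate on word sequences.

```python
def error_rate(references, hypotheses):
    """Total insertions + deletions + substitutions over all test items,
    divided by the combined length of the reference transcriptions."""
    def edit_distance(a, b):
        d = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, y in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
        return d[len(b)]
    edits = sum(edit_distance(hyp, ref) for ref, hyp in zip(references, hypotheses))
    return edits / sum(len(ref) for ref in references)

# word error rate: pass lists of words; character error rate: pass strings
print(error_rate(["the cat sat".split()], ["the cat hat sat".split()]))  # one insertion over 3 words
```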
66 We compare results using both raw inputs direct from the pen sensor, and a preprocessed input representation designed for HMMs. [sent-152, score-0.69]
67 3.1 Data and Preprocessing IAM-OnDB consists of pen trajectories collected from 221 different writers using a ‘smart whiteboard’ [12]. [sent-154, score-0.257]
68 The writers were asked to write forms from the LOB text corpus [8], and the position of their pen was tracked using an infra-red device in the corner of the board. [sent-155, score-0.257]
69 The input data consisted of the x and y pen coordinates, the points in the sequence when individual strokes (i.e. [sent-156, score-0.399]
70 periods when the pen is pressed against the board) end, and the times when successive position measurements were made. [sent-158, score-0.22]
71 The data sets contained a total of 3,298,424, 885,964, 1,036,803 and 2,425,5242 pen coordinates respectively. [sent-161, score-0.29]
72 We refer to this as the raw input representation. [sent-169, score-0.245]
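Under that description, one plausible packing of the four raw inputs per sample point is sketched below; the exact field order and any normalisation are my assumptions, not taken from the paper.

```python
def raw_features(strokes):
    """strokes: list of pen strokes, each a list of (x, y, time) samples.
    Returns one 4-tuple per sample: x, y, end-of-stroke flag, timestamp."""
    feats = []
    for stroke in strokes:
        for i, (x, y, t) in enumerate(stroke):
            end_of_stroke = 1.0 if i == len(stroke) - 1 else 0.0
            feats.append((x, y, end_of_stroke, t))
    return feats
```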
73 We refer to this as the preprocessed input representation. [sent-171, score-0.201]
74 Briefly, in order to account for the variance in writing styles, the pen trajectories were normalised with respect to such properties as the slant, skew and width of the letters, and the slope of the line as a whole. [sent-172, score-0.261]
75 Two sets of input features were then extracted, the first consisting of ‘online’ features, such as pen position, pen speed, line curvature etc. [sent-173, score-0.502]
76 The input layer was fully connected to the hidden layers, which were fully connected to themselves and the output layer. [sent-179, score-0.212]
77 The output layer contained 81 units (80 characters plus the blank label). [sent-180, score-0.318]
78 For the raw input representation, there were 4 input units and a total of 100,881 weights. [sent-181, score-0.307]
79 For the preprocessed representation, there were 25 inputs and 117,681 weights. [sent-182, score-0.193]
80 The network was trained with online gradient descent, using a learning rate of 10−4 and a momentum of 0. [sent-185, score-0.194]
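A present-day sketch of this setup might look as follows; PyTorch, the hidden size of 100 per direction, and the momentum value of 0.9 (the digit after "0." is cut off in the extract) are all my assumptions, while the input sizes (4 raw, 25 preprocessed), the 81 output units including the blank, online gradient descent and the learning rate of 10−4 come from the text above.

```python
import torch
import torch.nn as nn

class BLSTMCTC(nn.Module):
    def __init__(self, num_inputs, num_classes=81, hidden_size=100):   # hidden_size assumed
        super().__init__()
        self.rnn = nn.LSTM(num_inputs, hidden_size, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_classes)             # 80 characters + blank

    def forward(self, x):                        # x: (T, batch, num_inputs)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)   # (T, batch, num_classes) log probabilities

model = BLSTMCTC(num_inputs=4)                   # 4 for raw input, 25 for preprocessed
ctc = nn.CTCLoss(blank=80)                       # assume the blank is the last output unit
optim = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)     # 0.9 is a stand-in

def train_step(x, targets, input_lengths, target_lengths):
    optim.zero_grad()
    log_probs = model(x)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()                              # backpropagation through time
    optim.step()
    return loss.item()
```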
81 Table 1: System (HMM, CTC, CTC, CTC, CTC), Input (preprocessed, raw, preprocessed, raw, preprocessed), LM, WER 35. [sent-193, score-0.783]
82 All parameters, including the word insertion penalty and the grammar scale factor, were optimised on the validation set. [sent-203, score-0.241]
83 3.3 Results The character error rate for the CTC network with the preprocessed inputs was 11. [sent-205, score-0.345]
84 From Table 1 we can see that with a dictionary and a language model this translates into a mean word error rate of 20. [sent-208, score-0.277]
85 With the raw input data CTC achieved a character error rate of 13. [sent-213, score-0.312]
86 1%, and word error rates that were close to those recorded with the preprocessed data, particularly when the language model was present. [sent-215, score-0.393]
87 The key difference between the input representations is that the raw data is less localised, and therefore requires more use of context. [sent-216, score-0.245]
88 A useful indication of the network’s sensitivity to context is provided by the derivatives of the output y^t_k at a particular point t in the data sequence with respect to the inputs x^{t′}_{k′} at all points 1 ≤ t′ ≤ T. [sent-217, score-0.236]
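Such a sensitivity profile can be read off with automatic differentiation; the sketch below reuses the hypothetical BLSTMCTC model above, differentiates one output activation with respect to the whole input sequence, and sums absolute values over the input channels, which is one reasonable way to produce a curve like the Sequential Jacobian of Figure 1.

```python
import torch

net = BLSTMCTC(num_inputs=4)                      # the hypothetical model sketched earlier
x = torch.randn(200, 1, 4, requires_grad=True)    # T = 200 frames of raw input, batch of 1
log_probs = net(x)
t_out, k = 100, 5                                 # an arbitrary output point and unit
jac, = torch.autograd.grad(log_probs[t_out, 0, k], x)
sensitivity = jac.abs().sum(dim=-1).squeeze()     # one value per input time step
```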
89 We have applied this system to an online handwriting database and obtained results that substantially improve on a state-of-the-art HMM based system. [sent-221, score-0.443]
90 We have also shown that the network’s performance with raw sensor inputs is comparable to that with sophisticated preprocessing. [sent-222, score-0.237]
91 As far as we are aware, our system is the first to successfully recognise unconstrained online handwriting using raw inputs only. [sent-223, score-0.706]
92 Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. [sent-238, score-0.184]
93 Framewise phoneme classification with bidirectional LSTM and other neural network architectures. [sent-246, score-0.217]
94 Writer independent on-line handwriting recognition using an HMM approach. [sent-272, score-0.389]
95 Figure 1: Sequential Jacobian for an excerpt from the IAM-OnDB, with raw inputs (left) and preprocessed inputs (right). [sent-274, score-0.43]
96 The reconstructed image was created by plotting the pen coordinates recorded by the sensor. [sent-276, score-0.293]
97 For the preprocessed data, the Jacobian is sharply peaked around the time when the output is emitted. [sent-280, score-0.219]
98 For the raw data it is more spread out, suggesting that the network makes more use of long-range context. [sent-281, score-0.268]
99 Note the spike in sensitivity to the very end of the raw input sequence: this corresponds to the delayed dot of the ‘i’. [sent-282, score-0.274]
100 A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks. [sent-341, score-0.587]
wordName wordTfidf (topN-words)
[('tok', 0.498), ('ctc', 0.314), ('handwriting', 0.305), ('token', 0.245), ('pen', 0.22), ('raw', 0.183), ('labelling', 0.181), ('preprocessed', 0.139), ('word', 0.138), ('bidirectional', 0.132), ('rnn', 0.111), ('hmm', 0.091), ('blank', 0.088), ('network', 0.085), ('recognition', 0.084), ('language', 0.08), ('output', 0.08), ('online', 0.079), ('recurrent', 0.079), ('scoring', 0.075), ('graves', 0.074), ('liwicki', 0.074), ('lstm', 0.074), ('grammar', 0.074), ('sequence', 0.073), ('label', 0.071), ('layer', 0.07), ('character', 0.067), ('input', 0.062), ('highest', 0.061), ('sequences', 0.06), ('dictionary', 0.059), ('outputs', 0.057), ('unconstrained', 0.056), ('blanks', 0.055), ('blstm', 0.055), ('bunke', 0.055), ('labellings', 0.055), ('lob', 0.055), ('ndez', 0.055), ('jacobian', 0.055), ('inputs', 0.054), ('score', 0.054), ('paths', 0.054), ('connectionist', 0.052), ('fern', 0.048), ('localised', 0.048), ('transcriptions', 0.048), ('hmms', 0.047), ('tokens', 0.047), ('characters', 0.047), ('layers', 0.047), ('decoding', 0.047), ('switzerland', 0.045), ('oct', 0.044), ('strokes', 0.044), ('ln', 0.043), ('probable', 0.042), ('document', 0.041), ('normalised', 0.041), ('bigrams', 0.041), ('ine', 0.041), ('ls', 0.039), ('length', 0.037), ('hochreiter', 0.037), ('rnns', 0.037), ('santiago', 0.037), ('snf', 0.037), ('tum', 0.037), ('wer', 0.037), ('writers', 0.037), ('ws', 0.037), ('architectures', 0.037), ('coordinates', 0.037), ('backward', 0.037), ('activation', 0.036), ('recorded', 0.036), ('probabilities', 0.035), ('sa', 0.034), ('memory', 0.033), ('contained', 0.033), ('long', 0.033), ('passing', 0.033), ('designed', 0.032), ('vanishing', 0.032), ('unsegmented', 0.032), ('interfere', 0.032), ('bern', 0.032), ('idsia', 0.032), ('grammatical', 0.032), ('letters', 0.031), ('trained', 0.03), ('database', 0.03), ('english', 0.03), ('xes', 0.029), ('lm', 0.029), ('sight', 0.029), ('optimised', 0.029), ('system', 0.029), ('sensitivity', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
Author: Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, Santiago Fernández
Abstract: In online handwriting recognition the trajectory of the pen is recorded during writing. Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. However these techniques require considerable human effort, and are specific to particular languages and alphabets. This paper describes a system capable of directly transcribing raw online handwriting data. The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases. 1
2 0.12116959 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
Author: Kristina Toutanova, Mark Johnson
Abstract: We present a novel Bayesian model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation model and incorporates the intuition that words’ distributions over tags, p(t|w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset. 1
3 0.096388385 133 nips-2007-Modelling motion primitives and their timing in biologically executed movements
Author: Ben Williams, Marc Toussaint, Amos J. Storkey
Abstract: Biological movement is built up of sub-blocks or motion primitives. Such primitives provide a compact representation of movement which is also desirable in robotic control applications. We analyse handwriting data to gain a better understanding of primitives and their timings in biological movements. Inference of the shape and the timing of primitives can be done using a factorial HMM based model, allowing the handwriting to be represented in primitive timing space. This representation provides a distribution of spikes corresponding to the primitive activations, which can also be modelled using HMM architectures. We show how the coupling of the low level primitive model, and the higher level timing model during inference can produce good reconstructions of handwriting, with shared primitives for all characters modelled. This coupled model also captures the variance profile of the dataset which is accounted for by spike timing jitter. The timing code provides a compact representation of the movement while generating a movement without an explicit timing model produces a scribbling style of output. 1
4 0.091310479 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
Author: Fred Richardson, William M. Campbell
Abstract: Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive sequences, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.
5 0.086159825 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
6 0.07521192 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
7 0.068067856 9 nips-2007-A Probabilistic Approach to Language Change
8 0.064974189 197 nips-2007-The Infinite Markov Model
9 0.06325046 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
10 0.063233241 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
11 0.059781581 115 nips-2007-Learning the 2-D Topology of Images
12 0.058941983 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
13 0.057529304 183 nips-2007-Spatial Latent Dirichlet Allocation
14 0.055150546 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
15 0.052552231 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
16 0.052357726 22 nips-2007-Agreement-Based Learning
17 0.050819747 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
18 0.04932167 182 nips-2007-Sparse deep belief net model for visual area V2
19 0.048599858 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
20 0.048523594 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
topicId topicWeight
[(0, -0.155), (1, 0.049), (2, 0.01), (3, -0.124), (4, 0.038), (5, -0.003), (6, 0.027), (7, -0.015), (8, 0.024), (9, -0.007), (10, 0.041), (11, 0.016), (12, -0.034), (13, -0.065), (14, -0.029), (15, -0.144), (16, -0.195), (17, -0.022), (18, -0.012), (19, 0.039), (20, 0.018), (21, -0.008), (22, 0.041), (23, -0.003), (24, -0.035), (25, -0.092), (26, 0.015), (27, 0.049), (28, -0.121), (29, -0.077), (30, 0.018), (31, -0.067), (32, -0.001), (33, 0.1), (34, 0.013), (35, -0.069), (36, -0.021), (37, -0.162), (38, 0.059), (39, -0.018), (40, 0.013), (41, 0.071), (42, 0.03), (43, 0.07), (44, -0.126), (45, -0.015), (46, 0.005), (47, 0.031), (48, -0.093), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.93744832 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
Author: Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, Santiago Fernández
Abstract: In online handwriting recognition the trajectory of the pen is recorded during writing. Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. However these techniques require considerable human effort, and are specific to particular languages and alphabets. This paper describes a system capable of directly transcribing raw online handwriting data. The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases. 1
2 0.66058654 9 nips-2007-A Probabilistic Approach to Language Change
Author: Alexandre Bouchard-côté, Percy Liang, Dan Klein, Thomas L. Griffiths
Abstract: We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. This framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies. 1
3 0.60578489 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
Author: Fred Richardson, William M. Campbell
Abstract: Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive sequences, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.
4 0.57179916 133 nips-2007-Modelling motion primitives and their timing in biologically executed movements
Author: Ben Williams, Marc Toussaint, Amos J. Storkey
Abstract: Biological movement is built up of sub-blocks or motion primitives. Such primitives provide a compact representation of movement which is also desirable in robotic control applications. We analyse handwriting data to gain a better understanding of primitives and their timings in biological movements. Inference of the shape and the timing of primitives can be done using a factorial HMM based model, allowing the handwriting to be represented in primitive timing space. This representation provides a distribution of spikes corresponding to the primitive activations, which can also be modelled using HMM architectures. We show how the coupling of the low level primitive model, and the higher level timing model during inference can produce good reconstructions of handwriting, with shared primitives for all characters modelled. This coupled model also captures the variance profile of the dataset which is accounted for by spike timing jitter. The timing code provides a compact representation of the movement while generating a movement without an explicit timing model produces a scribbling style of output. 1
5 0.51025873 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
Author: Noah Goodman, Joshua B. Tenenbaum, Michael J. Black
Abstract: For infants, early word learning is a chicken-and-egg problem. One way to learn a word is to observe that it co-occurs with a particular referent across different situations. Another way is to use the social context of an utterance to infer the intended referent of a word. Here we present a Bayesian model of cross-situational word learning, and an extension of this model that also learns which social cues are relevant to determining reference. We test our model on a small corpus of mother-infant interaction and find it performs better than competing models. Finally, we show that our model accounts for experimental phenomena including mutual exclusivity, fast-mapping, and generalization from social cues. To understand the difficulty of an infant word-learner, imagine walking down the street with a friend who suddenly says “dax blicket philbin na fivy!” while at the same time wagging her elbow. If you knew any of these words you might infer from the syntax of her sentence that blicket is a novel noun, and hence the name of a novel object. At the same time, if you knew that this friend indicated her attention by wagging her elbow at objects, you might infer that she intends to refer to an object in a nearby show window. On the other hand if you already knew that “blicket” meant the object in the window, you might be able to infer these elements of syntax and social cues. Thus, the problem of early word-learning is a classic chicken-and-egg puzzle: in order to learn word meanings, learners must use their knowledge of the rest of language (including rules of syntax, parts of speech, and other word meanings) as well as their knowledge of social situations. But in order to learn about the facts of their language they must first learn some words, and in order to determine which cues matter for establishing reference (for instance, pointing and looking at an object but normally not waggling your elbow) they must first have a way to know the intended referent in some situations. For theories of language acquisition, there are two common ways out of this dilemma. The first involves positing a wide range of innate structures which determine the syntax and categories of a language and which social cues are informative. (Though even when all of these elements are innately determined using them to learn a language from evidence may not be trivial [1].) The other alternative involves bootstrapping: learning some words, then using those words to learn how to learn more. This paper gives a proposal for the second alternative. We first present a Bayesian model of how learners could use a statistical strategy—cross-situational word-learning—to learn how words map to objects, independent of syntactic and social cues. We then extend this model to a true bootstrapping situation: using social cues to learn words while using words to learn social cues. Finally, we examine several important phenomena in word learning: mutual exclusivity (the tendency to assign novel words to novel referents), fast-mapping (the ability to assign a novel word in a linguistic context to a novel referent after only a single use), and social generalization (the ability to use social context to learn the referent of a novel word). Without adding additional specialized machinery, we show how these can be explained within our model as the result of domain-general probabilistic inference mechanisms operating over the linguistic domain. 
6 0.49742988 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
7 0.49700412 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
8 0.47776479 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
9 0.45089018 84 nips-2007-Expectation Maximization and Posterior Constraints
10 0.44415259 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
11 0.43061116 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
12 0.3895542 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
13 0.36606398 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
14 0.3513917 197 nips-2007-The Infinite Markov Model
15 0.34571686 29 nips-2007-Automatic Generation of Social Tags for Music Recommendation
16 0.33745554 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes
17 0.33341524 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
18 0.32931352 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
19 0.32297367 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
20 0.32172817 22 nips-2007-Agreement-Based Learning
topicId topicWeight
[(5, 0.026), (13, 0.082), (16, 0.046), (18, 0.021), (19, 0.012), (21, 0.033), (31, 0.014), (34, 0.011), (35, 0.028), (47, 0.098), (49, 0.375), (83, 0.08), (85, 0.026), (87, 0.028), (90, 0.045)]
simIndex simValue paperId paperTitle
1 0.8295123 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang
Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n2 ) to O(np/m), and improves computation time to O(np2 /m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.
2 0.82609046 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
Author: Moulines Eric, Francis R. Bach, Zaïd Harchaoui
Abstract: We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided. 1
same-paper 3 0.80176097 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
Author: Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, Santiago Fernández
Abstract: In online handwriting recognition the trajectory of the pen is recorded during writing. Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. However these techniques require considerable human effort, and are specific to particular languages and alphabets. This paper describes a system capable of directly transcribing raw online handwriting data. The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases. 1
4 0.73688722 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor
Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1
5 0.47250351 7 nips-2007-A Kernel Statistical Test of Independence
Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola
Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
6 0.44306314 43 nips-2007-Catching Change-points with Lasso
7 0.44011417 185 nips-2007-Stable Dual Dynamic Programming
8 0.43942079 108 nips-2007-Kernel Measures of Conditional Dependence
9 0.43819085 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
10 0.43516728 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
11 0.42720011 40 nips-2007-Bundle Methods for Machine Learning
12 0.42161363 24 nips-2007-An Analysis of Inference with the Universum
13 0.41986421 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
14 0.41819155 195 nips-2007-The Generalized FITC Approximation
15 0.41712481 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
16 0.41475731 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
17 0.41443503 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
18 0.41271794 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
19 0.41223273 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
20 0.4105058 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI