jmlr jmlr2011 jmlr2011-68 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
Reference: text
sentIndex sentText sentNum sentScore
1 1 Part-Of-Speech Tagging POS aims at labeling each word with a unique tag that indicates its syntactic role, for example, plural noun, adverb, etc. [sent-64, score-0.416]
2 2 Chunking Also called shallow parsing, chunking aims at labeling segments of a sentence with syntactic constituents such as noun or verb phrases (NP or VP). [sent-112, score-0.583]
3 Each SVM was trained in a pairwise classification manner, and fed with a window around the word of interest containing POS and words as features, as well as surrounding tags. [sent-124, score-0.527]
4 They use POS features coming from an external tagger, as well as carefully hand-crafted specialization features which again change the data representation by concatenating some (carefully chosen) chunk tags or some words with their POS representation. [sent-140, score-0.445]
5 As in the chunking task, each word is assigned a tag prefixed by an indicator of the beginning or the inside of an entity. [sent-144, score-0.448]
6 In the PropBank (Palmer et al., 2005) formalism one assigns roles ARG0-5 to words that are arguments of a verb (or more technically, a predicate) in the sentence; for example, the following sentence might be tagged “[John]ARG0 [ate]REL [the apple]ARG1 ”, where “ate” is the predicate. [sent-178, score-0.389]
7 (2004) take these base features and define additional features, notably the part-of-speech tag of the head word, the predicted named entity class of the argument, features providing word sense disambiguation for the verb (they add 25 variants of 12 new feature types overall). [sent-189, score-0.496]
8 The traditional NLP approach is: extract from the sentence a rich set of hand-designed features which are then fed to a standard classification algorithm, for example, a Support Vector Machine (SVM), often with a linear kernel. [sent-236, score-0.393]
9 The architecture takes the input sentence and learns several layers of feature extraction that process the inputs. [sent-271, score-0.423]
10 However, the first layer of our network maps each of these word indices into a feature vector, by a lookup table operation. [sent-327, score-0.69]
11 Given a task of interest, a relevant representation of each word is then given by the corresponding lookup table feature vector, which is trained by backpropagation, starting from a random initialization. [sent-328, score-0.575]
12 Our architecture allows us to take advantage of better trained word representations, by simply initializing the word lookup table with these representations (instead of randomly). [sent-330, score-0.850]
13 Given a sentence or any sequence of T words $[w]_1^T$ in D, the lookup table layer applies the same operation for each word in the sequence, producing the following output matrix: $LT_W([w]_1^T) = \big( \langle W \rangle^1_{[w]_1} \;\; \langle W \rangle^1_{[w]_2} \;\; \cdots \;\; \langle W \rangle^1_{[w]_T} \big)$ (1). [sent-332, score-0.929]
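To make the lookup operation concrete, here is a minimal NumPy sketch (dimensions and variable names are illustrative assumptions, not the authors' implementation): each word index selects one column of a trainable matrix W of size d_wrd x |D|, and a sequence of T indices yields a d_wrd x T output matrix as in (1).

    import numpy as np

    rng = np.random.default_rng(0)
    d_wrd, vocab_size = 50, 100_000
    # Trainable word lookup table W (d_wrd x |D|), initialized randomly and
    # later updated by backpropagation.
    W = rng.uniform(-0.01, 0.01, size=(d_wrd, vocab_size))

    def lookup_table(W, word_indices):
        # Return the d_wrd x T output matrix LT_W([w]_1^T): column t is the
        # feature vector of the t-th word of the sequence.
        return W[:, word_indices]

    sentence_indices = [42, 7, 1337, 7]            # indices into the dictionary D
    features = lookup_table(W, sentence_indices)   # shape (50, 4)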
14 For a word w, a feature vector of dimension $d_{wrd} = \sum_k d^k_{wrd}$ is then obtained by concatenating all lookup table outputs: $LT_{W^1,\dots,W^K}(w) = \big( LT_{W^1}(w^1);\; \dots;\; LT_{W^K}(w^K) \big)$. [sent-351, score-0.528]
15 The matrix output of the lookup table layer for a sequence of words $[w]_1^T$ is then similar to (1), but where extra rows have been added for each discrete feature. [sent-361, score-0.451]
16 3 Extracting Higher Level Features from Word Feature Vectors Feature vectors produced by the lookup table layer need to be combined in subsequent layers of the neural network to produce a tag decision for each word in the sentence. [sent-380, score-0.871]
17 Producing tags for each element in variable length sequences (here, a sentence is a sequence of words) is a standard problem in machine-learning. [sent-381, score-0.408]
18 We consider two common approaches which tag one word at a time: a window approach, and a (convolutional) sentence approach. [sent-382, score-0.677]
19 1 Window Approach A window approach assumes the tag of a word depends mainly on its neighboring words. [sent-385, score-0.407]
20 Each word in the window is first passed through the lookup table layer (1) or (2), producing a matrix of word features of fixed size $d_{wrd} \times k_{sz}$. [sent-387, score-0.988]
21 More formally, the word feature window given by the first network layer can be written as: $f_\theta^1 = \big( \langle W \rangle^1_{[w]_{t-d_{win}/2}};\; \dots;\; \langle W \rangle^1_{[w]_t};\; \dots;\; \langle W \rangle^1_{[w]_{t+d_{win}/2}} \big)$. [sent-389, score-0.542]
22 Finally, the output size of the last layer L of our network is equal to the number of possible tags for the task of interest. [sent-406, score-0.411]
23 To circumvent this problem, we augment the sentence with a special “PADDING” word replicated $d_{win}/2$ times at the beginning and the end. [sent-409, score-0.589]
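A minimal sketch of the window extraction with PADDING, assuming a hypothetical padding index and d_win = 5; it only illustrates the indexing described above, not the authors' code.

    def word_window(word_indices, t, d_win, padding_index):
        # Augment the sentence with the PADDING word replicated d_win // 2 times
        # at the beginning and the end, then take the d_win indices centered on
        # the original position t.
        half = d_win // 2
        padded = [padding_index] * half + list(word_indices) + [padding_index] * half
        return padded[t:t + d_win]

    # Example with d_win = 5 and PADDING at dictionary index 0:
    print(word_window([12, 5, 9, 31], t=0, d_win=5, padding_index=0))
    # -> [0, 0, 12, 5, 9]  (the word at position 0 sits in the center)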
24 It successively takes the complete sentence, passes it through the lookup table layer (1), produces local features around each word of the sentence thanks to convolutional layers, combines these features into a global feature vector which can then be fed to standard affine layers (4). [sent-420, score-1.135]
25 In the semantic role labeling case, this operation is performed for each word in the sentence, and for each verb in the sentence. [sent-421, score-0.392]
26 It is thus necessary to encode in the network architecture which verb we are considering in the sentence, and which word we want to tag. [sent-422, score-0.431]
27 For that purpose, each word at position i in the sentence is augmented with two features in the way described in Section 3. [sent-423, score-0.516]
28 These features encode the relative distances $i - pos_v$ and $i - pos_w$ with respect to the chosen verb at position $pos_v$, and to the word to tag at position $pos_w$, respectively. [sent-426, score-0.414]
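A small sketch of these two extra features (function and variable names are hypothetical): every position i receives its signed distances to the verb position pos_v and to the word being tagged at pos_w, and each distance value is then mapped to its own embedding through an additional lookup table.

    def relative_position_features(sentence_length, pos_v, pos_w):
        # For each position i, the two extra discrete features i - pos_v and
        # i - pos_w; each value indexes a small additional lookup table.
        return [(i - pos_v, i - pos_w) for i in range(sentence_length)]

    # Tagging word 0 with respect to the verb at position 3:
    print(relative_position_features(5, pos_v=3, pos_w=0))
    # -> [(-3, 0), (-2, 1), (-1, 2), (0, 3), (1, 4)]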
29 A convolutional layer can be seen as a generalization of a window approach: given a sequence represented by columns in a matrix $f_\theta^{l-1}$ (in our case the lookup table matrix (1)), a matrix-vector operation as in (4) is applied to each window of successive columns in the sequence. [sent-428, score-0.65]
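The following sketch applies the same affine map to every window of d_win successive columns, which is the convolution described here; the shapes and the absence of padding are simplifying assumptions for illustration.

    import numpy as np

    def convolutional_layer(F, M, b, d_win):
        # F: n_{l-1} x T matrix of input columns (e.g., the lookup table output).
        # The same parameters (M, b) are applied to every window of d_win
        # successive columns, giving one output column per window position.
        n_in, T = F.shape
        cols = []
        for t in range(T - d_win + 1):
            window = F[:, t:t + d_win].reshape(-1, order="F")  # stack the columns
            cols.append(M @ window + b)
        return np.stack(cols, axis=1)

    rng = np.random.default_rng(0)
    F = rng.standard_normal((50, 7))        # 7 word feature vectors of size 50
    M = rng.standard_normal((300, 50 * 3))  # hypothetical n_hu = 300, d_win = 3
    b = np.zeros(300)
    print(convolutional_layer(F, M, b, d_win=3).shape)  # (300, 5)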
30 [Figure residue: the rotated per-word labels spell the example sentence “The proposed changes also would allow executives to report exercises of options later and less often.”; the plotted values are not recoverable.] [sent-430, score-2.909]
31 We consider a sentence approach network (Figure 2) trained for SRL. [sent-432, score-0.437]
32 It is interesting to see that the network catches features mostly around the verb of interest (here “report”) and word of interest (“proposed” (left) or “often” (right)). [sent-435, score-0.396]
33 The size of the output (6) depends on the number of words in the sentence fed to the network. [sent-441, score-0.407]
34 Local feature vectors extracted by the convolutional layers have to be combined to obtain a global feature vector, with a fixed size independent of the sentence length, in order to apply subsequent standard affine layers. [sent-442, score-0.405]
35 ) The average operation does not make much sense in our case, as in general most words in the sentence do not have any influence on the semantic role of a given word to tag. [sent-445, score-0.591]
36 Given a matrix $f_\theta^{l-1}$ output by a convolutional layer $l-1$, the Max layer $l$ outputs a vector $f_\theta^l$: $[f_\theta^l]_i = \max_t \, [f_\theta^{l-1}]_{i,t}, \quad 1 \le i \le n^{l-1}_{hu}$. [sent-447, score-0.389]
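A one-line sketch of the Max layer: for every feature (row), take the maximum over all positions t, so the output size no longer depends on the sentence length.

    import numpy as np

    def max_layer(F):
        # F: n_{l-1} x T matrix produced by the convolutional layer; return the
        # n_{l-1} vector of per-feature maxima over the positions t.
        return F.max(axis=1)

    F = np.array([[1.0, -2.0, 3.0],
                  [0.5,  4.0, -1.0]])
    print(max_layer(F))  # [3. 4.]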
37 Each word in a segment labeled “X” is tagged with a prefixed label, depending on the word position in the segment (begin, inside, end). [sent-453, score-0.416]
38 In the window approach, these tags apply to the word located in the center of the window. [sent-460, score-0.444]
39 In the (convolutional) sentence approach, these tags apply to the word designated by additional markers in the network input. [sent-461, score-0.699]
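A sketch of this IOBES-style encoding of labeled segments into per-word tags; the exact tag strings (S-, B-, I-, E-, O) are an assumption about the scheme, not code from the paper.

    def iobes_tags(segments):
        # segments: (label, length) pairs covering the sentence, with label None
        # for words outside any segment. Each word receives a prefixed tag that
        # depends on its position in the segment.
        tags = []
        for label, length in segments:
            if label is None:
                tags.extend(["O"] * length)
            elif length == 1:
                tags.append("S-" + label)
            else:
                tags.extend(["B-" + label]
                            + ["I-" + label] * (length - 2)
                            + ["E-" + label])
        return tags

    print(iobes_tags([("NP", 1), ("VP", 3), (None, 1)]))
    # -> ['S-NP', 'B-VP', 'I-VP', 'E-VP', 'O']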
40 1 Word-Level Log-Likelihood In this approach, each word in a sentence is considered independently. [sent-487, score-0.478]
41 While this training criterion (11), often referred to as cross-entropy, is widely used for classification problems, it might not be ideal in our case, where there is often a correlation between the tag of a word in a sentence and its neighboring tags. [sent-492, score-0.616]
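A minimal sketch of this word-level criterion (variable names are illustrative): the log-probability of the correct tag is the network score of that tag minus a log-add (softmax normalizer) over all tags.

    import numpy as np

    def word_level_log_likelihood(scores, tag):
        # scores: network scores [f_theta]_j for each possible tag of one word.
        # Returns log p(tag | x) = scores[tag] - logadd_j scores[j].
        return scores[tag] - np.logaddexp.reduce(scores)

    scores = np.array([2.0, 0.5, -1.0])
    print(word_level_log_likelihood(scores, tag=0))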
42 2 Sentence-Level Log-Likelihood In tasks like chunking, NER or SRL we know that there are dependencies between word tags in a sentence: not only are tags organized in chunks, but some tags cannot follow other tags. [sent-496, score-0.665]
43 For the sentence $[x]_1^T$ and network parameters θ, we introduce a transition score $[A]_{i,j}$ for jumping from tag $i$ to tag $j$ in successive words, and an initial score $[A]_{i,0}$ for starting from the $i$-th tag. [sent-502, score-0.482]
44 The score of a sentence $[x]_1^T$ along a path of tags $[i]_1^T$ is then given by the sum of transition scores and network scores: $s([x]_1^T, [i]_1^T, \tilde\theta) = \sum_{t=1}^{T} \big( [A]_{[i]_{t-1},[i]_t} + [f_\theta]_{[i]_t, t} \big)$. [sent-504, score-0.528]
45 At inference time, given a sentence $[x]_1^T$ to tag, we have to find the best tag path which maximizes the sentence score (12). [sent-510, score-0.678]
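A compact sketch of the path score (12) and of a Viterbi-style dynamic program that returns the score-maximizing tag path; the array shapes (T words, K tags) and names are assumptions for illustration, not the authors' C implementation.

    import numpy as np

    def path_score(f, A, A0, tags):
        # f: T x K network scores [f_theta]_{i,t} (rows indexed by word position),
        # A: K x K transition scores [A]_{i,j}, A0: K initial scores [A]_{i,0},
        # tags: a tag path [i]_1^T.
        s = A0[tags[0]] + f[0, tags[0]]
        for t in range(1, len(tags)):
            s += A[tags[t - 1], tags[t]] + f[t, tags[t]]
        return s

    def viterbi(f, A, A0):
        # Dynamic programming over tag paths: keep, for every tag, the best score
        # of a path ending in that tag, then backtrack the argmax decisions.
        T, K = f.shape
        delta = A0 + f[0]
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + A + f[t][None, :]   # (prev tag) x (cur tag)
            back[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0)
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return list(reversed(path))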
46 Non-differentiable points arise because we use a “hard” transfer function (5) and because we use a “max” layer (7) in the sentence approach network. [sent-536, score-0.423]
47 We employed only two of them: the initialization and update of the parameters of each network layer were done according to the “fan-in” of the layer, that is the number of inputs used to compute each output of this layer (Plaut and Hinton, 1987). [sent-549, score-0.389]
48 The fan-in for the lookup table (1), the $l$-th linear layer (4) and the convolution layer (6) are respectively 1, $n^{l-1}_{hu}$ and $d_{win} \times n^{l-1}_{hu}$. [sent-550, score-0.663]
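A hedged sketch of fan-in-driven initialization: weights are drawn from a centered uniform distribution whose width shrinks with the fan-in. The 1/sqrt(fan_in) scaling below is a common choice and only approximates the recipe attributed to Plaut and Hinton (1987); per-layer learning rates can likewise be scaled by the fan-in.

    import numpy as np

    def init_linear_layer(n_out, fan_in, rng):
        # Centered uniform initialization scaled by the fan-in (the number of
        # inputs used to compute each output of the layer). The exact scaling is
        # an assumption, not the paper's precise constant.
        bound = 1.0 / np.sqrt(fan_in)
        W = rng.uniform(-bound, bound, size=(n_out, fan_in))
        b = np.zeros(n_out)
        return W, b

    rng = np.random.default_rng(0)
    # e.g., a convolution layer seeing d_win = 3 windows of n_{l-1} = 50 features:
    W, b = init_linear_layer(n_out=300, fan_in=3 * 50, rng=rng)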
49 The SRL task was trained using the sentence approach (Section 3. [sent-557, score-0.391]
50 Additionally, all occurrences of sequences of numbers within a word are replaced with the string “NUMBER”, so for example both the words “PS1” and “PS2” would map to the single word “psNUMBER”. [sent-596, score-0.468]
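A small sketch of this preprocessing step, assuming (as the example suggests) that words are also lowercased before digit sequences are replaced.

    import re

    def normalize_numbers(word):
        # Replace every maximal run of digits inside a (lowercased) word by the
        # string "NUMBER", so "PS1" and "PS2" both become "psNUMBER".
        return re.sub(r"[0-9]+", "NUMBER", word.lower())

    print(normalize_numbers("PS1"), normalize_numbers("PS2"))  # psNUMBER psNUMBER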
51 The capacity of our network architectures lies mainly in the word lookup table, which contains 50 × 100,000 parameters to train. [sent-604, score-0.502]
52 Table 6: Word embeddings in the word lookup table of an SRL neural network trained from scratch, with a dictionary of size 100,000. [Stray table cells: NAMSOS, SHIRT, MAHAN, NILGIRIS.] [sent-609, score-0.821]
53 Ideally, we would like semantically similar words to be close in the embedding space represented by the word lookup table: by continuity of the neural network function, tags produced on semantically similar sentences would be similar. [sent-612, score-0.692]
54 We will focus in the next section on improving these word embeddings by leveraging unlabeled data. [sent-614, score-0.399]
55 Lots of Unlabeled Data We would like to obtain word embeddings carrying more syntactic and semantic information than shown in Table 6. [sent-626, score-0.464]
56 We then use these improved embeddings to initialize the word lookup tables of the networks described in Section 3. [sent-641, score-0.606]
57 Their goal was however to perform well on some tagging task on fully unsupervised data, rather than obtaining generic word embeddings useful for other tasks. [sent-715, score-0.466]
58 In our case, possible parameters to adjust are: the learning rate λ, the word embedding dimension d, the number of hidden units $n^1_{hu}$, and the input window size $d_{win}$. [sent-731, score-0.487]
59 4 Embeddings Both networks produce much more appealing word embeddings than in Section 3. [sent-749, score-0.395]
60 Our approach simply consists of initializing the word lookup tables of the supervised networks with the embeddings computed by the language models. [sent-772, score-0.708]
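A sketch of this initialization with hypothetical data structures: columns of the task network's word lookup table are copied from the language-model embeddings when the word is in the language-model dictionary, and left randomly initialized otherwise (the fallback is an assumption).

    import numpy as np

    def init_from_language_model(lm_embeddings, lm_word_to_idx, task_dictionary,
                                 d_wrd, rng):
        # lm_embeddings: d_wrd x |D_lm| matrix learned by the language model.
        # task_dictionary: list of words used by the supervised network.
        W = rng.uniform(-0.01, 0.01, size=(d_wrd, len(task_dictionary)))
        for j, word in enumerate(task_dictionary):
            if word in lm_word_to_idx:
                W[:, j] = lm_embeddings[:, lm_word_to_idx[word]]
        return W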
61 Starting from a couple of elementary sentence forms, sentences are described by the successive application of sentence transformation operators. [sent-827, score-0.54]
62 ” These gradings are then used to compare sentence forms: “It now turns out that, given the graded n-tuples of words for a particular sentence form, we can find other sentences forms of the same word classes in which the same n-tuples of words produce the same grading of sentences. [sent-835, score-0.852]
63 On the other hand, the structure of our language models is probably too restrictive for such goals, and our current approach only exploits the word embeddings discovered during training. [sent-839, score-0.454]
64 As we mentioned earlier, SRL can be trained only with the sentence approach network, due to long-range dependencies related to the verb predicate. [sent-881, score-0.421]
65 We thus performed additional experiments, where all four tasks were trained using the sentence approach network. [sent-882, score-0.397]
66 The parameters of the first linear layers (4) were shared in the window approach case (see Figure 5), and the first convolution layer parameters (6) were shared in the sentence approach networks. [sent-884, score-0.601]
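The parameter sharing can be sketched as follows: the shared lookup table and first layer are literally the same arrays for every task, while each task keeps its own output layer (the dimensions and tag-set sizes below are placeholders, not values from the paper).

    import numpy as np

    rng = np.random.default_rng(0)
    # Shared parameters (same objects for all tasks): word lookup table plus the
    # first linear layer acting on a window of 5 words of dimension 50.
    shared = {
        "lookup": rng.uniform(-0.01, 0.01, size=(50, 100_000)),
        "W1": 0.01 * rng.standard_normal((300, 50 * 5)),
        "b1": np.zeros(300),
    }
    # Task-specific output layers; tag-set sizes are illustrative placeholders.
    heads = {task: {"W2": 0.01 * rng.standard_normal((n_tags, 300)),
                    "b2": np.zeros(n_tags)}
             for task, n_tags in [("pos", 45), ("chunk", 23), ("ner", 17)]}
    # Joint training alternates examples from the tasks; every update touches
    # `shared` plus the head of the task the example came from.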
67 We used the same architecture as SRL for the sentence approach network. [sent-886, score-0.413]
68 It is worth mentioning that MTL can produce a single unified network that performs well for all these tasks using the sentence approach. [sent-903, score-0.396]
69 However this unified network only leads to marginal improvements over using a separate network for each task: the most important MTL task appears to be the unsupervised learning of the word embeddings. [sent-904, score-0.411]
70 The baseline results in Table 9 also show that using the sentence approach for the POS, Chunking, and NER tasks yields no performance improvement (or degradation) over the window approach. [sent-906, score-0.411]
71 We trained POS, CHUNK and NER in a MTL way, both for the window and sentence network approaches. [sent-933, score-0.535]
72 As a baseline, we show previous results of our window approach system, as well as additional results for our sentence approach system, when trained separately on each task. [sent-935, score-0.452]
73 The POS network was trained with two-character word suffixes; the NER network was trained using the small CoNLL 2003 gazetteer; the CHUNK and NER networks were trained with additional POS features; and finally, the SRL network was trained with additional CHUNK features. [sent-952, score-0.836]
74 We trained a NER network with 4 additional word features indicating (feature “on” or “off”) whether the word is found in the gazetteer under one of these four categories. [sent-973, score-0.695]
75 If a sentence chunk is found in the gazetteer, then all words in the chunk have their corresponding gazetteer feature turned to “on”. [sent-975, score-0.754]
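A naive sketch of the chunk-matching step (exact matching after lowercasing is an assumption): whenever an n-gram of the sentence appears in one gazetteer category, the corresponding binary feature of every word in that n-gram is switched on.

    def gazetteer_feature(words, gazetteer_chunks):
        # words: the sentence tokens; gazetteer_chunks: a set of word tuples for
        # one gazetteer category. Returns one 0/1 feature per word.
        feature = [0] * len(words)
        lowered = [w.lower() for w in words]
        for chunk in gazetteer_chunks:
            n = len(chunk)
            for start in range(len(words) - n + 1):
                if tuple(lowered[start:start + n]) == chunk:
                    for i in range(start, start + n):
                        feature[i] = 1
        return feature

    locations = {("new", "york"), ("paris",)}
    print(gazetteer_feature("He visited New York yesterday".split(), locations))
    # -> [0, 0, 1, 1, 0]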
76 A plausible explanation of this large boost over the network using only the language model is that gazetteers include word chunks, while we use only the word representation of our language model. [sent-977, score-0.703]
77 We also report results obtained for the SRL task by adding word features representing the CHUNK tags (also provided by the CoNLL challenge). [sent-990, score-0.421]
78 Figure 6: Charniak parse tree for the sentence “The luxury auto maker last year sold 1,214 cars in the U.S.” [Figure residue: IOBES chunk tags (b-vp i-vp ... e-vp) omitted.] [sent-1046, score-0.451]
79 Considering that a node in a syntactic parse tree assigns a label to a segment of the parsed sentence, we propose a way to feed (partially) this labeled segmentation to our network, through additional lookup tables. [sent-1058, score-0.443]
80 Each of these lookup tables encodes labeled segments of each parse tree level (up to a certain depth). [sent-1059, score-0.392]
81 The lookup table for level 0 encodes the corresponding IOBES phrase tags for each word. [sent-1067, score-0.412]
82 Experiments were performed using the LM2 language model using the same network architectures (see Table 5) and using additional lookup tables of dimension 5 for each parse tree level. [sent-1070, score-0.577]
83 It is surprising to observe that adding chunking features into the semantic role labeling network performs significantly worse than adding features describing the level 0 of the Charniak parse tree (Table 12). [sent-1111, score-0.596]
84 However, the parse trees identify leaf sentence segments that are often smaller than those identified by the chunking tags, as shown by Hollingshead et al. [sent-1113, score-0.556]
85 As an example from Hollingshead et al. (2005), consider the sentence and chunk labels “(NP They) (VP are starting to buy) (NP growth stocks)”. [sent-1121, score-0.449]
86 6 Word Representations We have described how we induced useful word embeddings by applying our architecture to a language modeling task trained using a large amount of unlabeled text data. [sent-1126, score-0.695]
87 However, word representations are perhaps more commonly inferred from n-gram language modeling rather than purely continuous language models. [sent-1130, score-0.443]
88 While a comparison of all these word representations is beyond the scope of this paper, it is rather fair to question the quality of our word embeddings compared to a popular NLP approach. [sent-1138, score-0.591]
89 In this section, we report a comparison of our word embeddings against Brown clusters, when used as features into our neural network architecture. [sent-1139, score-0.473]
90 While we always picked the sentence approach for SRL, we had to consider the window approach in this particular convex setup, as the sentence approach network (see Figure 2) includes a Max layer. [sent-1144, score-0.721]
91 Table 13: Generalization performance of our neural network architecture trained with our language model (LM2) word embeddings, and with the word representations derived from the Brown Clusters. [sent-1191, score-0.789]
92 Additionally, POS uses a word suffix of size 2 as a feature, CHUNK is fed with POS features, NER uses the CoNLL 2003 gazetteer, and SRL is fed with levels 1–5 of the Charniak parse tree, as well as a verb position feature. [sent-1193, score-0.592]
93 In the non-convex experiments, we fed these four Brown Cluster features to our architecture using four different lookup tables, replacing our word lookup table. [sent-1203, score-0.826]
94 The PT0 task replicates the sentence segmentation of the parse tree leaves. [sent-1231, score-0.488]
95 The corresponding benchmark score measures the quality of the Charniak parse tree leaves relative to the Penn Treebank gold parse trees. [sent-1232, score-0.416]
96 This type of comparison should be taken with care, as combining a given feature with different word representations might not have the same effect on each word representation. [sent-1233, score-0.447]
97 Secondly, they predict the correctness of the final word of each window instead of the center word (Turian et al., 2010). [sent-1246, score-0.514]
98 1 Lookup Table Layer Given a matrix of parameters $\theta^1 = W^1$ and word (or discrete feature) indices $[w]_1^T$, the layer outputs the matrix: $f_\theta^l([w]_1^T) = \big( \langle W^1 \rangle_{[w]_1} \;\; \langle W^1 \rangle_{[w]_2} \;\; \cdots \;\; \langle W^1 \rangle_{[w]_T} \big)$. [sent-1345, score-0.389]
99 The gradients of the weights $\langle W^1 \rangle_i$ are given by: $\frac{\partial C}{\partial \langle W^1 \rangle_i} = \sum_{\{1 \le t \le T \,/\, [w]_t = i\}} \big\langle \frac{\partial C}{\partial f_\theta^l} \big\rangle_t$. This sum equals zero if the index $i$ in the lookup table does not correspond to a word in the sequence. [sent-1349, score-0.490]
100 7 Sentence-Level Log-Likelihood The network outputs a matrix where each element $[f_\theta]_{i,t}$ gives a score for tag $i$ at word $t$. [sent-1371, score-0.457]
wordName wordTfidf (topN-words)
[('srl', 0.3), ('sentence', 0.27), ('pos', 0.265), ('lookup', 0.211), ('word', 0.208), ('chunk', 0.179), ('sll', 0.179), ('layer', 0.153), ('parse', 0.147), ('embeddings', 0.144), ('chunking', 0.139), ('ner', 0.138), ('tags', 0.138), ('nlp', 0.133), ('conll', 0.121), ('arlen', 0.116), ('avukcuoglu', 0.116), ('cratch', 0.116), ('eston', 0.116), ('lmost', 0.116), ('ollobert', 0.116), ('ottou', 0.116), ('uksa', 0.116), ('xx', 0.114), ('dwin', 0.111), ('charniak', 0.108), ('language', 0.102), ('tag', 0.101), ('rocessing', 0.099), ('window', 0.098), ('nn', 0.096), ('logadd', 0.089), ('ltw', 0.089), ('anguage', 0.088), ('fed', 0.085), ('trained', 0.084), ('network', 0.083), ('linguistics', 0.082), ('layers', 0.08), ('tagging', 0.077), ('gazetteer', 0.074), ('xxxx', 0.074), ('architecture', 0.073), ('hu', 0.07), ('vp', 0.068), ('verb', 0.067), ('parsing', 0.062), ('semantic', 0.061), ('wsj', 0.058), ('koomen', 0.058), ('toutanova', 0.058), ('brown', 0.057), ('labeling', 0.056), ('dictionary', 0.056), ('convolutional', 0.055), ('clogadd', 0.053), ('hardtanh', 0.053), ('words', 0.052), ('benchmark', 0.051), ('syntactic', 0.051), ('pwa', 0.047), ('xxx', 0.047), ('unlabeled', 0.047), ('entity', 0.044), ('tasks', 0.043), ('networks', 0.043), ('iobes', 0.042), ('features', 0.038), ('task', 0.037), ('score', 0.037), ('dwrd', 0.037), ('senna', 0.037), ('wll', 0.037), ('xxxxxx', 0.037), ('training', 0.037), ('gildea', 0.036), ('gradients', 0.036), ('xes', 0.036), ('acl', 0.036), ('table', 0.035), ('ranking', 0.035), ('tree', 0.034), ('meeting', 0.032), ('shen', 0.032), ('kudo', 0.032), ('matsumoto', 0.032), ('mtl', 0.032), ('pradhan', 0.032), ('trainable', 0.031), ('tagger', 0.031), ('representations', 0.031), ('np', 0.031), ('outputs', 0.028), ('phrase', 0.028), ('rare', 0.028), ('pre', 0.028), ('florian', 0.027), ('technologies', 0.027), ('breeding', 0.026), ('caps', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
2 0.14234102 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
3 0.12768428 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
4 0.10130812 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
5 0.096670456 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
Author: Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge
Abstract: Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the likelihood of deep belief networks which is computationally tractable and simple to apply in practice. Using this estimator, we quantitatively investigate a deep belief network for natural image patches and compare its performance to the performance of other models for natural image patches. We find that the deep belief network is outperformed with respect to the likelihood even by very simple mixture models. Keywords: deep belief network, restricted Boltzmann machine, likelihood estimation, natural image statistics, potential log-likelihood
6 0.076910429 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
7 0.076624565 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
8 0.07097739 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
9 0.067904018 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
10 0.056750238 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
11 0.048928704 55 jmlr-2011-Learning Multi-modal Similarity
12 0.044459045 91 jmlr-2011-The Sample Complexity of Dictionary Learning
13 0.03636793 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
14 0.035447229 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
15 0.034860071 50 jmlr-2011-LPmade: Link Prediction Made Easy
16 0.03385606 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
17 0.033123601 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
18 0.031804979 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
19 0.030799676 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
20 0.030596623 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
topicId topicWeight
[(0, 0.182), (1, -0.129), (2, -0.01), (3, -0.035), (4, -0.281), (5, 0.044), (6, 0.165), (7, -0.01), (8, 0.007), (9, -0.09), (10, -0.232), (11, -0.207), (12, -0.107), (13, -0.025), (14, 0.045), (15, 0.012), (16, 0.086), (17, 0.037), (18, -0.018), (19, 0.062), (20, -0.043), (21, 0.032), (22, -0.1), (23, -0.022), (24, 0.009), (25, -0.021), (26, -0.06), (27, -0.134), (28, -0.123), (29, 0.105), (30, 0.045), (31, 0.005), (32, 0.091), (33, 0.111), (34, 0.107), (35, 0.054), (36, 0.033), (37, -0.049), (38, 0.061), (39, 0.032), (40, -0.008), (41, -0.022), (42, 0.097), (43, 0.167), (44, 0.217), (45, 0.088), (46, -0.022), (47, 0.03), (48, 0.023), (49, 0.101)]
simIndex simValue paperId paperTitle
same-paper 1 0.96151149 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
2 0.62417364 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
3 0.54968131 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
4 0.51117796 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar
Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MICG ROUP) and multi-task feature selection (MIC-M ULTI). MIC-G ROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-M ULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, T RANS F EAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.1 Keywords: feature selection, minimum description length principle, multi-task learning
5 0.4185389 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
6 0.35029241 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
7 0.31359279 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
8 0.23830782 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
9 0.23382063 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
10 0.22038767 91 jmlr-2011-The Sample Complexity of Dictionary Learning
11 0.2193006 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
12 0.19861996 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
13 0.1968964 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
14 0.19159372 55 jmlr-2011-Learning Multi-modal Similarity
15 0.18722045 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
16 0.18709373 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
17 0.16422375 50 jmlr-2011-LPmade: Link Prediction Made Easy
18 0.15922435 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
19 0.15559484 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
20 0.15422392 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
topicId topicWeight
[(4, 0.036), (9, 0.034), (10, 0.039), (24, 0.041), (31, 0.064), (32, 0.026), (36, 0.01), (41, 0.024), (45, 0.016), (60, 0.043), (71, 0.012), (73, 0.027), (78, 0.041), (82, 0.023), (90, 0.038), (99, 0.426)]
simIndex simValue paperId paperTitle
same-paper 1 0.76729977 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
2 0.55686259 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
3 0.32027572 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
4 0.2632792 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
5 0.25257483 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
Author: Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge
Abstract: Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the likelihood of deep belief networks which is computationally tractable and simple to apply in practice. Using this estimator, we quantitatively investigate a deep belief network for natural image patches and compare its performance to the performance of other models for natural image patches. We find that the deep belief network is outperformed with respect to the likelihood even by very simple mixture models. Keywords: deep belief network, restricted Boltzmann machine, likelihood estimation, natural image statistics, potential log-likelihood
6 0.24522994 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
7 0.24233016 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
8 0.23843159 16 jmlr-2011-Clustering Algorithms for Chains
9 0.23643801 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
10 0.23570168 12 jmlr-2011-Bayesian Co-Training
11 0.23545757 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
12 0.23515564 91 jmlr-2011-The Sample Complexity of Dictionary Learning
13 0.23502697 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
14 0.23486429 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
15 0.23393816 59 jmlr-2011-Learning with Structured Sparsity
16 0.2333889 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
17 0.23203164 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
18 0.23163716 66 jmlr-2011-Multiple Kernel Learning Algorithms
19 0.23070948 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
20 0.22991818 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments