acl acl2010 acl2010-197 knowledge-graph by maker-knowledge-mining

197 acl-2010-Practical Very Large Scale CRFs


Source: pdf

Author: Thomas Lavergne ; Olivier Cappe ; Francois Yvon

Abstract: Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels. Even for the simple linear-chain model, taking structure into account implies a number of parameters and a computational effort that grows quadratically with the cardinality of the label set. In this paper, we address the issue of training very large CRFs, containing up to hundreds of output labels and several billion features. Efficiency stems here from the sparsity induced by the use of an ℓ1 penalty term. Based on our own implementation, we compare three recent proposals for implementing this regularization strategy. Our experiments demonstrate that very large CRFs can be trained efficiently and that very large models are able to improve the accuracy, while delivering compact parameter sets.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Practical very large scale CRFs Thomas Lavergne LIMSI CNRS lavergne@limsi.fr [sent-1, score-0.179]

2 – Abstract Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels. [sent-2, score-0.036]

3 Even for the simple linear-chain model, taking structure into account implies a number of parameters and a computational effort that grows quadratically with the cardinality of the label set. [sent-3, score-0.287]

4 In this paper, we address the issue of training very large CRFs, containing up to hundreds of output labels and several billion features. [sent-4, score-0.144]

5 Efficiency stems here from the sparsity induced by the use of an ℓ1 penalty term. [sent-5, score-0.358]

6 Based on our own implementation, we compare three recent proposals for implementing this regularization strategy. [sent-6, score-0.162]

7 Our experiments demonstrate that very large CRFs can be trained efficiently and that very large models are able to improve the accuracy, while delivering compact parameter sets. [sent-7, score-0.069]

8 An important property of CRFs is their ability to handle large and redundant feature sets and to integrate structural dependency between output labels. [sent-10, score-0.122]

9 – François Yvon Université Paris-Sud 11 LIMSI CNRS yvon@limsi.fr [sent-13, score-0.145]

10 – grows quadratically with respect to the number of output labels and so does the number of structural features, i.e. [sent-14, score-0.252]

11 Limiting the feature set or the number of output labels is however frustrating for many NLP tasks, where the type and number of potentially relevant features are very large. [sent-20, score-0.186]

12 (2006) propose to use a “sparse” version of the forward-backward algorithm during training, where sparsity is enforced through beam pruning. [sent-23, score-0.078]

13 (2004); by Cohn (2006), who considers “generalized” feature functions; and by Jeong et al. [sent-25, score-0.048]

14 (2009), who use approximations to simplify the forward-backward recursions. [sent-26, score-0.046]

15 In this paper, we show that the sparsity that is induced by ℓ1-penalized estimation of CRFs can be used to reduce the total training time, while yielding extremely compact models. [sent-27, score-0.172]

16 The benefits of sparsity are even greater during inference: fewer features need to be extracted and included in the potential functions, speeding up decoding with a smaller memory footprint. [sent-28, score-0.117]

17 We study and compare three different ways to implement the ℓ1 penalty for CRFs that have been introduced recently: orthant-wise quasi-Newton (Andrew and Gao, 2007), stochastic gradient descent (Tsuruoka et al. [sent-29, score-0.671]

18 , 2010), concluding that these methods have complementary strengths and weaknesses. (Footnote 1: In CRFsuite (Okazaki, 2007), it is even impossible to jointly test a pair of labels and a test on the observation; bigram features are only of the form f(yt−1, yt).) [sent-31, score-0.109]

19 c 2010 Association for Computational Linguistics, pages 504–513. [sent-34, score-0.037]

20 Our contribution is therefore twofold: firstly a detailed analysis of these three algorithms, discussing implementation, convergence and comparing the effect of various speed-ups. [sent-36, score-0.04]

21 Second, the experimental demonstration that using large output label sets is doable and that very large feature sets actually help improve prediction accuracy. [sent-38, score-0.123]

22 In addition, we show how sparsity in structured feature sets can be used in incremental training regimes, where long-range features are progressively incorporated in the model insofar as the shorter range features have proven useful. [sent-39, score-0.204]

23 The rest of the paper is organized as follows: we first recall the basics of CRFs in Section 2, and discuss three ways to train CRFs with an ℓ1 penalty in Section 3. [sent-40, score-0.409]

24 We then detail several implementation issues that need to be addressed when dealing with massive feature sets in Section 4. [sent-41, score-0.089]

25 2 Conditional Random Fields In this section, we recall the basics of Conditional Random Fields (CRFs) (Lafferty et al. [sent-44, score-0.094]

26 , 2001 ; Sutton and McCallum, 2006) and introduce the notations that will be used throughout. [sent-45, score-0.045]

27 , yT), {fk}1≤k≤K are the feature functions and {θk}1≤k≤K are the associated parameter values. [sent-53, score-0.191]

28 We denote by Y and X, respectively, the sets in which yt and xt take their values. [sent-54, score-0.515]

29 The most common choice of feature functions is to use binary tests. [sent-56, score-0.133]

30 In the sequel, we distinguish between two types of feature functions: unigram features fy,x, associated with parameters µy,x, and bigram features fy′,y,x, associated with parameters λy′,y,x. [sent-57, score-0.183]

31 These are defined as fy,x(yt−1, yt, xt) = 1(yt = y, xt = x) and fy′,y,x(yt−1, yt, xt) = 1(yt−1 = y′, yt = y, xt = x), where 1(cond. [sent-58, score-0.659]

32 In this setting, the number of parameters K is equal to |Y|² × |X|train, where |·| denotes the cardinality and |X|train refers to the number of configurations of xt observed during training. [sent-60, score-0.23]
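
To make the binary feature templates and the |Y|² × |X|train parameter count above concrete, here is a minimal Python sketch; the label set, observation symbols, and all function names are invented for the illustration and are not taken from the authors' implementation.

```python
# Hypothetical toy data: 3 labels and 3 observation symbols seen in training.
Y = ["B", "I", "O"]
X_train = ["the", "dog", "barks"]

def f_unigram(y_prev, y, x, y_ref, x_ref):
    """Unigram test f_{y,x}(y_{t-1}, y_t, x_t) = 1(y_t = y, x_t = x)."""
    return 1 if (y == y_ref and x == x_ref) else 0

def f_bigram(y_prev, y, x, yp_ref, y_ref, x_ref):
    """Bigram test f_{y',y,x}(y_{t-1}, y_t, x_t) = 1(y_{t-1} = y', y_t = y, x_t = x)."""
    return 1 if (y_prev == yp_ref and y == y_ref and x == x_ref) else 0

print(f_bigram("B", "I", "dog", "B", "I", "dog"))   # 1: this bigram test fires

# One bigram parameter lambda_{y',y,x} per (y', y, x) triple:
K_bigram = len(Y) ** 2 * len(X_train)   # |Y|^2 * |X|_train = 9 * 3 = 27 here
print(K_bigram)
# With hundreds of labels and ~10^5 observed symbols, |Y|^2 * |X|_train already
# reaches the billions of features mentioned in the abstract.
```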

33 Thus, even in moderate size applications, the number of parameters can be very large, mostly due to the introduction of sequential dependencies in the model. [sent-61, score-0.092]

34 This also explains why it is hard to train CRFs with dependencies spanning more than two adjacent labels. [sent-62, score-0.12]

35 Using only unigram features {fy,x}(y,x)∈Y×X results in a model equivalent to a simple bag-of-tokens position-by-position logistic regression model. [sent-63, score-0.039]

36 On the other hand, bigram features {fy′,y,x}(y′,y,x)∈Y²×X are helpful in modelling dependencies between successive labels. [sent-64, score-0.083]

37 The motivations for using simultaneously both types of feature functions are evaluated experimentally in Section 5. [sent-65, score-0.133]

38 an additional regularization term so as to avoid overfitting (see Section 3. [sent-68, score-0.214]

39 The gradient of l(θ) is ∂l(θ)/∂θk = Σ_{i=1..N} Σ_{t=1..T(i)} Epθ(y|x(i))[fk(yt−1, yt, xt(i))] − Σ_{i=1..N} Σ_{t=1..T(i)} fk(yt−1(i), yt(i), xt(i)) (4), where Epθ(y|x) denotes the conditional expectation given the observation sequence, i. [sent-70, score-0.287]

40 Epθ(y|x)[fk(yt−1, yt, xt(i))] = Σ_{(y′,y)∈Y²} fk(y′, y, xt) Pθ(yt−1 = y′, yt = y | x) (5) Although l(θ) is a smooth convex function, its optimum cannot be computed in closed form, and l(θ) has to be optimized numerically. [sent-72, score-0.099]

41 Computing the gradient requires repeatedly evaluating the conditional expectation in (5) for all input sequences and all positions t. [sent-73, score-0.292]

42 Then, Zθ(x) = Σy αT(y) and the pairwise probabilities Pθ(yt = y′, yt+1 = y|x) are given by αt(y′) exp(µy,xt+1 + λy′,y,xt+1) βt+1(y)/Zθ(x). These recursions require a number of operations that grows quadratically with |Y|. [sent-75, score-0.274]
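
As a rough illustration of the α/β recursions and pairwise marginals described above, the following self-contained Python sketch runs forward-backward on a toy chain; the random potentials stand in for exp(µy,x + λy′,y,x), and none of this is the paper's actual code.

```python
import numpy as np

# Toy linear-chain forward-backward (illustrative only, not the paper's code).
rng = np.random.default_rng(0)
T, L = 4, 3                                 # sequence length, |Y|
# psi[t, yp, y] plays the role of exp(mu_{y,x_t} + lambda_{yp,y,x_t}).
psi = rng.uniform(0.5, 1.5, size=(T, L, L))

alpha = np.zeros((T, L))
alpha[0] = psi[0, 0]                        # row 0 of psi[0] acts as a dummy "start" label
for t in range(1, T):
    alpha[t] = alpha[t - 1] @ psi[t]        # alpha_t(y) = sum_{y'} alpha_{t-1}(y') psi_t(y', y)

beta = np.ones((T, L))
for t in range(T - 2, -1, -1):
    beta[t] = psi[t + 1] @ beta[t + 1]      # beta_t(y') = sum_y psi_{t+1}(y', y) beta_{t+1}(y)

Z = alpha[-1].sum()                         # Z_theta(x) = sum_y alpha_T(y)

# Pairwise marginals P(y_t = y', y_{t+1} = y | x) = alpha_t(y') psi_{t+1}(y', y) beta_{t+1}(y) / Z
t = 1
pairwise = alpha[t][:, None] * psi[t + 1] * beta[t + 1][None, :] / Z
assert np.isclose(pairwise.sum(), 1.0)      # the L x L marginals sum to one
```

Each matrix-vector product above costs O(|Y|²) per position, which is the quadratic dependence on |Y| the sentence refers to.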

43 3.1 Regularization The standard approach for parameter estimation in CRFs consists in minimizing the logarithmic loss l(θ) defined by (3) with an additional ℓ2 penalty term (ρ2/2) ‖θ‖₂², where ρ2 is a regularization parameter. [sent-77, score-0.601]

44 The objective function is then a smooth convex function to be minimized over an unconstrained parameter space. [sent-78, score-0.24]

45 The only caveat is to avoid numerical optimizers that require the full Hessian matrix (e. [sent-81, score-0.093]

46 , Newton’s algorithm) due to the size of the parameter vector in usual applications of CRFs. [sent-83, score-0.069]

47 The most significant alternative to ℓ2 regularization is to use an ℓ1 penalty term ρ1 ‖θ‖₁: such regularizers are able to yield sparse parameter vectors in which many components have been zeroed (Tibshirani, 1996). [sent-84, score-0.6]

48 Using an ℓ1 penalty term thus implicitly performs feature selection, where ρ1 controls the amount of regularization and the number of extracted features. [sent-85, score-0.542]

49 However, the introduction of an ℓ1 penalty term makes the optimization of (6) more problematic, as the objective function is no longer differentiable at 0. [sent-87, score-0.477]

50 Various strategies have been proposed to handle this difficulty. [sent-88, score-0.036]

51 It amounts to reparameterizing θk as θk = θk+ − θk−, where θk+ and θk− are positive. [sent-93, score-0.037]

52 In this formulation, the objective function recovers its smoothness and can be optimized with conventional algorithms, subject to domain constraints. [sent-95, score-0.072]
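
Spelled out (my paraphrase of the standard reparameterization, not a formula quoted from the paper), the constrained smooth problem reads:

```latex
\min_{\theta^{+}\ge 0,\;\theta^{-}\ge 0}\; l(\theta^{+}-\theta^{-}) \;+\; \rho_1 \sum_{k=1}^{K}\bigl(\theta_k^{+}+\theta_k^{-}\bigr),
\qquad \theta_k = \theta_k^{+}-\theta_k^{-},\quad |\theta_k| = \theta_k^{+}+\theta_k^{-} \text{ at the optimum.}
```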

53 Optimization is straightforward, but the number of parameters is doubled and convergence is slow (Andrew and Gao, 2007): the procedure lacks a mechanism for zeroing out useless parameters. [sent-96, score-0.172]

54 , 2007), the authors show that OWL-QN is faster than the algorithm proposed by Kazama and Tsujii (2003) and can perform model selection even in very high-dimensional problems, with no loss of performance compared to the use of ℓ2 penalty terms. [sent-101, score-0.28]

55 3.3 Stochastic Gradient Descent Stochastic gradient (SGD) approaches update the parameter vector based on a crude approximation of the gradient (4), where the computation of expectations only includes a small batch of observations. [sent-103, score-0.575]

56 , 2009), various ways of adapting this update to ℓ1-penalized likelihood functions are discussed. [sent-106, score-0.187]

57 Two effective ideas are proposed: (i) only update parameters that correspond to active features in the current observation, (ii) keep track of the cumulated penalty zk that θk should have received, had the gradient been computed exactly, and use this value to “clip” the parameter value. [sent-107, score-0.847]

58 This is implemented by patching the update (7) as follows: if θk > 0, θk ← max(0, θk − zk); else if θk < 0, θk ← min(0, θk + zk) (8). Based on a study of three NLP benchmarks, the authors of (Tsuruoka et al. [sent-108, score-0.102]

59 , 2009) claim this approach to be much faster than the orthant-wise approach and yet to yield very comparable performance, while selecting slightly larger feature sets. [sent-109, score-0.048]
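
A minimal Python sketch of the clipped update (8) with the cumulative-penalty bookkeeping described in sentences 57–58; the function name, data structures, and toy numbers are assumptions made for illustration, not code from the paper or from Tsuruoka et al.

```python
# Sketch of one SGD step with a cumulative L1 penalty and clipping, in the
# spirit of update (8).  theta[k]: parameters; q[k]: penalty actually applied
# to coordinate k so far; u: penalty every coordinate should have received.
def sgd_l1_step(theta, q, u, grad, active, eta, rho1):
    u += eta * rho1                                   # cumulative penalty grows each step
    for k in active:                                  # (i) touch only active features
        theta[k] = theta.get(k, 0.0) - eta * grad[k]  # plain stochastic gradient step (7)
        before = theta[k]
        if theta[k] > 0:                              # (ii) clip using the cumulated penalty
            theta[k] = max(0.0, theta[k] - (u + q.get(k, 0.0)))
        elif theta[k] < 0:
            theta[k] = min(0.0, theta[k] + (u - q.get(k, 0.0)))
        q[k] = q.get(k, 0.0) + (theta[k] - before)    # record penalty actually applied
    return theta, q, u

# Toy usage with made-up numbers: the weakly pulled weight is zeroed out.
theta, q, u = sgd_l1_step({}, {}, 0.0, grad={0: -2.0, 1: 0.1}, active=[0, 1], eta=0.5, rho1=0.4)
print(theta)   # approximately {0: 0.8, 1: 0.0}
```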

60 3.4 Block Coordinate Descent The coordinate descent approach of Dudík et al. [sent-111, score-0.243]

61 (2008) uses the fact that the optimization of a mono-dimensional quadratic function augmented with an ℓ1 penalty can be performed analytically. [sent-113, score-0.329]
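
The analytic step referred to here is the familiar soft-thresholding operator; the sketch below shows it for a generic one-dimensional quadratic and is my illustration of the principle, not the paper's exact block update (which is derived from its own quadratic approximation of l(θ)).

```python
import math

# Closed-form minimizer of g(t) = (a/2) * (t - b)**2 + rho1 * abs(t), with a > 0:
# the soft-thresholding operator at the heart of (block) coordinate descent for L1.
def soft_threshold(a, b, rho1):
    return math.copysign(max(0.0, abs(b) - rho1 / a), b)

print(soft_threshold(a=2.0, b=1.5, rho1=1.0))   # 1.0: shrunk toward zero by rho1/a = 0.5
print(soft_threshold(a=2.0, b=0.3, rho1=1.0))   # 0.0: small coordinates are zeroed exactly
```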


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('crfs', 0.417), ('yt', 0.371), ('penalty', 0.28), ('gradient', 0.181), ('regularization', 0.162), ('descent', 0.157), ('newton', 0.146), ('xt', 0.144), ('orthant', 0.125), ('kfk', 0.11), ('cnrs', 0.11), ('fk', 0.109), ('update', 0.102), ('sgd', 0.101), ('kx', 0.094), ('basics', 0.094), ('zk', 0.094), ('quadratically', 0.094), ('tsuruoka', 0.089), ('coordinate', 0.086), ('functions', 0.085), ('lavergne', 0.084), ('limsi', 0.084), ('recursions', 0.084), ('yvon', 0.084), ('zeroing', 0.084), ('gao', 0.082), ('sparsity', 0.078), ('hessian', 0.073), ('crfsuite', 0.073), ('differentiable', 0.073), ('objective', 0.072), ('parameter', 0.069), ('quasi', 0.067), ('conditional', 0.062), ('labels', 0.061), ('ep', 0.06), ('convex', 0.059), ('okazaki', 0.059), ('grows', 0.059), ('exp', 0.059), ('kazama', 0.057), ('bottou', 0.057), ('nocedal', 0.057), ('numerical', 0.056), ('yielding', 0.056), ('tsujii', 0.054), ('xi', 0.053), ('stochastic', 0.053), ('fields', 0.053), ('benchmarks', 0.052), ('term', 0.052), ('fr', 0.05), ('quadratic', 0.049), ('sutton', 0.049), ('implies', 0.049), ('feature', 0.048), ('parameters', 0.048), ('approximations', 0.046), ('ims', 0.045), ('billion', 0.045), ('notations', 0.045), ('dependencies', 0.044), ('observation', 0.044), ('expectations', 0.042), ('adjacent', 0.041), ('implementation', 0.041), ('sign', 0.041), ('lafferty', 0.041), ('convergence', 0.04), ('smooth', 0.04), ('xn', 0.039), ('features', 0.039), ('estimation', 0.038), ('sn', 0.038), ('output', 0.038), ('negated', 0.037), ('doable', 0.037), ('eifls', 0.037), ('friedman', 0.037), ('vsa', 0.037), ('optimizers', 0.037), ('zeroed', 0.037), ('elastic', 0.037), ('clip', 0.037), ('xb', 0.037), ('pal', 0.037), ('linearchain', 0.037), ('perkins', 0.037), ('vasserman', 0.037), ('xlogp', 0.037), ('alel', 0.037), ('reparameterizing', 0.037), ('tary', 0.037), ('vishwanathan', 0.037), ('handle', 0.036), ('train', 0.035), ('andrew', 0.035), ('ideas', 0.034), ('yen', 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 197 acl-2010-Practical Very Large Scale CRFs

Author: Thomas Lavergne ; Olivier Cappe ; Francois Yvon

Abstract: Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels. Even for the simple linear-chain model, taking structure into account implies a number of parameters and a computational effort that grows quadratically with the cardinality of the label set. In this paper, we address the issue of training very large CRFs, containing up to hundreds of output labels and several billion features. Efficiency stems here from the sparsity induced by the use of an ℓ1 penalty term. Based on our own implementation, we compare three recent proposals for implementing this regularization strategy. Our experiments demonstrate that very large CRFs can be trained efficiently and that very large models are able to improve the accuracy, while delivering compact parameter sets.

2 0.22347365 159 acl-2010-Learning 5000 Relational Extractors

Author: Raphael Hoffmann ; Congle Zhang ; Daniel S. Weld

Abstract: Many researchers are trying to use information extraction (IE) to create large-scale knowledge bases from natural language text on the Web. However, the primary approach (supervised learning of relation-specific extractors) requires manually-labeled training data for each relation and doesn’t scale to the thousands of relations encoded in Web text. This paper presents LUCHS, a self-supervised, relation-specific IE system which learns 5025 relations, more than an order of magnitude greater than any previous approach, with an average F1 score of 61%. Crucial to LUCHS’s performance is an automated system for dynamic lexicon learning, which allows it to learn accurately from heuristically-generated training data, which is often noisy and sparse.

3 0.12703319 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries

Author: Xiao Li

Abstract: Determining the semantic intent of web queries not only involves identifying their semantic class, which is a primary focus of previous works, but also understanding their semantic structure. In this work, we formally define the semantic structure of noun phrase queries as comprised of intent heads and intent modifiers. We present methods that automatically identify these constituents as well as their semantic roles based on Markov and semi-Markov conditional random fields. We show that the use of semantic features and syntactic features significantly contribute to improving the understanding performance.

4 0.10784394 214 acl-2010-Sparsity in Dependency Grammar Induction

Author: Jennifer Gillenwater ; Kuzman Ganchev ; Joao Graca ; Fernando Pereira ; Ben Taskar

Abstract: A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.

5 0.10312661 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

Author: Jenny Rose Finkel ; Christopher D. Manning

Abstract: One of the main obstacles to producing high quality joint models is the lack of jointly annotated data. Joint modeling of multiple natural language processing tasks outperforms single-task models learned from the same data, but still underperforms compared to single-task models learned on the more abundant quantities of available single-task annotated data. In this paper we present a novel model which makes use of additional single-task annotated data to improve the performance of a joint model. Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model. Experiments on joint parsing and named entity recognition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.

6 0.08151143 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities

7 0.081080899 96 acl-2010-Efficient Optimization of an MDL-Inspired Objective Function for Unsupervised Part-Of-Speech Tagging

8 0.073497064 134 acl-2010-Hierarchical Sequential Learning for Extracting Opinions and Their Attributes

9 0.06533245 247 acl-2010-Unsupervised Event Coreference Resolution with Rich Linguistic Features

10 0.05780841 98 acl-2010-Efficient Staggered Decoding for Sequence Labeling

11 0.0558642 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning

12 0.053898398 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

13 0.048981488 195 acl-2010-Phylogenetic Grammar Induction

14 0.048337705 130 acl-2010-Hard Constraints for Grammatical Function Labelling

15 0.048177227 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization

16 0.048143227 154 acl-2010-Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm

17 0.047883488 265 acl-2010-cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models

18 0.047553569 205 acl-2010-SVD and Clustering for Unsupervised POS Tagging

19 0.047335707 17 acl-2010-A Structured Model for Joint Learning of Argument Roles and Predicate Senses

20 0.045919765 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.147), (1, 0.014), (2, -0.014), (3, -0.008), (4, 0.005), (5, -0.038), (6, 0.028), (7, 0.005), (8, 0.081), (9, 0.032), (10, -0.109), (11, 0.006), (12, 0.037), (13, -0.099), (14, -0.035), (15, -0.026), (16, -0.062), (17, 0.084), (18, -0.009), (19, -0.019), (20, 0.023), (21, -0.01), (22, -0.105), (23, -0.058), (24, -0.065), (25, 0.067), (26, 0.188), (27, -0.148), (28, 0.071), (29, -0.003), (30, 0.033), (31, -0.039), (32, -0.077), (33, -0.031), (34, 0.104), (35, 0.213), (36, -0.007), (37, -0.088), (38, -0.007), (39, 0.047), (40, -0.013), (41, 0.08), (42, 0.069), (43, 0.197), (44, 0.03), (45, -0.138), (46, -0.048), (47, 0.069), (48, 0.074), (49, -0.217)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95875943 197 acl-2010-Practical Very Large Scale CRFs

Author: Thomas Lavergne ; Olivier Cappe ; Francois Yvon

Abstract: Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels. Even for the simple linear-chain model, taking structure into account implies a number of parameters and a computational effort that grows quadratically with the cardinality of the label set. In this paper, we address the issue of training very large CRFs, containing up to hundreds of output labels and several billion features. Efficiency stems here from the sparsity induced by the use of an ℓ1 penalty term. Based on our own implementation, we compare three recent proposals for implementing this regularization strategy. Our experiments demonstrate that very large CRFs can be trained efficiently and that very large models are able to improve the accuracy, while delivering compact parameter sets.

2 0.71115637 159 acl-2010-Learning 5000 Relational Extractors

Author: Raphael Hoffmann ; Congle Zhang ; Daniel S. Weld

Abstract: Many researchers are trying to use information extraction (IE) to create large-scale knowledge bases from natural language text on the Web. However, the primary approach (supervised learning of relation-specific extractors) requires manually-labeled training data for each relation and doesn’t scale to the thousands of relations encoded in Web text. This paper presents LUCHS, a self-supervised, relation-specific IE system which learns 5025 relations, more than an order of magnitude greater than any previous approach, with an average F1 score of 61%. Crucial to LUCHS’s performance is an automated system for dynamic lexicon learning, which allows it to learn accurately from heuristically-generated training data, which is often noisy and sparse.

3 0.46524352 68 acl-2010-Conditional Random Fields for Word Hyphenation

Author: Nikolaos Trogkanis ; Charles Elkan

Abstract: Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TEX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.

4 0.44282973 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities

Author: Jun'ichi Kazama ; Stijn De Saeger ; Kow Kuroda ; Masaki Murata ; Kentaro Torisawa

Abstract: Existing word similarity measures are not robust to data sparseness since they rely only on the point estimation of words’ context profiles obtained from a limited amount of data. This paper proposes a Bayesian method for robust distributional word similarities. The method uses a distribution of context profiles obtained by Bayesian estimation and takes the expectation of a base similarity measure under that distribution. When the context profiles are multinomial distributions, the priors are Dirichlet, and the base measure is the Bhattacharyya coefficient, we can derive an analytical form that allows efficient calculation. For the task of word similarity estimation using a large amount of Web data in Japanese, we show that the proposed measure gives better accuracies than other well-known similarity measures.

5 0.43820643 185 acl-2010-Open Information Extraction Using Wikipedia

Author: Fei Wu ; Daniel S. Weld

Abstract: Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data. Open IE systems such as TextRunner, on the other hand, aim to handle the unbounded number of relations found on the Web. But how well can these open systems perform? This paper presents WOE, an open IE system which improves dramatically on TextRunner’s precision and recall. The key to WOE’s performance is a novel form of self-supervised learning for open extractors using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data. Like TextRunner, WOE’s extractor eschews lexicalized features and handles an unbounded set of semantic relations. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.

6 0.43619379 111 acl-2010-Extracting Sequences from the Web

7 0.38867781 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization

8 0.37301341 98 acl-2010-Efficient Staggered Decoding for Sequence Labeling

9 0.37281743 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries

10 0.37252936 96 acl-2010-Efficient Optimization of an MDL-Inspired Objective Function for Unsupervised Part-Of-Speech Tagging

11 0.37129864 214 acl-2010-Sparsity in Dependency Grammar Induction

12 0.36836028 183 acl-2010-Online Generation of Locality Sensitive Hash Signatures

13 0.35191923 256 acl-2010-Vocabulary Choice as an Indicator of Perspective

14 0.35079122 182 acl-2010-On the Computational Complexity of Dominance Links in Grammatical Formalisms

15 0.35054576 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices

16 0.33346719 154 acl-2010-Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm

17 0.313474 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

18 0.3026177 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web

19 0.29381949 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning

20 0.29092127 66 acl-2010-Compositional Matrix-Space Models of Language


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.281), (4, 0.017), (14, 0.023), (25, 0.063), (33, 0.018), (42, 0.016), (44, 0.015), (51, 0.012), (59, 0.093), (71, 0.012), (73, 0.051), (78, 0.028), (83, 0.137), (84, 0.029), (98, 0.121)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81389296 197 acl-2010-Practical Very Large Scale CRFs

Author: Thomas Lavergne ; Olivier Cappe ; Francois Yvon

Abstract: Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels. Even for the simple linear-chain model, taking structure into account implies a number of parameters and a computational effort that grows quadratically with the cardinality of the label set. In this paper, we address the issue of training very large CRFs, containing up to hundreds of output labels and several billion features. Efficiency stems here from the sparsity induced by the use of an ℓ1 penalty term. Based on our own implementation, we compare three recent proposals for implementing this regularization strategy. Our experiments demonstrate that very large CRFs can be trained efficiently and that very large models are able to improve the accuracy, while delivering compact parameter sets.

2 0.70905292 188 acl-2010-Optimizing Informativeness and Readability for Sentiment Summarization

Author: Hitoshi Nishikawa ; Takaaki Hasegawa ; Yoshihiro Matsuo ; Genichiro Kikui

Abstract: We propose a novel algorithm for sentiment summarization that takes account of informativeness and readability, simultaneously. Our algorithm generates a summary by selecting and ordering sentences taken from multiple review texts according to two scores that represent the informativeness and readability of the sentence order. The informativeness score is defined by the number of sentiment expressions and the readability score is learned from the target corpus. We evaluate our method by summarizing reviews on restaurants. Our method outperforms an existing algorithm as indicated by its ROUGE score and human readability experiments.

3 0.63612145 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

Author: Mohit Bansal ; Dan Klein

Abstract: We present a simple but accurate parser which exploits both large tree fragments and symbol refinement. We parse with all fragments of the training set, in contrast to much recent work on tree selection in data-oriented parsing and tree-substitution grammar learning. We require only simple, deterministic grammar symbol refinement, in contrast to recent work on latent symbol refinement. Moreover, our parser requires no explicit lexicon machinery, instead parsing input sentences as character streams. Despite its simplicity, our parser achieves accuracies of over 88% F1 on the standard English WSJ task, which is competitive with substantially more complicated state-of-the-art lexicalized and latent-variable parsers. Additional specific contributions center on making implicit all-fragments parsing efficient, including a coarse-to-fine inference scheme and a new graph encoding.

4 0.61103207 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields

Author: Jackie Chi Kit Cheung ; Gerald Penn

Abstract: One goal of natural language generation is to produce coherent text that presents information in a logical order. In this paper, we show that topological fields, which model high-level clausal structure, are an important component of local coherence in German. First, we show in a sentence ordering experiment that topological field information improves the entity grid model of Barzilay and Lapata (2008) more than grammatical role and simple clausal order information do, particularly when manual annotations of this information are not available. Then, we incorporate the model enhanced with topological fields into a natural language generation system that generates constituent orders for German text, and show that the added coherence component improves performance slightly, though not statistically significantly.

5 0.60937166 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese

Author: Junhui Li ; Guodong Zhou ; Hwee Tou Ng

Abstract: This paper explores joint syntactic and semantic parsing of Chinese to further improve the performance of both syntactic and semantic parsing, in particular the performance of semantic parsing (in this paper, semantic role labeling). This is done from two levels. Firstly, an integrated parsing approach is proposed to integrate semantic parsing into the syntactic parsing process. Secondly, semantic information generated by semantic parsing is incorporated into the syntactic parsing model to better capture semantic information in syntactic parsing. Evaluation on Chinese TreeBank, Chinese PropBank, and Chinese NomBank shows that our integrated parsing approach outperforms the pipeline parsing approach on n-best parse trees, a natural extension of the widely used pipeline parsing approach on the top-best parse tree. Moreover, it shows that incorporating semantic role-related information into the syntactic parsing model significantly improves the performance of both syntactic parsing and semantic parsing. To our best knowledge, this is the first research on exploring syntactic parsing and semantic role labeling for both verbal and nominal predicates in an integrated way.

6 0.60531354 247 acl-2010-Unsupervised Event Coreference Resolution with Rich Linguistic Features

7 0.60474962 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

8 0.60420483 71 acl-2010-Convolution Kernel over Packed Parse Forest

9 0.60322696 1 acl-2010-"Ask Not What Textual Entailment Can Do for You..."

10 0.6022315 195 acl-2010-Phylogenetic Grammar Induction

11 0.60121739 252 acl-2010-Using Parse Features for Preposition Selection and Error Detection

12 0.6009292 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People

13 0.5997687 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

14 0.59975636 158 acl-2010-Latent Variable Models of Selectional Preference

15 0.59970236 73 acl-2010-Coreference Resolution with Reconcile

16 0.59930694 155 acl-2010-Kernel Based Discourse Relation Recognition with Temporal Ordering Information

17 0.59930247 39 acl-2010-Automatic Generation of Story Highlights

18 0.59930027 233 acl-2010-The Same-Head Heuristic for Coreference

19 0.59852362 219 acl-2010-Supervised Noun Phrase Coreference Research: The First Fifteen Years

20 0.5982846 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features