acl acl2011 acl2011-189 knowledge-graph by maker-knowledge-mining

189 acl-2011-K-means Clustering with Feature Hashing


Source: pdf

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to K-means, by showing how much the objective of K-means will be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. [sent-3, score-0.495]

2 We address this issue by using feature hashing (Weinberger et al. [sent-4, score-0.539]

3 , 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. [sent-5, score-0.227]

4 Our analysis gives theoretical motivation and justification for applying feature hashing to K-means, by showing how much the objective of K-means will be (additively) distorted. [sent-6, score-0.568]

5 Furthermore, to empirically verify our method, we experimented on a document clustering task. [sent-7, score-0.178]

6 1 Introduction In natural language processing (NLP) and text mining, clustering methods are crucial for various tasks such as document clustering. [sent-8, score-0.117]

7 Among them, Kmeans (MacQueen, 1967; Lloyd, 1982) is “the most important flat clustering algorithm” (Manning et al. [sent-9, score-0.083]

8 One of the major problems of K-means is that it has K centroids which are dense vectors where K is the number of clusters. [sent-11, score-0.301]

9 Thus, it is infeasible to store them in memory and slow to compute if the dimension of inputs is huge, as is often the case with NLP and text mining tasks. [sent-12, score-0.124]

10 Weinberger et al. (2009) introduced feature hashing, a simple yet effective and analyzable dimension-reduction technique for large-scale multitask learning. [sent-16, score-0.206]

11 The idea is to combine features which have the same hash value. [sent-17, score-0.405]

12 For example, given a hash function h and a vector x, if h(1012) = h(41234) = 42, we make a new vector y by setting y_42 = x_1012 + x_41234 (or equally possibly x_1012 − x_41234, −x_1012 + x_41234, or −x_1012 − x_41234). [sent-18, score-0.405]

13 This trick greatly reduces the size of dense vectors, since the maximum index value becomes equivalent to the maximum hash value of h. [sent-19, score-0.607]
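To make the combining step concrete, here is a minimal Python sketch (ours, not the paper's code; the unsigned variant, with the sign hash of the later definition omitted) that collapses a sparse vector into an m-dimensional dense one by summing the components whose indices collide under a hash function h.

```python
# Minimal sketch of the basic hashing trick (unsigned variant): components whose
# original indices collide under h are summed into the same hashed component.
def hash_trick(sparse_x, m):
    """sparse_x: dict mapping original feature index -> value; m: hashed dimension."""
    y = [0.0] * m
    for index, value in sparse_x.items():
        y[hash(index) % m] += value  # h(index) = hash(index) mod m
    return y

# If indices 1012 and 41234 collide (h(1012) = h(41234)), their values are added
# into a single component of y, so len(y) == m regardless of the input dimension.
y = hash_trick({1012: 3.0, 41234: 2.0, 7: 1.0}, m=64)
```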

14 Furthermore, unlike random projection (Achlioptas, 2003; Boutsidis et al. [sent-20, score-0.031]

15 , 2010), feature hashing retains sparsity of sparse input vectors. [sent-21, score-0.539]

16 An additional useful trait for NLP tasks is that it can save much memory by eliminating alphabet storage (see the preliminaries for details). [sent-22, score-0.221]

17 The authors also justified their method by showing that with feature hashing, the dot product is unbiased, and the length of each vector is well-preserved with high probability under some conditions. [sent-23, score-0.115]

18 Plausibly, this technique is also useful for clustering methods such as K-means. [sent-24, score-0.134]

19 In this paper, to motivate applying feature hashing to K-means, we show the residual sum of squares, the objective of K-means, is well-preserved under feature hashing. [sent-25, score-0.792]

20 We also demonstrate an experiment on document clustering and see the feature size can be shrunk into 3. [sent-26, score-0.265]

21 1 Notation In this paper, ||·|| denotes the Euclidean norm, and ⟨·, ·⟩ denotes the dot product. [sent-31, score-0.028]

22 If we want to group input vectors into K clusters, K-means can surely output clusters ω1, . . . , ωK and centroids µ1, . . . [sent-35, score-0.164]

23 , µK such that they locally minimize the residual sum of squares (RSS), which is defined as RSS = Σ_{k=1}^{K} Σ_{x ∈ ωk} ||x − µk||². [sent-41, score-0.291]

24 In the algorithm, µk is set to the mean of the vectors in cluster ωk. [sent-42, score-0.103]
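As a reference point for the analysis that follows, here is a small Python sketch (our illustration, not the paper's code) that computes RSS for a given assignment of points to clusters.

```python
def rss(points, assignments, centroids):
    """Residual sum of squares: sum over clusters k, and over x in cluster k, of ||x - mu_k||^2.
    points: list of vectors; assignments: cluster index per point; centroids: list of mean vectors."""
    total = 0.0
    for x, k in zip(points, assignments):
        total += sum((xj - mj) ** 2 for xj, mj in zip(x, centroids[k]))
    return total
```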

25 Note that RSS can be regarded as a metric, since the sum of metrics (in this case, squared Euclidean distances) also becomes a metric by constructing a 1-norm product metric. [sent-44, score-0.256]

26 3 Additive distortion Suppose one wants to embed a metric space (X, d) into another one (X′, d′) by a mapping φ. [sent-46, score-0.139]

27 4 Hashing tricks According to an account by John Langford,¹ a co-author of papers on feature hashing (Shi et al., 2009), [sent-52, score-0.699]

28 hashing tricks for dimension-reduction were implemented in various machine learning libraries, including Vowpal Wabbit, which he released in 2007. [sent-54, score-0.613]

29 Ganchev and Dredze (2008) named their hashing trick random feature mixing and empirically supported it by experimenting on NLP tasks. [sent-55, score-0.642]

30 It is similar to feature hashing except that it lacks a binary hash function. (¹ http://hunch.net) [sent-56, score-0.944]

31 The paper also showed that hashing tricks are useful to eliminate alphabet storage. [sent-59, score-0.686]

32 Shi et al. (2009) suggested the hash kernel, that is, the dot product on a hashed space. [sent-61, score-0.742]

33 They conducted thorough research both theoretically and experimentally, extending this technique to classification of graphs and multi-class classification. [sent-62, score-0.085]

34 Weinberger et al. (2009)² introduced the technique feature hashing (the function itself is called the hashed feature map), which incorporates a binary hash function into hashing tricks in order to guarantee that the hash kernel is unbiased. [sent-65, score-2.493]

35 They also showed applications to various real-world problems such as multitask learning and collaborative filtering. [sent-66, score-0.069]

36 Though their proof for exponential tail bounds in the original paper was refuted later, they reproved it under some extra conditions in the latest version. [sent-67, score-0.078]

37 Let S be a set of hashable features, h be a hash function h : S → {1, . . . , m}, and ξ be a hash function ξ : S → {±1}. [sent-71, score-0.469]

38 The hashed feature map φ^(h,ξ) : R^|S| → R^m is a function such that the i-th element of φ^(h,ξ)(x) is given by [sent-75, score-0.481]

39 φ_i^(h,ξ)(x) = Σ_{j : h(j)=i} ξ(j) x_j. [sent-76, score-0.056]
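A minimal Python sketch of this definition (our illustration; the md5-based stand-ins for h and ξ are assumptions, since no particular hash function is prescribed here):

```python
import hashlib

def _digest(j, salt):
    """Deterministic integer derived from a feature identifier j and a salt string."""
    return int(hashlib.md5((salt + ":" + str(j)).encode("utf-8")).hexdigest(), 16)

def hashed_feature_map(x, m):
    """phi^(h,xi): map a sparse vector x (dict: feature -> value) into R^m, with
    phi_i(x) = sum over j with h(j) = i of xi(j) * x_j (indices are 0-based here)."""
    phi = [0.0] * m
    for j, value in x.items():
        i = _digest(j, "h") % m                            # h : S -> {0, ..., m-1}
        sign = 1.0 if _digest(j, "xi") % 2 == 0 else -1.0  # xi : S -> {+1, -1}
        phi[i] += sign * value
    return phi
```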

40 A kernel function is also defined on the hashed feature map. [sent-78, score-0.48]

41 The hash kernel ⟨·, ·⟩_φ is defined as ⟨x, x′⟩_φ = ⟨φ(x), φ(x′)⟩. [sent-81, score-0.462]

42 They also proved the following theorem, which we use in our analysis. [sent-82, score-0.03]

43 The hash kernel is unbiased, that is, E_φ[⟨x, x′⟩_φ] = ⟨x, x′⟩. [sent-85, score-0.462]

44 The variance is Var_φ[⟨x, x′⟩_φ] = (1/m) Σ_{i≠j} (x_i² x′_j² + x_i x′_i x_j x′_j). ²The latest version of this paper is at arXiv (http://arxiv.org/abs/0902.2206), [sent-86, score-0.143]

45 with the correction to Theorem 3 in the original paper included in the Proceedings of ICML '09. [sent-88, score-0.039]
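A quick Monte Carlo sanity check of the unbiasedness statement (a sketch under the assumption that h and ξ are drawn uniformly at random; it only approximates the expectation and is not a proof):

```python
import random

def mean_hash_kernel(x, y, m, trials=5000, seed=0):
    """Average <x, y>_phi over many random draws of (h, xi); by the theorem this
    average should be close to the exact dot product <x, y>."""
    rng = random.Random(seed)
    n, total = len(x), 0.0
    for _ in range(trials):
        h = [rng.randrange(m) for _ in range(n)]           # random index hash
        xi = [rng.choice((1.0, -1.0)) for _ in range(n)]   # random sign hash
        phi_x, phi_y = [0.0] * m, [0.0] * m
        for j in range(n):
            phi_x[h[j]] += xi[j] * x[j]
            phi_y[h[j]] += xi[j] * y[j]
        total += sum(a * b for a, b in zip(phi_x, phi_y))
    return total / trials

x, y = [1.0, 2.0, 0.0, 3.0], [0.5, 0.0, 1.0, 2.0]
exact = sum(a * b for a, b in zip(x, y))   # 6.5
approx = mean_hash_kernel(x, y, m=2)       # close to 6.5 even for a tiny m
```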

46 1 Eliminating alphabet storage In this kind of hashing trick, an index of the inputs does not have to be an integer but can be any hashable value, including a string. [sent-91, score-0.707]

47 Ganchev and Dredze (2008) argued this property is particularly useful for implementing NLP applications, since we no longer need an alphabet, a dictionary which maps features to parameters. [sent-92, score-0.032]

48 For instance, a feature 'the current word ends with -ing' can be expressed as a string cur:end:ing (here we suppose : is a control character). [sent-95, score-0.161]

49 Since indices of dense vectors (which may be implemented with arrays) must be integers, traditionally we need a dictionary to map these strings to integers, which may waste much memory. [sent-96, score-0.311]

50 Feature hashing removes this memory waste by converting strings to integers with on-the-fly computation. [sent-97, score-0.615]
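The contrast can be sketched as follows (our illustration; the hash function and the dimension m = 2^18 are arbitrary choices, not the paper's):

```python
import hashlib

m = 2 ** 18  # hashed dimension (an arbitrary illustrative choice)

# Traditional approach: a growing alphabet maps each feature string to an integer index.
alphabet = {}
def index_with_alphabet(name):
    if name not in alphabet:
        alphabet[name] = len(alphabet)  # the dictionary itself consumes memory
    return alphabet[name]

# Feature hashing: the index is computed on the fly and nothing is stored.
def index_with_hashing(name):
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16) % m

i = index_with_hashing("cur:end:ing")  # the feature 'the current word ends with -ing'
```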

51 3 Method For dimension-reduction to K-means, we propose a new method hashed K-means. [sent-98, score-0.337]

52 Given a hashed feature map φ, hashed K-means runs K-means on φ(x1) , . [sent-103, score-0.818]
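A minimal sketch of this pipeline, assuming the hashed_feature_map sketch above and scikit-learn's KMeans (any K-means implementation would do; this is our illustration, not the authors' code):

```python
from sklearn.cluster import KMeans  # any standard K-means implementation works here

def hashed_kmeans(docs, m, k):
    """Hash each sparse document vector into R^m, then run ordinary K-means on the
    hashed vectors; docs is a list of {feature: value} dicts."""
    X = [hashed_feature_map(doc, m) for doc in docs]  # assumed helper defined earlier
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.labels_, km.cluster_centers_
```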

53 4 Analysis In this section, we show clusters obtained by the hashed K-means are also good clusters in the original space with high probability. [sent-107, score-0.498]

54 While Weinberger et al. (2009) proved a theorem on (multiplicative) distortion for Euclidean distance under some tight conditions, we illustrate (additive) distortion for RSS. [sent-109, score-0.399]

55 Since K-means is a process which monotonically decreases RSS in each step, if RSS is not distorted so much by feature hashing, we can expect results to be reliable to some extent. [sent-110, score-0.142]

56 Let us define the difference of the residual sum of squares (DRSS). [sent-111, score-0.26]

57 , µK be their corresponding centroids in the original space, φ be a hashed feature map, and . [sent-120, score-0.566]

58 Xk=1 xX∈ωk Before analysis, we define a notation for the (Euclidean) length under a hashed space: Definition 4. [sent-125, score-0.337]

59 The hash length ||·||_φ is defined as ||x||_φ = ||φ(x)|| = √⟨φ(x), φ(x)⟩ = √⟨x, x⟩_φ. [sent-127, score-0.405]

60 To this end, it is vital to know the expectation and variance of the sum of squared hash lengths. [sent-131, score-0.669]

61 Because the variance of a sum of random variables is determined by the covariances between pairs of the variables, we first derive the covariance between the squared hash lengths of two vectors. [sent-132, score-0.858]

62 The covariance between the squared hash lengths of two vectors x, y ∈ R^n is Cov_φ(||x||_φ², ||y||_φ²) = ψ(x, y)/m, where ψ(x, y) = 2 Σ_{i≠j} x_i x_j y_i y_j. [sent-135, score-0.67]

63 This lemma can be proven by the same technique described in Appendix A of Weinberger et al. [sent-136, score-0.126]

64 Since the expectation of a sum is the sum of expectations, we readily know that the expectation is zero: E_φ[X] = 0. [sent-160, score-0.218]

65 Since adding constants to the inputs of covariance does not change its result, from Lemma 4. [sent-161, score-0.129]

66 Because the variance of the sum of random variables is the sum of the covariances between every pair of them, Var_φ[X] = (1/m) Σ_{i=1}^{N} Σ_{j=1}^{N} ψ(x_i, x_j). [sent-163, score-0.323]

67 Finally, we see the following theorem for additive distortion. [sent-164, score-0.217]

68 Let Ψ be the sum of ψ(x, y) over all observed pairs x, y, each of which expresses the difference between an example and its corresponding centroid. [sent-167, score-0.089]

69 where 0 < γ ≤ 1, with probability at least 1 − γ, RSS is additively distorted by ε. [sent-173, score-0.12]

70 Note that a hashed feature map is linear, since φ(x) = Mx with a matrix M such that M_{i,j} = ξ(i) δ_{h(i),j}. [sent-176, score-0.481]

71 The existence of Ψ in the theorem suggests that, to use feature hashing, we should remove useless features with high values from the data in advance. [sent-181, score-0.237]

72 For example, if frequencies of words are used as features, function words should be ignored, not only because they give no information for clustering but also because their high frequencies magnify distortion. [sent-182, score-1.118]
(Figure 1: The change of F5-measure along with the hash size m.)

73 5 Experiments To empirically verify our method, from 20 Newsgroups, a dataset for document classification or clustering,³ we chose 6 classes and randomly drew 100 documents for each class. [sent-183, score-0.095]

74 We used unigrams and bigrams as features and ran our method for various hash sizes m (Figure 1). [sent-184, score-0.405]

75 The number of unigrams is 33,017 and bigrams 109,395, so the feature size in the original space is 142,412. [sent-185, score-0.155]

76 For example, if a document pair in an output cluster is actually in the same class, it is counted as a true positive. [sent-189, score-0.062]

77 In contrast, if it is actually in a different class, it is counted as a false positive. [sent-190, score-0.028]
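Our reconstruction of this pairwise evaluation as code (an assumption about the exact setup; beta = 5 mirrors the F5-measure reported in Figure 1):

```python
from itertools import combinations

def pairwise_f_beta(pred, gold, beta=5.0):
    """Count document pairs: same output cluster and same gold class -> TP;
    same cluster, different class -> FP; different cluster, same class -> FN."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_cluster = pred[i] == pred[j]
        same_class = gold[i] == gold[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```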

78 At first glance, it seems odd that performance can be higher than the original where m is low. [sent-200, score-0.039]

79 A possible hypothesis is that, since K-means only locally minimizes RSS and in general there are many local minima far from the global optimum, distortion can sometimes be useful for escaping a bad local minimum and reaching a better one. [sent-201, score-0.14]

80 As a rule, however, large distortion kills clustering performance as shown in the figure. [sent-202, score-0.192]

81 Although clustering is heavily case-dependent, in this experiment, the resulting clusters are still reliable where the hash size is 3. [sent-203, score-0.579]

82 Combining their method and feature hashing as shown in our paper will produce a new, efficient method (possibly it could be named hashed K-means++). [sent-206, score-0.876]

83 We will analyze and experiment with this method in the future. [sent-207, score-0.032]

84 7 Conclusion In this paper, we argued that applying feature hashing to K-means is beneficial for memory-efficiency. [sent-208, score-0.571]

85 We supported our argument and analysis by an experiment on document clustering, showing we could safely shrink memory-usage into 3. [sent-210, score-0.095]

86 In the future, we will analyze the technique on other learning methods such as Kmeans++ and experiment on various real-data NLP tasks. [sent-212, score-0.083]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hashing', 0.453), ('hash', 0.405), ('hashed', 0.337), ('weinberger', 0.224), ('rss', 0.197), ('tricks', 0.16), ('theorem', 0.151), ('drss', 0.128), ('distortion', 0.109), ('centroids', 0.104), ('vectors', 0.103), ('xk', 0.101), ('dense', 0.094), ('squares', 0.093), ('sum', 0.089), ('kmeans', 0.089), ('feature', 0.086), ('hx', 0.085), ('covariance', 0.085), ('clustering', 0.083), ('residual', 0.078), ('squared', 0.077), ('euclidean', 0.075), ('lemma', 0.075), ('alphabet', 0.073), ('langford', 0.073), ('multitask', 0.069), ('additive', 0.066), ('additively', 0.064), ('boutsidis', 0.064), ('chebyshev', 0.064), ('hashable', 0.064), ('integers', 0.062), ('clusters', 0.061), ('map', 0.058), ('variance', 0.058), ('kernel', 0.057), ('ganchev', 0.057), ('cov', 0.056), ('waste', 0.056), ('distorted', 0.056), ('lloyd', 0.056), ('shi', 0.055), ('macqueen', 0.052), ('technique', 0.051), ('xx', 0.048), ('suppose', 0.047), ('arxiv', 0.046), ('memory', 0.044), ('pi', 0.044), ('unbiased', 0.044), ('inputs', 0.044), ('manning', 0.042), ('px', 0.041), ('trick', 0.041), ('xi', 0.041), ('expectation', 0.04), ('latest', 0.039), ('original', 0.039), ('projections', 0.038), ('tokyo', 0.038), ('smola', 0.038), ('tp', 0.037), ('arthur', 0.037), ('index', 0.037), ('infeasible', 0.036), ('storage', 0.036), ('preliminaries', 0.036), ('let', 0.035), ('nlp', 0.035), ('theoretically', 0.034), ('dredze', 0.034), ('document', 0.034), ('argued', 0.032), ('eliminating', 0.032), ('experiment', 0.032), ('empirically', 0.031), ('random', 0.031), ('locally', 0.031), ('rn', 0.03), ('definition', 0.03), ('size', 0.03), ('metric', 0.03), ('proved', 0.03), ('verify', 0.03), ('showing', 0.029), ('huge', 0.029), ('variables', 0.028), ('counted', 0.028), ('anastasios', 0.028), ('covariances', 0.028), ('achlioptas', 0.028), ('dimitris', 0.028), ('trhee', 0.028), ('kilian', 0.028), ('sdo', 0.028), ('abs', 0.028), ('anr', 0.028), ('cur', 0.028), ('hna', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

2 0.43943584 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search

Author: Qixia Jiang ; Maosong Sun

Abstract: Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several state-of-the-art methods on two large datasets. All the results show that our method gets the best performance. 1

3 0.16260397 135 acl-2011-Faster and Smaller N-Gram Language Models

Author: Adam Pauls ; Dan Klein

Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.

4 0.095409237 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting

Author: Benjamin Van Durme ; Ashwin Lall

Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.

5 0.0585053 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

Author: Andreas Zollmann ; Stephan Vogel

Abstract: In this work we propose methods to label probabilistic synchronous context-free grammar (PSCFG) rules using only word tags, generated by either part-of-speech analysis or unsupervised word class induction. The proposals range from simple tag-combination schemes to a phrase clustering model that can incorporate an arbitrary number of features. Our models improve translation quality over the single generic label approach of Chiang (2005) and perform on par with the syntactically motivated approach from Zollmann and Venugopal (2006) on the NIST large Chineseto-English translation task. These results persist when using automatically learned word tags, suggesting broad applicability of our technique across diverse language pairs for which syntactic resources are not available.

6 0.052924223 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

7 0.0526754 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling

8 0.051749308 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation

9 0.051558878 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

10 0.049552642 342 acl-2011-full-for-print

11 0.046869595 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

12 0.045602031 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation

13 0.043310393 204 acl-2011-Learning Word Vectors for Sentiment Analysis

14 0.04224176 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

15 0.042107034 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

16 0.040596299 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

17 0.039206304 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

18 0.038607299 106 acl-2011-Dual Decomposition for Natural Language Processing

19 0.038094796 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

20 0.037693717 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.112), (1, 0.016), (2, -0.028), (3, 0.006), (4, -0.021), (5, -0.051), (6, 0.002), (7, -0.005), (8, -0.008), (9, 0.02), (10, 0.056), (11, 0.001), (12, 0.031), (13, 0.063), (14, -0.017), (15, -0.023), (16, -0.087), (17, 0.011), (18, 0.001), (19, -0.05), (20, 0.067), (21, -0.149), (22, 0.131), (23, -0.039), (24, -0.08), (25, -0.135), (26, -0.001), (27, -0.065), (28, 0.089), (29, -0.054), (30, 0.16), (31, 0.123), (32, 0.238), (33, 0.408), (34, -0.062), (35, -0.169), (36, 0.096), (37, -0.153), (38, -0.035), (39, -0.02), (40, -0.095), (41, 0.045), (42, -0.114), (43, 0.079), (44, -0.018), (45, 0.134), (46, 0.092), (47, -0.039), (48, 0.015), (49, -0.098)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95037413 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

2 0.93555593 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search

Author: Qixia Jiang ; Maosong Sun

Abstract: Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several state-of-the-art methods on two large datasets. All the results show that our method gets the best performance. 1

3 0.75291359 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting

Author: Benjamin Van Durme ; Ashwin Lall

Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.

4 0.65217125 135 acl-2011-Faster and Smaller N-Gram Language Models

Author: Adam Pauls ; Dan Klein

Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.

5 0.36067456 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

Author: Hakan Ceylan ; Rada Mihalcea

Abstract: We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and performs a retrieval of any N-gram with a single disk access. With an increased index size of 420 MB and duplicate data, it also allows users to issue wild card queries provided that the wild cards in the query are contiguous. Furthermore, we also implement some of the smoothing algorithms that are designed specifically for large datasets and are shown to yield better language models than the traditional ones on the Web1T 5gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substi- tution task by a simple implementation that gives considerable improvement over a basic language model.

6 0.29144394 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

7 0.28504395 1 acl-2011-(11-06-spirl)

8 0.27704141 239 acl-2011-P11-5002 k2opt.pdf

9 0.26842687 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

10 0.25922784 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

11 0.24431533 278 acl-2011-Semi-supervised condensed nearest neighbor for part-of-speech tagging

12 0.22495431 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

13 0.21812423 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining

14 0.20945381 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

15 0.20936494 291 acl-2011-SystemT: A Declarative Information Extraction System

16 0.20728266 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

17 0.20255448 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

18 0.20179152 303 acl-2011-Tier-based Strictly Local Constraints for Phonology

19 0.19850703 234 acl-2011-Optimal Head-Driven Parsing Complexity for Linear Context-Free Rewriting Systems

20 0.19759187 342 acl-2011-full-for-print


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.023), (17, 0.036), (37, 0.076), (39, 0.026), (41, 0.547), (55, 0.021), (59, 0.038), (72, 0.013), (77, 0.013), (91, 0.03), (96, 0.092)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94802535 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport

Author: Siwei Wang ; Gina-Anne Levow

Abstract: Verbal feedback is an important information source in establishing interactional rapport. However, predicting verbal feedback across languages is challenging due to languagespecific differences, inter-speaker variation, and the relative sparseness and optionality of verbal feedback. In this paper, we employ an approach combining classifier weighting and SMOTE algorithm oversampling to improve verbal feedback prediction in Arabic, English, and Spanish dyadic conversations. This approach improves the prediction of verbal feedback, up to 6-fold, while maintaining a high overall accuracy. Analyzing highly weighted features highlights widespread use of pitch, with more varied use of intensity and duration.

2 0.94101846 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction

Author: Yu Hong ; Jianfeng Zhang ; Bin Ma ; Jianmin Yao ; Guodong Zhou ; Qiaoming Zhu

Abstract: Event extraction is the task of detecting certain specified types of events that are mentioned in the source language data. The state-of-the-art research on the task is transductive inference (e.g. cross-event inference). In this paper, we propose a new method of event extraction by well using cross-entity inference. In contrast to previous inference methods, we regard entitytype consistency as key feature to predict event mentions. We adopt this inference method to improve the traditional sentence-level event extraction system. Experiments show that we can get 8.6% gain in trigger (event) identification, and more than 11.8% gain for argument (role) classification in ACE event extraction. 1

3 0.92144483 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars

Author: Antske Fokkens

Abstract: When designing grammars of natural language, typically, more than one formal analysis can account for a given phenomenon. Moreover, because analyses interact, the choices made by the engineer influence the possibilities available in further grammar development. The order in which phenomena are treated may therefore have a major impact on the resulting grammar. This paper proposes to tackle this problem by using metagrammar development as a methodology for grammar engineering. Iargue that metagrammar engineering as an approach facilitates the systematic exploration of grammars through comparison of competing analyses. The idea is illustrated through a comparative study of auxiliary structures in HPSG-based grammars for German and Dutch. Auxiliaries form a central phenomenon of German and Dutch and are likely to influence many components of the grammar. This study shows that a special auxiliary+verb construction significantly improves efficiency compared to the standard argument-composition analysis for both parsing and generation.

same-paper 4 0.91303337 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

5 0.90452409 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

Author: Jinho D. Choi ; Martha Palmer

Abstract: This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a bootstrapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-theart performance with respect to other parsing approaches evaluated on the same data set.

6 0.9042598 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

7 0.90214491 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

8 0.89145434 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers

9 0.81612635 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

10 0.62100077 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

11 0.61310095 94 acl-2011-Deciphering Foreign Language

12 0.60669905 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts

13 0.59899628 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

14 0.58276016 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue

15 0.56975424 135 acl-2011-Faster and Smaller N-Gram Language Models

16 0.56591934 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

17 0.56407738 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

18 0.55540556 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

19 0.55399477 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

20 0.54902744 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing