acl acl2011 acl2011-80 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Oliver Schneider ; Alex Garnett
Abstract: We present ConsentCanvas, a system which structures and “texturizes” End-User License Agreement (EULA) documents to be more readable. The system aims to help users better understand the terms under which they are providing their informed consent. ConsentCanvas receives unstructured text documents as input and uses unsupervised natural language processing methods to embellish the source document using a linked stylesheet. Unlike similar usable security projects which employ summarization techniques, our system preserves the contents of the source document, minimizing the cognitive and legal burden for both the end user and the licensor. Our system does not require a corpus for training. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present ConsentCanvas, a system which structures and “texturizes” End-User License Agreement (EULA) documents to be more readable. [sent-4, score-0.065]
2 The system aims to help users better understand the terms under which they are providing their informed consent. [sent-5, score-0.028]
3 ConsentCanvas receives unstructured text documents as input and uses unsupervised natural language processing methods to embellish the source document using a linked stylesheet. [sent-6, score-0.267]
4 Unlike similar usable security projects which employ summarization techniques, our system preserves the contents of the source document, minimizing the cognitive and legal burden for both the end user and the licensor. [sent-7, score-0.508]
5 1 Introduction Less than 2% of users read End-User License Agreement (EULA) documents when indicating their consent to the software installation process (Good et al. [sent-9, score-0.239]
6 While these documents often serve as a user’s sole direct interaction with the legal terms of the software, they are usually not read, as they are presented in such a way as is divorced from the use of the software itself (Friedman et al. [sent-11, score-0.242]
7 To address this, Kay and Terry (2010) developed what they call Textured Consent agreements which employ a linked stylesheet to augment salient parts of a EULA document. [sent-13, score-0.195]
8 We have developed a system, ConsentCanvas, for automating the creation of a Textured Consent document from an unstructured EULA based on the example XHTML/CSS template provided by Kay and Terry (2010; Figure 1). [sent-15, score-0.165]
9 Instead, it makes use of regular expressions and correlation functions to identify variable-length relevant phrases (Kim and Chan, 2004) to alter the document’s structure and appearance. [sent-17, score-0.313]
10 The system automates the labour-intensive manual process used by Kay and Terry (2010). [sent-19, score-0.029]
11 We also present the first available implementation of Kim and Chan’s algorithm (2004). [sent-21, score-0.04]
12 As such, we contribute not just a working application, but also an extensible framework for the visual embellishment of plaintext documents. [sent-34, score-0.053]
13 1 Analysis Our system takes plain-text EULA documents as input through a simple command-line interface. [sent-36, score-0.065]
14 It then passes this document to four independent submodules for analysis. [sent-37, score-0.129]
15 Each submodule stores the initial and final character positions of a string selected from within the document body, but does not modify the document before reaching the renderer step. [sent-38, score-0.354]
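To make this division of labour concrete, a minimal sketch of the interface such submodules might share is given below; the class and function names (Submodule, analyze, run_pipeline) are hypothetical illustrations, not names from the released code.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, HTML tag to apply later)

class Submodule:
    """Analysis submodules read the document and report spans; they never edit it."""
    def analyze(self, text: str) -> List[Span]:
        raise NotImplementedError

def run_pipeline(text: str, submodules: List[Submodule]) -> List[Span]:
    spans: List[Span] = []
    for module in submodules:
        spans.extend(module.analyze(text))  # each pass is read-only
    return sorted(spans)                    # handed to the renderer afterwards
```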
16 2 Variable-Length Phrase Finder The variable-length phrase finder module features a Python implementation of the Variable-Length Phrase Finding (VLPF) Algorithm by Kim and Chan (2004). [sent-40, score-0.283]
17 Kim and Chan’s algorithm was chosen for its domain independence and adaptability, as it can be fine-tuned to use different correlation functions. [sent-41, score-0.098]
18 This algorithm computes the conditional probability for the relative importance of variable-length n-gram phrases from the source document alone. [sent-44, score-0.229]
19 It begins by considering every word a phrase with a length of one. [sent-45, score-0.043]
20 That is, every candidate phrase of length m, P{m}, is formed as P{m-1}w, where w is an adjacent following word. [sent-47, score-0.043]
21 Correlation is calculated between the leading phrase P{m-1} and the trailing word w. [sent-48, score-0.091]
22 Phrases that maintain a high level of correlation are created by appending the trailing word w, and those with a correlation score below a certain threshold are pruned before the next iteration. [sent-49, score-0.244]
23 This continues until no more phrases can be created. [sent-50, score-0.1]
24 The VLPF algorithm is able to use any of several existing correlation functions. [sent-52, score-0.098]
25 We have implemented the Piatetsky-Shapiro correlation function, the simplest of the three best-performing functions used by Kim and Chan, which achieved a correlation of 92. [sent-53, score-0.196]
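As a rough illustration of the approach, the sketch below builds phrases iteratively and scores each extension with the Piatetsky-Shapiro correlation, P(AB) - P(A)P(B); the pruning threshold, maximum length, and function name are our own assumptions rather than details of the released implementation.

```python
from collections import Counter

def find_variable_length_phrases(tokens, threshold=1e-4, max_len=6):
    """Score variable-length phrases with the Piatetsky-Shapiro correlation (a sketch)."""
    n = len(tokens)
    unigram_counts = Counter(tokens)
    surviving = {(w,): c for w, c in unigram_counts.items()}  # every word is a length-1 phrase
    scored = {}
    for length in range(2, max_len + 1):
        extended = Counter()
        for i in range(n - length + 1):
            lead = tuple(tokens[i:i + length - 1])
            if lead in surviving:                        # only extend surviving phrases
                extended[lead + (tokens[i + length - 1],)] += 1
        next_surviving = {}
        for phrase, count in extended.items():
            lead, trail = phrase[:-1], phrase[-1]
            # Piatetsky-Shapiro: P(lead, trail) - P(lead) * P(trail)
            corr = count / n - (surviving[lead] / n) * (unigram_counts[trail] / n)
            if corr >= threshold:                        # prune weak extensions
                next_surviving[phrase] = count
                scored[phrase] = corr
        if not next_surviving:                           # no more phrases can be created
            break
        surviving = next_surviving
    return scored
```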
26 We removed English stopwords, but did not perform any stemming when selecting relevant phrases because the selection of VLPs did not depend on global term co-occurrence, and we did not want to modify selected exact phrases. [sent-55, score-0.135]
27 We emphasize the top 15% of meaningful phrases (as determined by the algorithm) for the entire document. [sent-56, score-0.144]
28 This threshold was chosen because it gave results comparable to Kay and Terry’s (2010) example document. [sent-57, score-0.129]
29 The phrase selected as the most relevant is also reproduced in the pull quote at the top of the document, as shown in Figure 3. [sent-58, score-0.252]
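A hypothetical selection step along these lines would rank the scored phrases, keep the top 15%, and promote the single best phrase to the pull quote; the function name and interface are ours, not the system's.

```python
def select_emphasis(scored_phrases, fraction=0.15):
    """Keep the top fraction of phrases by score; the best one becomes the pull quote."""
    ranked = sorted(scored_phrases, key=scored_phrases.get, reverse=True)
    keep = ranked[: max(1, int(len(ranked) * fraction))]
    pull_quote = keep[0] if keep else None
    return keep, pull_quote
```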
30 3 Contact Information Extractor The contact information extractor module uses regular expressions to match URLs, email addresses, or phone numbers within the document text. [sent-60, score-0.522]
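The patterns below give the flavour of such a module; they are deliberately loose illustrations and not the exact expressions used in ConsentCanvas.

```python
import re

# Simplified, illustrative patterns for contact information.
URL_RE   = re.compile(r'https?://\S+|www\.\S+')
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def find_contact_spans(text):
    """Return (start, end) character offsets of URLs, email addresses, and phone numbers."""
    spans = []
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE):
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end()))
    return sorted(spans)
```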
31 4 Segmenter The segmenter module uses Hearst’s (1997) TextTiling algorithm to “segment text into multi-paragraph subtopic passages”. [sent-63, score-0.234]
32 ConsentCanvas uses the NLTK implementation of the TextTiling algorithm. [sent-65, score-0.04]
33 Segmentation was not applied to the entire document, as doing so resulted in a messy layout inconsistent with the structure imposed by headers and titles. [sent-66, score-0.204]
34 Instead, we used it to identify the lead paragraph of the document, which was rendered differently using the “lead paragraph” container in the template. [sent-67, score-0.067]
35 Future versions will use a more modern segmenting algorithm. [sent-68, score-0.03]
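In outline, the lead-paragraph step might look like the following; this sketch assumes the input contains blank-line paragraph breaks (which TextTiling requires) and that the NLTK stopword list has been downloaded.

```python
from nltk.tokenize import TextTilingTokenizer  # requires nltk.download('stopwords')

def lead_paragraph_span(text):
    """Return the (start, end) character offsets of the first TextTiling segment (a sketch)."""
    segments = TextTilingTokenizer().tokenize(text)  # default parameters
    lead = segments[0].strip()
    start = text.find(lead)
    return start, start + len(lead)
```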
36 5 Header Extractor The header extractor module uses regular expressions to match any section header-like text from the original document. [sent-70, score-0.498]
37 Several different search strings were used to catch multiple potential header types, including but not limited to: ALL-CAPS tokens, multi-level numbered headers, numbered headers, and eight or fewer tokens separated by a line break. [sent-71, score-0.201] [sent-73, score-0.238]
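Regular expressions along the following lines can match the header styles listed above; the exact patterns are simplified guesses of ours rather than the ones shipped with the system.

```python
import re

# Illustrative header patterns: ALL-CAPS lines, (multi-level) numbered headers,
# and short lines of eight or fewer tokens.
HEADER_PATTERNS = [
    re.compile(r'^[A-Z][A-Z0-9 .,&-]+$', re.M),
    re.compile(r'^\d+(?:\.\d+)*\.?[ \t]+\S.*$', re.M),
    re.compile(r'^(?:\S+[ \t]+){0,7}\S+$', re.M),
]

def find_header_spans(text):
    """Return sorted (start, end) offsets of lines that look like section headers."""
    spans = set()
    for pattern in HEADER_PATTERNS:
        for m in pattern.finditer(text):
            spans.add((m.start(), m.end()))
    return sorted(spans)
```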
39 6 Rendering Each analysis submodule produces a list of character positions where found items begin and end. [sent-76, score-0.096]
40 These are passed to our rendering system, which inserts the corresponding HTML5 tags at those positions in the original plaintext EULA. [sent-77, score-0.154]
41 We append a header to the output document to include the linked stylesheet per HTML5 specifications. [sent-78, score-0.433]
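Applying stored offsets without corrupting later ones is easiest if the spans are inserted from the end of the document backwards; the sketch below assumes (start, end, tag) triples and a placeholder stylesheet name.

```python
def render(text, spans, stylesheet="textured.css"):
    """Wrap each (start, end, tag) span in an HTML5 tag and link the stylesheet (a sketch)."""
    for start, end, tag in sorted(spans, reverse=True):  # back-to-front keeps offsets valid
        text = text[:start] + f"<{tag}>" + text[start:end] + f"</{tag}>" + text[end:]
    head = ('<!DOCTYPE html>\n<html><head>'
            f'<link rel="stylesheet" href="{stylesheet}"></head><body>')
    return head + text + "</body></html>"
```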
42 3 Analysis & Results We conducted a brief qualitative analysis on ConsentCanvas after implementation and debugging. [sent-79, score-0.067]
43 However, the problem space and system are not yet ready for formal verification or experimentation. [sent-80, score-0.032]
44 More exploration and refinement are required before we will be able to empirically determine if we have improved readability and comprehension. [sent-81, score-0.077]
45 1 Corpus We conducted our analysis on a small sample of EULAs from the same collection used by Lavesson et al. [sent-83, score-0.027]
46 In several of the best examples of texturized EULAs, security concerns were highlighted; in the texturized version of one document, the pull quote was “on media, ICONIX, Inc. [sent-89, score-0.409]
47 warrants that such media is free from defects in materials and workmanship under normal use for a period of ninety (90) days from the date of purchase as evidenced by a copy of the receipt. [sent-90, score-0.093]
48 is free to use any ideas, concepts,” “(except one copy for backup purposes),” and “Inc. [sent-94, score-0.066]
49 ” Some phrases have incomplete words at the beginning and end; this is an artifact of a known but unfixed bug in the implementation, not a result of the algorithm. [sent-97, score-0.1]
50 Several short but frequent phrases were found to be VLPs, such as “Inc. [sent-99, score-0.1]
51 In short licenses consisting of only one to three paragraphs, sometimes no relevant VLPs were discovered. [sent-101, score-0.064]
52 There are also many phrases that should be highlighted but are not. [sent-102, score-0.132]
53 3 Preliminary System Evaluation We conducted an informal evaluation in which our system applied texture to 15 documents chosen from our corpus at random. [sent-104, score-0.192]
54 The pull quote text was nearly unintelligible in almost all cases, largely because it did not break cleanly at sentence boundaries. [sent-108, score-0.203]
55 We did not let this detract from our evaluation of the documents, because performance in this area was consistently (and charmingly) poor and did not affect the readability of the main document body. [sent-109, score-0.235]
56 1 Comparisons with Kay and Terry Kay and Terry (2010) make reference to “augmenting and embellishing” the document text, specifically without altering the original content. [sent-112, score-0.129]
57 However, their example document is written concisely in a user-friendly voice dissimilar to most formal EULAs found in the wild. [sent-113, score-0.161]
58 2 Handling Legal Language We had anticipated a considerable amount of difficult-to-understand legal language in the source document. [sent-116, score-0.148]
59 However, most documents were found to contain a number of high-frequency VLPs with both layperson-salient legal terminology and common clues to document structure. [sent-117, score-0.342]
60 The variable-length phrase-finding module currently incorporates only a single correlation function. [sent-120, score-0.261]
61 Machine learning techniques might also be used to classify phrases as relevant or not, leading to better-emphasized content. [sent-122, score-0.135]
62 In the example license designed by Kay and Terry (2010), there are one or two emphasized phrases in each section. [sent-124, score-0.263]
63 The phrases found by ConsentCanvas are often sporadic, clustering in some sections and absent from others. [sent-125, score-0.1]
64 As a result of this, readability suffers, and so we may need to look into possible stratification of VLPs. [sent-126, score-0.077]
65 This might also help with the difficulty of selecting meaningful phrases from multi-lingual documents, of which there are a few examples (a cursory look showed the results in French were comparable to those in English in a bilingual EULA in our corpus). [sent-127, score-0.173]
66 Contact information is currently emphasized in the same manner as salient phrases. [sent-128, score-0.085]
67 We plan to eventually embed hyperlinks for all URLs and email addresses found in the source document, as in Kay and Terry (2010). [sent-129, score-0.038]
68 The segmenter module uses the basic TextTiling algorithm with default parameters. [sent-130, score-0.186]
69 We plan to improve the header extractor by providing more sophisticated regular expressions; we found that a wide variety of header styles were used. [sent-133, score-0.539]
70 In particular, we plan to consider layouts that use digits, punctuation, or inconsistent capitalization in multiple instances in the document body. [sent-134, score-0.158]
71 There is currently no module that incorporates the “Warning” box from Kay and Terry (2010). [sent-135, score-0.12]
72 This module would be designed to select relevant multiline blocks of text by using techniques similar to the variable-length phrase finder or the segmenter. [sent-136, score-0.278]
73 This will enable customized texturing of EULAs and facilitate experimentation for understanding and evaluating gains in comprehension and readability. [sent-138, score-0.066]
74 Finally, we will conduct a formal user evaluation of ConsentCanvas. [sent-139, score-0.078]
75 5 Conclusion We have provided a description of the work in progress for ConsentCanvas, a system for automatically adding texture to EULAs to improve readability and comprehension. [sent-140, score-0.143]
76 Appendix The source code, our corpus, and a sample of converted documents are all available at: https://github. [sent-145, score-0.065]
77 Legal text summarization by exploration of the thematic structures and argumentative roles. [sent-149, score-0.056]
78 Stopping spyware at the gate: a user study of privacy, notice and spyware. [sent-163, score-0.112]
wordName wordTfidf (topN-words)
[('consentcanvas', 0.494), ('eula', 0.231), ('eulas', 0.231), ('kay', 0.226), ('header', 0.201), ('vlps', 0.198), ('terry', 0.19), ('legal', 0.148), ('consent', 0.145), ('iconix', 0.132), ('textured', 0.132), ('chan', 0.129), ('document', 0.129), ('module', 0.12), ('texttiling', 0.107), ('license', 0.106), ('security', 0.103), ('usable', 0.103), ('phrases', 0.1), ('pull', 0.099), ('vlpf', 0.099), ('correlation', 0.098), ('extractor', 0.097), ('privacy', 0.091), ('kim', 0.089), ('finder', 0.08), ('readability', 0.077), ('headers', 0.075), ('quote', 0.075), ('cranor', 0.066), ('farzindar', 0.066), ('kelley', 0.066), ('lavesson', 0.066), ('spyware', 0.066), ('stylesheet', 0.066), ('submodule', 0.066), ('texture', 0.066), ('texturing', 0.066), ('texturized', 0.066), ('segmenter', 0.066), ('documents', 0.065), ('agreements', 0.064), ('contact', 0.058), ('emphasized', 0.057), ('plaintext', 0.053), ('nltk', 0.048), ('subtopic', 0.048), ('trailing', 0.048), ('user', 0.046), ('friedman', 0.046), ('python', 0.046), ('meaningful', 0.044), ('burden', 0.044), ('numbered', 0.044), ('rendering', 0.044), ('phrase', 0.043), ('expressions', 0.04), ('regular', 0.04), ('implementation', 0.04), ('symposium', 0.039), ('markup', 0.039), ('email', 0.038), ('paragraph', 0.038), ('copy', 0.037), ('urls', 0.037), ('linked', 0.037), ('unstructured', 0.036), ('minimizing', 0.035), ('relevant', 0.035), ('hearst', 0.035), ('informal', 0.034), ('highlighted', 0.032), ('formal', 0.032), ('positions', 0.03), ('segmenting', 0.03), ('software', 0.029), ('summarization', 0.029), ('licenses', 0.029), ('phrasing', 0.029), ('automates', 0.029), ('backup', 0.029), ('container', 0.029), ('cursory', 0.029), ('detract', 0.029), ('garnett', 0.029), ('konstan', 0.029), ('layouts', 0.029), ('mall', 0.029), ('prevention', 0.029), ('unintelligible', 0.029), ('unmodified', 0.029), ('variablelength', 0.029), ('warning', 0.029), ('warrants', 0.029), ('informed', 0.028), ('salient', 0.028), ('conducted', 0.027), ('purchase', 0.027), ('inserts', 0.027), ('argumentative', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
Author: Oliver Schneider ; Alex Garnett
Abstract: We present ConsentCanvas, a system which structures and “texturizes” End-User License Agreement (EULA) documents to be more readable. The system aims to help users better understand the terms under which they are providing their informed consent. ConsentCanvas receives unstructured text documents as input and uses unsupervised natural language processing methods to embellish the source document using a linked stylesheet. Unlike similar usable security projects which employ summarization techniques, our system preserves the contents of the source document, minimizing the cognitive and legal burden for both the end user and the licensor. Our system does not require a corpus for training. 1
2 0.062874287 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
Author: Nitin Agarwal ; Ravi Shankar Reddy ; Kiran GVR ; Carolyn Penstein Rose
Abstract: In this demo, we present SciSumm, an interactive multi-document summarization system for scientific articles. The document collection to be summarized is a list of papers cited together within the same source article, otherwise known as a co-citation. At the heart of the approach is a topic based clustering of fragments extracted from each article based on queries generated from the context surrounding the co-cited list of papers. This analysis enables the generation of an overview of common themes from the co-cited papers that relate to the context in which the co-citation was found. SciSumm is currently built over the 2008 ACL Anthology, however the gen- eralizable nature of the summarization techniques and the extensible architecture makes it possible to use the system with other corpora where a citation network is available. Evaluation results on the same corpus demonstrate that our system performs better than an existing widely used multi-document summarization system (MEAD).
3 0.051258095 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
Author: William M. Darling ; Fei Song
Abstract: Statistical approaches to automatic text summarization based on term frequency continue to perform on par with more complex summarization methods. To compute useful frequency statistics, however, the semantically important words must be separated from the low-content function words. The standard approach of using an a priori stopword list tends to result in both undercoverage, where syntactical words are seen as semantically relevant, and overcoverage, where words related to content are ignored. We present a generative probabilistic modeling approach to building content distributions for use with statistical multi-document summarization where the syntax words are learned directly from the data with a Hidden Markov Model and are thereby deemphasized in the term frequency statistics. This approach is compared to both a stopword-list and POS-tagging approach and our method demonstrates improved coverage on the DUC 2006 and TAC 2010 datasets using the ROUGE metric.
4 0.05064436 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations
Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa
Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we de- velop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.
5 0.05043589 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
Author: Charles Greenbacker
Abstract: We propose a framework for generating an abstractive summary from a semantic model of a multimodal document. We discuss the type of model required, the means by which it can be constructed, how the content of the model is rated and selected, and the method of realizing novel sentences for the summary. To this end, we introduce a metric called information density used for gauging the importance of content obtained from text and graphical sources.
6 0.049174439 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
7 0.048113 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
8 0.046271019 248 acl-2011-Predicting Clicks in a Vocabulary Learning System
9 0.044622339 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
10 0.042783488 115 acl-2011-Engkoo: Mining the Web for Language Learning
11 0.041060161 263 acl-2011-Reordering Constraint Based on Document-Level Context
12 0.040742632 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
13 0.040477883 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
14 0.039077614 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
15 0.038986206 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
16 0.037729274 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
17 0.03760656 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
18 0.03749245 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
19 0.0369348 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
20 0.036735021 76 acl-2011-Comparative News Summarization Using Linear Programming
topicId topicWeight
[(0, 0.113), (1, 0.026), (2, -0.024), (3, 0.056), (4, -0.034), (5, 0.004), (6, -0.002), (7, 0.021), (8, 0.022), (9, -0.0), (10, -0.031), (11, 0.003), (12, -0.019), (13, -0.019), (14, -0.026), (15, -0.021), (16, 0.033), (17, 0.014), (18, 0.023), (19, -0.001), (20, 0.012), (21, -0.001), (22, -0.006), (23, -0.0), (24, 0.006), (25, 0.009), (26, 0.036), (27, 0.004), (28, 0.009), (29, -0.05), (30, 0.017), (31, 0.039), (32, 0.024), (33, 0.009), (34, -0.048), (35, -0.005), (36, 0.024), (37, -0.02), (38, 0.024), (39, 0.099), (40, -0.029), (41, 0.07), (42, 0.048), (43, -0.007), (44, 0.056), (45, -0.042), (46, -0.013), (47, 0.016), (48, 0.023), (49, 0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.92514217 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
Author: Oliver Schneider ; Alex Garnett
Abstract: We present ConsentCanvas, a system which structures and “texturizes” End-User License Agreement (EULA) documents to be more readable. The system aims to help users better understand the terms under which they are providing their informed consent. ConsentCanvas receives unstructured text documents as input and uses unsupervised natural language processing methods to embellish the source document using a linked stylesheet. Unlike similar usable security projects which employ summarization techniques, our system preserves the contents of the source document, minimizing the cognitive and legal burden for both the end user and the licensor. Our system does not require a corpus for training. 1
2 0.65113348 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
Author: Charles Greenbacker
Abstract: We propose a framework for generating an abstractive summary from a semantic model of a multimodal document. We discuss the type of model required, the means by which it can be constructed, how the content of the model is rated and selected, and the method of realizing novel sentences for the summary. To this end, we introduce a metric called information density used for gauging the importance of content obtained from text and graphical sources.
3 0.6416871 248 acl-2011-Predicting Clicks in a Vocabulary Learning System
Author: Aaron Michelony
Abstract: We consider the problem of predicting which words a student will click in a vocabulary learning system. Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. Highlighting words likely to be unknown to a readeris attractive due to drawing his orher attention to it and indicating that information is available. However, this option is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. Furthurmore, it is never on a per-user basis. This paper presents an automated way of highlighting words likely to be unknown to the specific user. We present related work in search engine ranking, a description of the study used to collect click data, the experiment we performed using the random forest machine learning algorithm and finish with a discussion of future work.
4 0.62756091 125 acl-2011-Exploiting Readymades in Linguistic Creativity: A System Demonstration of the Jigsaw Bard
Author: Tony Veale ; Yanfen Hao
Abstract: Large lexical resources, such as corpora and databases of Web ngrams, are a rich source of pre-fabricated phrases that can be reused in many different contexts. However, one must be careful in how these resources are used, and noted writers such as George Orwell have argued that the use of canned phrases encourages sloppy thinking and results in poor communication. Nonetheless, while Orwell prized home-made phrases over the readymade variety, there is a vibrant movement in modern art which shifts artistic creation from the production of novel artifacts to the clever reuse of readymades or objets trouvés. We describe here a system that makes creative reuse of the linguistic readymades in the Google ngrams. Our system, the Jigsaw Bard, thus owes more to Marcel Duchamp than to George Orwell. We demonstrate how textual readymades can be identified and harvested on a large scale, and used to drive a modest form of linguistic creativity. 1
5 0.61353886 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
Author: Svetlana Kiritchenko ; Colin Cherry
Abstract: The automatic coding of clinical documents is an important task for today’s healthcare providers. Though it can be viewed as multi-label document classification, the coding problem has the interesting property that most code assignments can be supported by a single phrase found in the input document. We propose a Lexically-Triggered Hidden Markov Model (LT-HMM) that leverages these phrases to improve coding accuracy. The LT-HMM works in two stages: first, a lexical match is performed against a term dictionary to collect a set of candidate codes for a document. Next, a discriminative HMM selects the best subset of codes to assign to the document by tagging candidates as present or absent. By confirming codes proposed by a dictionary, the LT-HMM can share features across codes, enabling strong performance even on rare codes. In fact, we are able to recover codes that do not occur in the training set at all. Our approach achieves the best ever performance on the 2007 Medical NLP Challenge test set, with an F-measure of 89.84.
6 0.60633165 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
7 0.60613441 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution
8 0.60117739 115 acl-2011-Engkoo: Mining the Web for Language Learning
9 0.59739631 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
10 0.58620715 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
11 0.58495998 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations
12 0.57320708 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
13 0.56672144 89 acl-2011-Creative Language Retrieval: A Robust Hybrid of Information Retrieval and Linguistic Creativity
14 0.56595069 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
15 0.55738729 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components
16 0.54810649 291 acl-2011-SystemT: A Declarative Information Extraction System
17 0.54629886 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction
18 0.54418039 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
19 0.5372529 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
20 0.52796936 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
topicId topicWeight
[(5, 0.021), (17, 0.037), (26, 0.02), (37, 0.04), (39, 0.024), (41, 0.05), (59, 0.033), (72, 0.014), (91, 0.547), (96, 0.129)]
simIndex simValue paperId paperTitle
same-paper 1 0.88999629 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
Author: Oliver Schneider ; Alex Garnett
Abstract: We present ConsentCanvas, a system which structures and “texturizes” End-User License Agreement (EULA) documents to be more readable. The system aims to help users better understand the terms under which they are providing their informed consent. ConsentCanvas receives unstructured text documents as input and uses unsupervised natural language processing methods to embellish the source document using a linked stylesheet. Unlike similar usable security projects which employ summarization techniques, our system preserves the contents of the source document, minimizing the cognitive and legal burden for both the end user and the licensor. Our system does not require a corpus for training. 1
2 0.8583498 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
Author: Daniel Hewlett ; Paul Cohen
Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
3 0.84820414 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping
Author: Tetsuo Kiso ; Masashi Shimbo ; Mamoru Komachi ; Yuji Matsumoto
Abstract: In bootstrapping (seed set expansion), selecting good seeds and creating stop lists are two effective ways to reduce semantic drift, but these methods generally need human supervision. In this paper, we propose a graphbased approach to helping editors choose effective seeds and stop list instances, applicable to Pantel and Pennacchiotti’s Espresso bootstrapping algorithm. The idea is to select seeds and create a stop list using the rankings of instances and patterns computed by Kleinberg’s HITS algorithm. Experimental results on a variation of the lexical sample task show the effectiveness of our method.
4 0.81431293 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
Author: Dan Goldwasser ; Roi Reichart ; James Clarke ; Dan Roth
Abstract: Current approaches for semantic parsing take a supervised approach requiring a considerable amount of training data which is expensive and difficult to obtain. This supervision bottleneck is one of the major difficulties in scaling up semantic parsing. We argue that a semantic parser can be trained effectively without annotated data, and introduce an unsupervised learning algorithm. The algorithm takes a self training approach driven by confidence estimation. Evaluated over Geoquery, a standard dataset for this task, our system achieved 66% accuracy, compared to 80% of its fully supervised counterpart, demonstrating the promise of unsupervised approaches for this task.
5 0.75826991 313 acl-2011-Two Easy Improvements to Lexical Weighting
Author: David Chiang ; Steve DeNeefe ; Michael Pust
Abstract: We introduce two simple improvements to the lexical weighting features of Koehn, Och, and Marcu (2003) for machine translation: one which smooths the probability of translating word f to word e by simplifying English morphology, and one which conditions it on the kind of training data that f and e co-occurred in. These new variations lead to improvements of up to +0.8 BLEU, with an average improvement of +0.6 BLEU across two language pairs, two genres, and two translation systems.
6 0.64912349 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
7 0.5531674 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
8 0.53634346 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons
9 0.53460824 239 acl-2011-P11-5002 k2opt.pdf
10 0.53020549 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
11 0.49061716 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD
12 0.49026591 177 acl-2011-Interactive Group Suggesting for Twitter
13 0.4844622 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
14 0.48389101 200 acl-2011-Learning Dependency-Based Compositional Semantics
15 0.46339357 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
16 0.45662871 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
17 0.44113904 117 acl-2011-Entity Set Expansion using Topic information
18 0.4407025 74 acl-2011-Combining Indicators of Allophony
19 0.43911397 174 acl-2011-Insights from Network Structure for Text Mining