acl acl2011 acl2011-74 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Luc Boruta
Abstract: Allophonic rules are responsible for the great variety in phoneme realizations. Infants cannot reliably infer abstract word representations without knowledge of their native allophonic grammar. We explore the hypothesis that some properties of infants’ input, referred to as indicators, are correlated with allophony. First, we provide an extensive evaluation of individual indicators that rely on distributional or lexical information. Then, we present a first evaluation of the combination of indicators of different types, considering both logical and numerical combination schemes. Though distributional and lexical indicators are not redundant, straightforward combinations do not outperform individual indicators.
Reference: text
sentIndex sentText sentNum sentScore
1 Paris Diderot, Sorbonne Paris Cité, ALPAGE, UMR-I 001 INRIA, F-75205, Paris, France; LSCP, Département d’Études Cognitives, École Normale Supérieure, F-75005, Paris, France luc . [sent-2, score-0.06]
2 Abstract Allophonic rules are responsible for the great variety in phoneme realizations. [sent-4, score-0.059]
3 Infants cannot reliably infer abstract word representations without knowledge of their native allophonic grammar. [sent-5, score-0.599]
4 First, we provide an extensive evaluation of individual indicators that rely on distributional or lexical information. [sent-7, score-0.57]
5 Then, we present a first evaluation of the combination of indicators of different types, considering both logical and numerical combination schemes. [sent-8, score-0.579]
6 Though distributional and lexical indicators are not redundant, straightforward combinations do not outperform individual indicators. [sent-9, score-0.602]
7 1 Introduction Though the phonemic inventory of a language is typically small, phonetic and phonological processes yield manifold variants1 for each phoneme. [sent-10, score-0.093]
8 Words too are affected by this variability, yielding different realizations for a given underlying form. [sent-11, score-0.057]
9 Allophonic rules relate phonemes to their variants, expressing the contexts in which the latter occur. [sent-12, score-0.062]
10 We are interested in describing procedures by which infants, learning their native allophonic grammar, could reduce the variation and recover words. [sent-13, score-0.599]
11 Combining insights from both computational and behavioral studies, we endorse the hypothesis that infants are good distributional learners (Maye et al. [sent-14, score-0.228]
12 1We use allophony as an umbrella term for the continuum ranging from typical allophones to mere coarticulatory variants. [sent-17, score-0.209]
13 We seek to identify which features of infants’ input are most reliable for learning allophonic rules. [sent-18, score-0.599]
14 A few indicators, based on distributional (Peperkamp et al. [sent-19, score-0.108]
15 the question of whether or not these indicators capture different aspects of allophony and, if so, which combination scheme yields better results. [sent-24, score-0.581]
16 We present an extensive evaluation of individual indicators and, based on theoretical and empirical desiderata, we outline a more comprehensive framework to model the acquisition of allophonic rules. [sent-25, score-1.074]
17 2 Indicators of allophony We build upon Peperkamp et al. [sent-26, score-0.103]
18 In line with previous studies, we assume that infants are able to segment the continuous stream of acoustic input into a sequence of discrete segments, and that they quantize each of these segments into one of a finite number of phonetic categories. [sent-32, score-0.334]
19 However, the larger the set of phonetic categories, the closer we get to recent ‘single-stage’ approaches (e. [sent-34, score-0.06]
20 2See also the work of Dautriche (2009) on acoustic indicators of allophony, albeit using adult-directed speech. [sent-38, score-0.454]
21 1 Distributional indicators Complementary distribution is a ubiquitous criterion for the discovery of phonemes. [sent-42, score-0.414]
22 If two segments occur in mutually exclusive contexts, the two may be realizations of the same phoneme. [sent-43, score-0.146]
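To make the distributional criterion concrete, the sketch below scores a pair of phones by the Jensen-Shannon divergence between the distributions of the contexts in which they occur; near-disjoint context distributions (high divergence) are what complementary distribution predicts for allophones. This is only an illustration: the context definition (the immediately following segment) and the corpus representation are assumptions of this sketch, not necessarily those of Peperkamp et al. or of the KL, JS and BC indicators evaluated below.

```python
# Illustrative sketch only: a distributional indicator based on the contexts
# in which two phones occur. Corpus format assumed: a list of utterances,
# each a list of phone symbols; the "context" of a phone is assumed to be
# the immediately following phone.
import math
from collections import Counter

def context_distribution(corpus, phone):
    """Relative frequencies of the phones immediately following `phone`."""
    counts = Counter()
    for utterance in corpus:
        for i, segment in enumerate(utterance[:-1]):
            if segment == phone:
                counts[utterance[i + 1]] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in set(p) | set(q)}
    def kl(a):
        return sum(a[x] * math.log2(a[x] / m[x]) for x in a if a[x] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# High divergence (mutually exclusive contexts) is the pattern expected for
# two realizations of the same phoneme.
```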
23 2 Lexical indicators Adjacent segments can condition the realization of a word’s initial and final phonemes. [sent-49, score-0.503]
24 If two words only differ by their initial or final segments, these segments may be realizations of the same phoneme. [sent-50, score-0.146]
25 Instantiating the general concept of functional load (Hockett, 1955), lexical indicators gauge the degree of contrast in the lexicon between two segments. [sent-51, score-0.508]
26 (submitted) defined a Boolean-valued indicator, FL, satisfied by a single pair of minimally different words. [sent-53, score-0.06]
27 HFL accounts for the fraction of information content, represented by the language’s word entropy, that is lost when the opposition between two segments is neutralized. [sent-59, score-0.089]
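As an illustration of how such lexical indicators could be computed, the sketch below implements a Boolean FL (is there at least one pair of words differing only in their final segment, cf. footnote 4?) and an entropy-based HFL (relative loss in word entropy when the two segments are merged). The lexicon format, words as phone tuples mapped to relative frequencies, is an assumption; this is not the authors’ implementation, and how the scores are oriented (low functional load being evidence for allophony) is left to the evaluation.

```python
# Illustrative sketch only: lexical indicators over an assumed lexicon that
# maps words (tuples of phones) to relative frequencies.
import math

def fl(lexicon, x, y):
    """Boolean functional load: True iff some word pair differs only by
    having x vs. y as its final segment (word-final, cf. footnote 4)."""
    forms = set(lexicon)
    return any(w != v and w[:-1] == v[:-1] and {w[-1], v[-1]} == {x, y}
               for w in forms for v in forms)

def entropy(distribution):
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

def hfl(lexicon, x, y):
    """Fraction of the word entropy lost when the x/y opposition is
    neutralized (every y rewritten as x)."""
    merged = {}
    for word, p in lexicon.items():
        collapsed = tuple(x if segment == y else segment for segment in word)
        merged[collapsed] = merged.get(collapsed, 0.0) + p
    h = entropy(lexicon)
    return (h - entropy(merged)) / h if h else 0.0
```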
28 They created a range of possible inputs, applying artificial allophonic grammars4 of different sizes (Boruta, 2011) to the now-standard CHILDES ‘Brent/Ratner’ corpus of English (Brent and Cartwright, 1996). [sent-65, score-0.599]
29 We quantify the amount of variation in a corpus by its allophonic complexity, i. [sent-66, score-0.599]
30 the ratio of the number of phones to the number of phonemes in the language. [sent-68, score-0.062]
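(A worked example with hypothetical numbers: a grammar realizing an inventory of 50 phonemes as 450 distinct phones yields an allophonic complexity of 450/50 = 9 allophones per phoneme, the ‘medium’ setting referred to below; the actual inventory sizes are those of Boruta, 2011.)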
31 Lexical indicators require an ancillary procedure yielding a lexicon. [sent-69, score-0.414]
32 Venkataraman’s (2001) incremental model, using the unsegmented phonetic corpora as the input. [sent-73, score-0.06]
33 Though, obviously, infants cannot access it, we use the lexicon derived from the CHILDES orthographic transcripts for reference. [sent-74, score-0.166]
34 4 Indicators’ discriminant power As the aforementioned indicators have been evaluated using various languages, allophonic grammars and measures, we present a unified evaluation, conducted using Sing et al. [sent-75, score-1.057]
35 1 Evaluation Non-Boolean indicators require a threshold at and above which pairs are classified as allophonic. [sent-78, score-0.437]
36 We evaluate indicators across all possible discrimination thresholds, reporting the area under the ROC curve (henceforth AUC). [sent-79, score-0.472]
37 ’s ρ, values lie in [0, 1] and are equal to the probability that a randomly drawn allophonic pair will score higher than a randomly drawn non-allophonic pair. [sent-81, score-0.624]
38 Moreover, we evaluate indicators’ misclassifications at the discrimination threshold maximizing Matthews’ (1975) correlation coefficient: let α, β, γ and δ be, respectively, the number of false positives, false negatives, true positives and true negatives; MCC = (γδ − αβ) / √((α+γ)(β+γ)(α+δ)(β+δ)). [sent-83, score-0.352]
39 This coefficient is more appropriate than the accuracy or the F-measure. 4Because all allophonic rules implemented in the corpora are of the type p → a / c, FL and FL* only look for words minimally differing by their last segments. [sent-85, score-0.662]
40 5 Using this optimal, MCC-maximizing threshold, we report the maximal MCC and, as percentages, the accuracy (Acc), the false positive rate (FPR) and the false negative rate (FNR). [sent-87, score-0.18]
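The following sketch mirrors this evaluation protocol under assumed data structures, parallel lists of indicator scores and gold Boolean allophony labels; the paper itself relies on Sing et al.’s ROCR package rather than on code like this. AUC is computed directly as the probability that an allophonic pair outscores a non-allophonic one, and the MCC-maximizing threshold is found by an exhaustive sweep over observed score values.

```python
# Illustrative sketch only: threshold-free (AUC) and threshold-based (MCC,
# Acc, FPR, FNR) evaluation of a single indicator. Inputs assumed: `scores`,
# a list of floats, and `labels`, a parallel list of Booleans (True for
# allophonic pairs). Pairs scoring at or above the threshold are classified
# as allophonic.
import math

def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion(scores, labels, threshold):
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and not l)
    return tp, fp, fn, tn

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mcc_maximizing_report(scores, labels):
    """MCC, Acc, FPR and FNR at the MCC-maximizing threshold."""
    best = max(set(scores), key=lambda t: mcc(*confusion(scores, labels, t)))
    tp, fp, fn, tn = confusion(scores, labels, best)
    return {"threshold": best,
            "MCC": mcc(tp, fp, fn, tn),
            "Acc": (tp + tn) / len(scores),
            "FPR": fp / (fp + tn) if fp + tn else 0.0,
            "FNR": fn / (fn + tp) if fn + tp else 0.0}
```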
41 2 Results and discussion Indicators’ AUC corroborate previous results for distributional indicators: they perform almost identically and do not accommodate high allophonic complexities at which they perform below chance (Figure 1. [sent-89, score-0.757]
42 , every segment has an extremely narrow distribution and complementary distribution is the rule rather than the exception. [sent-91, score-0.06]
43 By contrast, all three lexical indicators are much more robust even if, as predicted, FL’s coarseness impedes its discriminant power (Figure 1. [sent-92, score-0.464]
44 Thus, misclassification scores are reported in Table 1 only at low (2 allophones/phoneme) and medium (9) complexities. [sent-101, score-0.101]
45 Previous observations are confirmed by MCC and accuracy values: though all indicators are positively correlated with the underlying allophonic relation, correlation is stronger for lexical indicators. [sent-102, score-1.038]
46 Surprisingly, zero FPR values are observed for some lexical indicators, meaning that they make no false alarms and, as a consequence, that all errors are caused by missed allophonic pairs. [sent-103, score-0.735]
47 5 Indicators’ redundancy None of the indicators we benchmarked in the previous section makes a perfect discrimination between allophonic and non-allophonic pairs of segments. [sent-104, score-1.126]
48 Besides, the computation of precision, recall and the F-measure does not take true negatives into account. [sent-106, score-0.071]
49 6These indicators perform similarly using the orthographic lexicon: we only report AUC for FL* (referred to as oFL*), as it gives the upper bound on lexical indicators’ performance. [sent-107, score-0.457]
50 Figure 1: Indicators’ AUC as a function of allophonic complexity (panel label: Lexical). [sent-110, score-0.599]
51 Table 1: Indicators’ performance at low and medium complexities, using the MCC-maximizing thresholds. [sent-160, score-0.071]
52 Italics indicate accuracies below that of a dummy indicator rejecting all pairs. [sent-162, score-0.129]
53 Yet, if some segment pairs are misclassified by one but not all (types of) indicators, a suitable combination should outperform individual indicators. [sent-163, score-0.127]
54 In other words, combining indicators may yield better results only if, individually, indicators capture different subsets of the underlying allophonic relation. [sent-164, score-1.427]
55 1 Evaluation To get a straightforward estimation of redundancy, we compute the Jaccard index between each indicator’s set of misclassified pairs: let D and L be sets containing, respectively, a distributional and a lexical indicator’s errors; J(D, L) = |D ∩ L| / |D ∪ L|. [sent-166, score-0.13]
56 To distinguish false positives from false negatives, we compute two Jaccard indices for each possible combination. [sent-168, score-0.279]
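A minimal sketch of this redundancy estimate, assuming the misclassified pairs of two indicators have already been split into false-positive and false-negative sets at their MCC-maximizing thresholds (Section 4):

```python
# Illustrative sketch only: per-error-type Jaccard indices between the
# mistakes of a distributional and a lexical indicator. Inputs assumed to be
# Python sets of misclassified segment pairs.
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def redundancy(dist_fp, dist_fn, lex_fp, lex_fn):
    """(J over false positives, J over false negatives)."""
    return jaccard(dist_fp, lex_fp), jaccard(dist_fn, lex_fn)
```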
57 2 Results and discussion Jaccard indices, reported in Table 2, emphasize the distinction between false positives and false negatives. [sent-170, score-0.242]
58 False negatives have rather high indices: most allophonic pairs that are not captured by distributional indicators are not captured either by lexical indicators, and vice versa. [sent-171, score-1.24]
59 By contrast, there is little or no redundancy in false positives, even at medium allophonic complexity: though random pairs can be incorrectly classified as allophonic, the error is unlikely to recur across all types of indicators. [sent-172, score-0.815]
60 It is also worth noting that though JS performs slightly better than KL and BC, the exact nature of the distributional indicator seems to have little influence on the performance of the combination. [sent-173, score-0.175]
61 6 Combining indicators As distributional and lexical indicators are not completely redundant, combining them is a natural extension. [sent-174, score-0.961]
62 However, not all conceivable combination schemes are appropriate for our task. [sent-175, score-0.061]
63 At the computational level, a combination scheme can be either disjunctive or conjunctive, i. [sent-177, score-0.064]
64 each indicator can be either sufficient or (only) necessary. [sent-179, score-0.067]
65 Aforementioned indicators were designed as necessary but not sufficient correlates of phonemehood. [sent-180, score-0.414]
66 For instance, while a phoneme’s allophones have complementary distributions, not all segments that have complementary distributions are allophones of a single phoneme. [sent-181, score-0.393]
67 Therefore, we favor a conjunctive scheme,7 even if this conflicts with the abovementioned results: most errors are due to missed allophonic pairs, but a conjunctive scheme, where every indicator must be satisfied, is likely to increase misses. [sent-182, score-0.778]
68 At the algorithmic level, a combination scheme can be either logical or numerical. [sent-183, score-0.119]
69 A logical scheme uses a logical connective to join indicators’ Boolean decisions, typically by conjunction according to our previous decision. [sent-184, score-0.163]
70 By contrast, a numerical scheme tries to approximate interactions between indicators’ values, merging them using any monotone increas- ing function; discrimination then relies on a single threshold. [sent-185, score-0.132]
71 In practical terms, we use multiplication as a numerical counterpart of conjunction. [sent-186, score-0.086]
72 Table 2: Indicators’ redundancy at low and medium allophonic complexities, estimated by the Jaccard indices between their false positives (FP) and false negatives (FN). [sent-231, score-1.052]
73 Figure 2: Indicators’ AUC as a function of allophonic complexity, for the multiplicative combination scheme. [sent-233, score-0.657]
74 Logical combinations require one discrimination threshold per combined indicator. [sent-235, score-0.09]
75 As it facilitates comparison with previous results, we report performance at the thresholds maximizing the MCC of individual indicators (rather than at the thresholds maximizing the combined MCC) . [sent-236, score-0.537]
76 Equal contribution of all indicators may or may not be a desirable property, but in the absence of a priori knowledge of indicators’ relative weights, each indicator’s values were standardized so that they lie in [0, 1], shifting the minimum to zero and rescaling by the range. [sent-238, score-0.439]
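A sketch of both schemes under these choices; the per-pair score dictionaries and per-indicator thresholds are assumed inputs, and this is an illustration rather than the authors’ code. The conjunction joins the individual Boolean decisions, while the numerical scheme multiplies min-max standardized scores and leaves a single threshold to be applied afterwards.

```python
# Illustrative sketch only: logical (conjunction) and numerical
# (multiplication) combination of indicators. Each indicator is assumed to
# be a dict mapping segment pairs to scores, all defined over the same pairs.
def standardize(scores):
    """Min-max scaling to [0, 1]: shift the minimum to zero, rescale by the
    range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {pair: (s - lo) / span for pair, s in scores.items()}

def conjunction(indicators, thresholds):
    """pair -> True iff every indicator classifies the pair as allophonic."""
    return {pair: all(ind[pair] >= t for ind, t in zip(indicators, thresholds))
            for pair in indicators[0]}

def multiplication(indicators):
    """pair -> product of standardized scores; a single discrimination
    threshold is then applied, as for individual non-Boolean indicators."""
    scaled = [standardize(ind) for ind in indicators]
    combined = {}
    for pair in indicators[0]:
        product = 1.0
        for ind in scaled:
            product *= ind[pair]
        combined[pair] = product
    return combined
```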
77 2 Results and discussion It is worth noting that, while the performance of combined indicators is still good (Table 3), it is less satisfactory than that of the best individual indicators. [sent-240, score-0.458]
78 Moreover, even if misclassification scores [Table 3 header: logical combination (conjunction) and numerical combination (multiplication), at 2 and 9 allophones/phoneme, each reporting MCC, Acc, FPR and FNR for KL·FL, JS·FL and BC·FL] [sent-241, score-0.095]
79 Table 3: Performance of combined distributional and lexical indicators, at low and medium allophonic complexity. [sent-385, score-0.803]
80 Italics indicate accuracies below that of a dummy indicator rejecting all pairs. [sent-387, score-0.129]
81 show that conjoined and multiplied indicators perform similarly, disparities emerge at medium allophonic complexity: while multiplication yields better MCC and FNR, conjunction yields better accuracy and FPR. [sent-388, score-1.149]
82 In that regard, observing FPR values of zero is quite satisfactory from the point of view of language acquisition, as processing two segments as realizations of a single phoneme (while they are not) may lead to the confusion of true minimal pairs of words. [sent-389, score-0.249]
83 Indeed, at a higher level, learning allophonic rules allows the infant to reduce the size of its emerging lexicon, factoring out allophonic realizations for each underlying word form. [sent-390, score-1.306]
84 Furthermore, AUC curves for the multiplicative scheme (Figure 2),8 most notably FL’s, suggest that distributional indicators’ contribution to the combinations is rather negative, except at very low allophonic complexities. [sent-391, score-0.793]
85 One explanation (yet to be tested experimentally) would be that they come into play later in the learning process, once part of allophony has been reduced using other indicators. [sent-392, score-0.103]
86 7 Conclusion We presented an evaluation of distributional and lexical indicators of allophony. [sent-393, score-0.547]
87 Although they all perform well at low allophonic complexities, misclassifications increase, more or less seriously, when the average number of allophones per phoneme increases. 8We do not report a threshold-free evaluation for the logical scheme. [sent-394, score-0.682]
88 Moreover, as the exact definition of the distributional indicator does not affect the results, we only plot combinations with JS. [sent-396, score-0.207]
90 We also presented a first evaluation of the combination of indicators, and found no significant difference between the two combination schemes we defined. [sent-398, score-0.095]
91 Unfortunately, none of the combinations we tested outperforms individual indicators. [sent-399, score-0.055]
92 For comparability with previous studies, we only considered combination schemes requiring no modification in the definition of the task; however, learning allophonic pairs becomes unnatural when phonemes can have more than two realizations. [sent-400, score-0.745]
93 Embedding each indicator’s segment-to-segment (dis)similarities in a multidimensional space, for example, would enable the use of clustering techniques where minimally distant points would be analyzed as allophones of a single phoneme. [sent-401, score-0.147]
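One way such a clustering could look is sketched below with hierarchical agglomerative clustering over a combined segment-to-segment dissimilarity matrix; the paper only suggests this extension, and the linkage method and distance cutoff here are assumptions for the sake of illustration.

```python
# Speculative sketch only: cluster segments from a precomputed pairwise
# dissimilarity matrix (e.g., derived from one or several indicators), so
# that minimally distant segments end up as allophones of one phoneme.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_segments(dissimilarity, cutoff):
    """dissimilarity: symmetric (n x n) array with zeros on the diagonal;
    returns one cluster label per segment."""
    condensed = squareform(np.asarray(dissimilarity, dtype=float), checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=cutoff, criterion="distance")
```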
94 Thus far, segments have been nothing but abstract symbols and, for example, the task at hand is as hard for [a] ∼ [a] as it is for [4] ∼ [k]. [sent-402, score-0.089]
95 However, not only do allophones of a given phoneme tend to be acoustically similar, but acoustic differences may be more salient and/or available earlier to the infant than complementary distributions or minimally differing words. [sent-403, score-0.27]
96 Therefore, the main extension towards a comprehensive model of the acquisition of allophonic rules would be to include acoustic indicators. [sent-404, score-0.677]
97 Testing the robustness of online word segmentation: effects of linguistic diversity and phonetic variation. [sent-409, score-0.06]
98 Modélisation des processus d’acquisition du langage par des méthodes statistiques. [sent-428, score-0.056]
99 Infant sensitivity to distributional information can affect phonetic discrimination. [sent-461, score-0.168]
100 The acquisition of allophonic rules: statistical learning with linguistic constraints. [sent-465, score-0.637]
wordName wordTfidf (topN-words)
[('allophonic', 0.599), ('indicators', 0.414), ('fl', 0.275), ('mcc', 0.205), ('hfl', 0.188), ('fpr', 0.154), ('fnr', 0.121), ('infants', 0.12), ('peperkamp', 0.12), ('distributional', 0.108), ('allophones', 0.106), ('bc', 0.103), ('allophony', 0.103), ('js', 0.093), ('false', 0.09), ('segments', 0.089), ('kl', 0.085), ('acc', 0.083), ('auc', 0.074), ('negatives', 0.071), ('medium', 0.071), ('boruta', 0.068), ('indicator', 0.067), ('positives', 0.062), ('phonemes', 0.062), ('phonetic', 0.06), ('jaccard', 0.06), ('luc', 0.06), ('phoneme', 0.059), ('discrimination', 0.058), ('realizations', 0.057), ('emmanuel', 0.056), ('logical', 0.055), ('calvez', 0.051), ('infant', 0.051), ('complexities', 0.05), ('numerical', 0.044), ('martin', 0.044), ('multiplication', 0.042), ('minimally', 0.041), ('acoustic', 0.04), ('paris', 0.04), ('acquisition', 0.038), ('indices', 0.037), ('complementary', 0.035), ('sharon', 0.035), ('coolen', 0.034), ('hockett', 0.034), ('maye', 0.034), ('rocr', 0.034), ('rozenn', 0.034), ('saffran', 0.034), ('conjunctive', 0.034), ('combination', 0.034), ('phonological', 0.033), ('redundancy', 0.032), ('dummy', 0.032), ('combinations', 0.032), ('fp', 0.03), ('childes', 0.03), ('dillon', 0.03), ('misclassification', 0.03), ('rejecting', 0.03), ('typewriter', 0.03), ('scheme', 0.03), ('boldface', 0.029), ('lexicon', 0.028), ('des', 0.028), ('misclassifications', 0.028), ('inria', 0.028), ('cognition', 0.028), ('schemes', 0.027), ('fn', 0.026), ('thresholds', 0.026), ('brent', 0.026), ('sing', 0.026), ('lie', 0.025), ('segment', 0.025), ('lexical', 0.025), ('discriminant', 0.025), ('crabb', 0.025), ('maximizing', 0.024), ('multiplicative', 0.024), ('beno', 0.024), ('individual', 0.023), ('submitted', 0.023), ('pairs', 0.023), ('conjunction', 0.023), ('distributions', 0.022), ('misclassified', 0.022), ('differing', 0.022), ('italics', 0.022), ('missed', 0.021), ('satisfactory', 0.021), ('functional', 0.021), ('load', 0.02), ('le', 0.019), ('satisfied', 0.019), ('aforementioned', 0.019), ('orthographic', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 74 acl-2011-Combining Indicators of Allophony
Author: Luc Boruta
Abstract: Allophonic rules are responsible for the great variety in phoneme realizations. Infants cannot reliably infer abstract word representations without knowledge of their native allophonic grammar. We explore the hypothesis that some properties of infants’ input, referred to as indicators, are correlated with allophony. First, we provide an extensive evaluation of individual indicators that rely on distributional or lexical information. Then, we present a first evaluation of the combination of indicators of different types, considering both logical and numerical combination schemes. Though distributional and lexical indicators are not redundant, straightforward combinations do not outperform individual indicators.
2 0.062590986 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
Author: Harr Chen ; Edward Benson ; Tahira Naseem ; Regina Barzilay
Abstract: We present a novel approach to discovering relations and their instantiations from a collection of documents in a single domain. Our approach learns relation types by exploiting meta-constraints that characterize the general qualities of a good relation in any domain. These constraints state that instances of a single relation should exhibit regularities at multiple levels of linguistic structure, including lexicography, syntax, and document-level context. We capture these regularities via the structure of our probabilistic model as well as a set of declaratively-specified constraints enforced during posterior inference. Across two domains our approach successfully recovers hidden relation structure, comparable to or outperforming previous state-of-the-art approaches. Furthermore, we find that a small , set of constraints is applicable across the domains, and that using domain-specific constraints can further improve performance. 1
3 0.051936321 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature
Author: Manaal Faruqui ; Sebastian Pado
Abstract: In contrast to many languages (like Russian or French), modern English does not distinguish formal and informal (“T/V”) address overtly, for example by pronoun choice. We describe an ongoing study which investigates to what degree the T/V distinction is recoverable in English text, and with what textual features it correlates. Our findings are: (a) human raters can label English utterances as T or V fairly well, given sufficient context; (b), lexical cues can predict T/V almost at human level.
4 0.034820065 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
Author: Emmanuel Prochasson ; Pascale Fung
Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.
5 0.033882104 109 acl-2011-Effective Measures of Domain Similarity for Parsing
Author: Barbara Plank ; Gertjan van Noord
Abstract: It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data to parse a given domain is data that matches the domain (Sekine, 1997; Gildea, 2001). Hence, an important task is to select appropriate domains. However, most previous work on domain adaptation relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective – it outperforms random data selection on both languages exam- ined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English. 1 Introduction and Motivation Previous research on domain adaptation has focused on the task of adapting a system trained on one domain, say newspaper text, to a particular new domain, say biomedical data. Usually, some amount of (labeled or unlabeled) data from the new domain was given which has been determined by a human. However, with the growth of the web, more and more data is becoming available, where each document “is potentially its own domain” (McClosky et al., 2010). It is not straightforward to determine – 1566 Gertjan van Noord University of Groningen The Netherlands G J M van Noord@ rug nl . . . . . which data or model (in case we have several source domain models) will perform best on a new (unknown) target domain. Therefore, an important issue that arises is how to measure domain similarity, i.e. whether we can find a simple yet effective method to determine which model or data is most beneficial for an arbitrary piece of new text. Moreover, if we had such a measure, a related question is whether it can tell us something more about what is actually meant by “domain”. So far, it was mostly arbitrarily used to refer to some kind of coherent unit (related to topic, style or genre), e.g.: newspaper text, biomedical abstracts, questions, fiction. Most previous work on domain adaptation, for instance Hara et al. (2005), McClosky et al. (2006), Blitzer et al. (2006), Daum e´ III (2007), sidestepped this problem of automatic domain selection and adaptation. For parsing, to our knowledge only one recent study has started to examine this issue (McClosky et al., 2010) we will discuss their approach in Section 2. Rather, an implicit assumption of all of these studies is that domains are given, i.e. that they are represented by the respective corpora. Thus, a corpus has been considered a homogeneous unit. As more data is becoming available, it is unlikely that – domains will be ‘given’ . Moreover, a given corpus might not always be as homogeneous as originally thought (Webber, 2009; Lippincott et al., 2010). For instance, recent work has shown that the well-known Penn Treebank (PT) Wall Street Journal (WSJ) actually contains a variety of genres, including letters, wit and short verse (Webber, 2009). In this study we take a different approach. Rather than viewing a given corpus as a monolithic entity, ProceedingPso orftla thned 4,9 Otrhe Agonnn,u Jauln Mee 1e9t-i2ng4, o 2f0 t1h1e. 
A ?c s 2o0ci1a1ti Aonss foocria Ctioomnp fourta Ctioomnaplu Ltaintigouniaslti Lcisn,g puaigsetsic 1s566–1576, we break it down to the article-level and disregard corpora boundaries. Given the resulting set of documents (articles), we evaluate various ways to automatically acquire related training data for a given test set, to find answers to the following questions: • Given a pool of data (a collection of articles fGriovmen nun ak pnooowln o domains) caonldle a test article, eiss there a way to automatically select data that is relevant for the new domain? If so: • Which similarity measure is good for parsing? • How does it compare to human-annotated data? • Is the measure also useful for other languages Iasnd th/oer mtaesakssu?r To this end, we evaluate measures of domain similarity and feature representations and their impact on dependency parsing accuracy. Given a collection of annotated articles, and a new article that we want to parse, we want to select the most similar articles to train the best parser for that new article. In the following, we will first compare automatic measures to human-annotated labels by examining parsing performance within subdomains of the Penn Treebank WSJ. Then, we extend the experiments to the domain adaptation scenario. Experiments were performed on two languages: English and Dutch. The empirical results show that a simple measure based on topic distributions is effective for both languages and works well also for Part-of-Speech tagging. As the approach is based on plain surfacelevel information (words) and it finds related data in a completely unsupervised fashion, it can be easily applied to other tasks or languages for which annotated (or automatically annotated) data is available. 2 Related Work The work most related to ours is McClosky et al. (2010). They try to find the best combination of source models to parse data from a new domain, which is related to Plank and Sima’an (2008). In the latter, unlabeled data was used to create several parsers by weighting trees in the WSJ according to their similarity to the subdomain. McClosky et al. (2010) coined the term multiple source domain adaptation. Inspired by work on parsing accuracy 1567 prediction (Ravi et al., 2008), they train a linear regression model to predict the best (linear interpolation) of source domain models. Similar to us, McClosky et al. (2010) regard a target domain as mixture of source domains, but they focus on phrasestructure parsing. Furthermore, our approach differs from theirs in two respects: we do not treat source corpora as one entity and try to mix models, but rather consider articles as base units and try to find subsets of related articles (the most similar articles); moreover, instead of creating a supervised model (in their case to predict parsing accuracy), our approach is ‘simplistic’ : we apply measures of domain simi- larity directly (in an unsupervised fashion), without the necessity to train a supervised model. Two other related studies are (Lippincott et al., 2010; Van Asch and Daelemans, 2010). Van Asch and Daelemans (2010) explore a measure of domain difference (Renyi divergence) between pairs of domains and its correlation to Part-of-Speech tagging accuracy. Their empirical results show a linear correlation between the measure and the performance loss. Their goal is different, but related: rather than finding related data for a new domain, they want to estimate the loss in accuracy of a PoS tagger when applied to a new domain. 
We will briefly discuss results obtained with the Renyi divergence in Section 5.1. Lippincott et al. (2010) examine subdomain variation in biomedicine corpora and propose awareness of NLP tools to such variation. However, they did not yet evaluate the effect on a practical task, thus our study is somewhat complementary to theirs. The issue of data selection has recently been examined for Language Modeling (Moore and Lewis, 2010). A subset of the available data is automatically selected as training data for a Language Model based on a scoring mechanism that compares cross- entropy scores. Their approach considerably outperformed random selection and two previous proposed approaches both based on perplexity scoring.1 3 Measures of Domain Similarity 3.1 Measuring Similarity Automatically Feature Representations A similarity function may be defined over any set of events that are con1We tested data selection by perplexity scoring, but found the Language Models too small to be useful in our setting. sidered to be relevant for the task at hand. For parsing, these might be words, characters, n-grams (of words or characters), Part-of-Speech (PoS) tags, bilexical dependencies, syntactic rules, etc. However, to obtain more abstract types such as PoS tags or dependency relations, one would first need to gather respective labels. The necessary tools for this are again trained on particular corpora, and will suffer from domain shifts, rendering labels noisy. Therefore, we want to gauge the effect of the simplest representation possible: plain surface characteristics (unlabeled text). This has the advantage that we do not need to rely on additional supervised tools; moreover, it is interesting to know how far we can get with this level of information only. We examine the following feature representations: relative frequencies of words, relative frequencies of character tetragrams, and topic models. Our motivation was as follows. Relative frequencies of words are a simple and effective representation used e.g. in text classification (Manning and Sch u¨tze, 1999), while character n-grams have proven successful in genre classification (Wu et al., 2010). Topic models (Blei et al., 2003; Steyvers and Griffiths, 2007) can be considered an advanced model over word distributions: every article is represented by a topic distribution, which in turn is a distribution over words. Similarity between documents can be measured by comparing topic distributions. Similarity Functions There are many possible similarity (or distance) functions. They fall broadly into two categories: probabilistically-motivated and geometrically-motivated functions. The similarity functions examined in this study will be described in the following. The Kullback-Leibler (KL) divergence D(q| |r) is a cTlahsesic Kaull measure oibfl ‘edri s(KtaLn)ce d’i2v ebregtweneceen D Dtw(oq probability distributions, and is defined as: D(q| |r) = Pyq(y)logrq((yy)). It is a non-negative, additive, aPsymmetric measure, and 0 iff the two distributions are identical. However, the KL-divergence is undefined if there exists an event y such that q(y) > 0 but r(y) = 0, which is a property that “makes it unsuitable for distributions derived via maximumlikelihood estimates” (Lee, 2001). 2It is not a proper distance metric since it is asymmetric. 1568 One option to overcome this limitation is to apply smoothing techniques to gather non-zero estimates for all y. 
The alternative, examined in this paper, is to consider an approximation to the KL divergence, such as the Jensen-Shannon (JS) divergence (Lin, 1991) and the skew divergence (Lee, 2001). The Jensen-Shannon divergence, which is symmetric, computes the KL-divergence between q, r, and the average between the two. We use the JS divergence as defined in Lee (2001): JS(q, r) = [D(q| |avg(q, r)) + D(r| |avg(q, r))] . The asymm[eDtr(icq |s|akvewg( divergence sα, proposed by Lee (2001), mixes one distribution with the other by a degree de- 21 fined by α ∈ [0, 1) : sα (q, r, α) = D(q| |αr + (1 α)q). Ays α α approaches 1, rt,hαe )sk =ew D divergence approximates the KL-divergence. An alternative way to measure similarity is to consider the distributions as vectors and apply geometrically-motivated distance functions. This family of similarity functions includes the cosine cos(q, r) = qq(y) · r(y)/ | |q(y) | | | |r(y) | |, euclidean − euc(q,r) = qPy(q(y) − r(y))2 and variational (also known asq LP1 or MPanhattan) distance function, defined as var(q, r) = Py |q(y) − r(y) |. 3.2 Human-annotatePd data In contrast to the automatic measures devised in the previous section, we might have access to human annotated data. That is, use label information such as topic or genre to define the set of similar articles. Genre For the Penn Treebank (PT) Wall Street Journal (WSJ) section, more specifically, the subset available in the Penn Discourse Treebank, there exists a partition of the data by genre (Webber, 2009). Every article is assigned one of the following genre labels: news, letters, highlights, essays, errata, wit and short verse, quarterly progress reports, notable and quotable. This classification has been made on the basis of meta-data (Webber, 2009). It is wellknown that there is no meta-data directly associated with the individual WSJ files in the Penn Treebank. However, meta-data can be obtained by looking at the articles in the ACL/DCI corpus (LDC99T42), and a mapping file that aligns document numbers of DCI (DOCNO) to WSJ keys (Webber, 2009). An example document is given in Figure 1. The metadata field HL contains headlines, SO source info, and the IN field includes topic markers.
6 0.033043783 333 acl-2011-Web-Scale Features for Full-Scale Parsing
7 0.032514568 200 acl-2011-Learning Dependency-Based Compositional Semantics
8 0.031081636 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
9 0.030218268 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
10 0.028857561 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
11 0.028791001 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser
12 0.028700553 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
13 0.027704436 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations
14 0.027655046 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
15 0.027542096 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices
16 0.027341731 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
17 0.027302222 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
18 0.0267017 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques
19 0.026631095 153 acl-2011-How do you pronounce your name? Improving G2P with transliterations
20 0.025949052 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
topicId topicWeight
[(0, 0.075), (1, 0.007), (2, -0.016), (3, 0.007), (4, -0.015), (5, 0.009), (6, 0.031), (7, 0.005), (8, -0.015), (9, 0.001), (10, -0.009), (11, 0.002), (12, -0.005), (13, 0.037), (14, 0.012), (15, -0.016), (16, -0.036), (17, -0.027), (18, -0.009), (19, 0.004), (20, 0.032), (21, -0.014), (22, -0.028), (23, -0.003), (24, -0.029), (25, 0.016), (26, 0.026), (27, 0.023), (28, 0.017), (29, -0.012), (30, -0.001), (31, 0.021), (32, 0.021), (33, 0.03), (34, 0.016), (35, 0.024), (36, -0.019), (37, 0.014), (38, 0.006), (39, -0.008), (40, 0.003), (41, 0.01), (42, -0.001), (43, -0.046), (44, 0.003), (45, -0.011), (46, -0.0), (47, 0.025), (48, -0.069), (49, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.89302868 74 acl-2011-Combining Indicators of Allophony
Author: Luc Boruta
Abstract: Allophonic rules are responsible for the great variety in phoneme realizations. Infants cannot reliably infer abstract word representations without knowledge of their native allophonic grammar. We explore the hypothesis that some properties of infants’ input, referred to as indicators, are correlated with allophony. First, we provide an extensive evaluation of individual indicators that rely on distributional or lexical information. Then, we present a first evaluation of the combination of indicators of different types, considering both logical and numerical combination schemes. Though distributional and lexical indicators are not redundant, straightforward combinations do not outperform individual indicators.
2 0.64011776 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes
Author: Youngjun Kim ; Ellen Riloff ; Stephane Meystre
Abstract: We present an NLP system that classifies the assertion type of medical problems in clinical notes used for the Fourth i2b2/VA Challenge. Our classifier uses a variety of linguistic features, including lexical, syntactic, lexicosyntactic, and contextual features. To overcome an extremely unbalanced distribution of assertion types in the data set, we focused our efforts on adding features specifically to improve the performance of minority classes. As a result, our system reached 94. 17% micro-averaged and 79.76% macro-averaged F1-measures, and showed substantial recall gains on the minority classes. 1
3 0.57895994 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
Author: Mariet Theune ; Ruud Koolen ; Emiel Krahmer ; Sander Wubben
Abstract: In this paper we investigate how much data is required to train an algorithm for attribute selection, a subtask of Referring Expressions Generation (REG). To enable comparison between different-sized training sets, a systematic training method was developed. The results show that depending on the complexity of the domain, training on 10 to 20 items may already lead to a good performance.
4 0.56920487 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
Author: Jacob Eisenstein ; Noah A. Smith ; Eric P. Xing
Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ‘1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.
5 0.5691334 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
Author: Elijah Mayfield ; Carolyn Penstein Rose
Abstract: We present a novel computational formulation of speaker authority in discourse. This notion, which focuses on how speakers position themselves relative to each other in discourse, is first developed into a reliable coding scheme (0.71 agreement between human annotators). We also provide a computational model for automatically annotating text using this coding scheme, using supervised learning enhanced by constraints implemented with Integer Linear Programming. We show that this constrained model’s analyses of speaker authority correlates very strongly with expert human judgments (r2 coefficient of 0.947).
6 0.55987227 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations
7 0.55811828 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
8 0.55465996 133 acl-2011-Extracting Social Power Relationships from Natural Language
9 0.55304754 297 acl-2011-That's What She Said: Double Entendre Identification
10 0.55291325 194 acl-2011-Language Use: What can it tell us?
11 0.54240102 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
12 0.53411913 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature
13 0.51930135 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications
14 0.51837516 321 acl-2011-Unsupervised Discovery of Rhyme Schemes
15 0.51716077 278 acl-2011-Semi-supervised condensed nearest neighbor for part-of-speech tagging
16 0.51170945 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
17 0.50988245 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
18 0.5098033 252 acl-2011-Prototyping virtual instructors from human-human corpora
19 0.50938851 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components
20 0.5060581 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
topicId topicWeight
[(1, 0.012), (5, 0.033), (15, 0.011), (17, 0.052), (26, 0.03), (37, 0.055), (39, 0.05), (41, 0.046), (55, 0.037), (59, 0.046), (62, 0.349), (72, 0.022), (91, 0.066), (96, 0.091)]
simIndex simValue paperId paperTitle
same-paper 1 0.7898162 74 acl-2011-Combining Indicators of Allophony
Author: Luc Boruta
Abstract: Allophonic rules are responsible for the great variety in phoneme realizations. Infants cannot reliably infer abstract word representations without knowledge of their native allophonic grammar. We explore the hypothesis that some properties of infants’ input, referred to as indicators, are correlated with allophony. First, we provide an extensive evaluation of individual indicators that rely on distributional or lexical information. Then, we present a first evaluation of the combination of indicators of different types, considering both logical and numerical combination schemes. Though distributional and lexical indicators are not redundant, straightforward combinations do not outperform individual indicators.
2 0.71334743 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
Author: Daniel Bar ; Nicolai Erbs ; Torsten Zesch ; Iryna Gurevych
Abstract: We present Wikulu1, a system focusing on supporting wiki users with their everyday tasks by means of an intelligent interface. Wikulu is implemented as an extensible architecture which transparently integrates natural language processing (NLP) techniques with wikis. It is designed to be deployed with any wiki platform, and the current prototype integrates a wide range of NLP algorithms such as keyphrase extraction, link discovery, text segmentation, summarization, or text similarity. Additionally, we show how Wikulu can be applied for visually analyzing the results of NLP algorithms, educational purposes, and enabling semantic wikis.
3 0.59817225 217 acl-2011-Machine Translation System Combination by Confusion Forest
Author: Taro Watanabe ; Eiichiro Sumita
Abstract: The state-of-the-art system combination method for machine translation (MT) is based on confusion networks constructed by aligning hypotheses with regard to word similarities. We introduce a novel system combination framework in which hypotheses are encoded as a confusion forest, a packed forest representing alternative trees. The forest is generated using syntactic consensus among parsed hypotheses: First, MT outputs are parsed. Second, a context free grammar is learned by extracting a set of rules that constitute the parse trees. Third, a packed forest is generated starting from the root symbol of the extracted grammar through non-terminal rewriting. The new hypothesis is produced by searching the best derivation in the forest. Experimental results on the WMT10 system combination shared task yield comparable performance to the conventional confusion network based method with smaller space.
4 0.55300087 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
Author: Chris Dyer ; Jonathan H. Clark ; Alon Lavie ; Noah A. Smith
Abstract: We introduce a discriminatively trained, globally normalized, log-linear variant of the lexical translation models proposed by Brown et al. (1993). In our model, arbitrary, nonindependent features may be freely incorporated, thereby overcoming the inherent limitation of generative models, which require that features be sensitive to the conditional independencies of the generative process. However, unlike previous work on discriminative modeling of word alignment (which also permits the use of arbitrary features), the parameters in our models are learned from unannotated parallel sentences, rather than from supervised word alignments. Using a variety of intrinsic and extrinsic measures, including translation performance, we show our model yields better alignments than generative baselines in a number of language pairs.
5 0.52779925 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews
Author: Jianxing Yu ; Zheng-Jun Zha ; Meng Wang ; Tat-Seng Chua
Abstract: In this paper, we dedicate to the topic of aspect ranking, which aims to automatically identify important product aspects from online consumer reviews. The important aspects are identified according to two observations: (a) the important aspects of a product are usually commented by a large number of consumers; and (b) consumers’ opinions on the important aspects greatly influence their overall opinions on the product. In particular, given consumer reviews of a product, we first identify the product aspects by a shallow dependency parser and determine consumers’ opinions on these aspects via a sentiment classifier. We then develop an aspect ranking algorithm to identify the important aspects by simultaneously considering the aspect frequency and the influence of consumers’ opinions given to each aspect on their overall opinions. The experimental results on 11 popular products in four domains demonstrate the effectiveness of our approach. We further apply the aspect ranking results to the application ofdocumentlevel sentiment classification, and improve the performance significantly.
6 0.41872254 121 acl-2011-Event Discovery in Social Media Feeds
7 0.40511006 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
8 0.40435356 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
9 0.39869061 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
10 0.39822 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
11 0.39686954 154 acl-2011-How to train your multi bottom-up tree transducer
12 0.3964498 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
13 0.39613855 30 acl-2011-Adjoining Tree-to-String Translation
14 0.39602947 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons
15 0.39518934 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
16 0.39509669 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
17 0.39468664 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
18 0.39453766 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
19 0.39348671 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
20 0.3933827 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization