acl acl2012 acl2012-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Max Whitney ; Anoop Sarkar
Abstract: Bootstrapping a classifier from a small set of seed rules can be viewed as the propagation of labels between examples via features shared between them. This paper introduces a novel variant of the Yarowsky algorithm based on this view. It is a bootstrapping learning method which uses a graph propagation algorithm with a well defined objective function. The experimental results show that our proposed bootstrapping algorithm achieves state of the art performance or better on several different natural language data sets.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Bootstrapping a classifier from a small set of seed rules can be viewed as the propagation of labels between examples via features shared between them. [sent-2, score-0.505]
2 It is a bootstrapping learning method which uses a graph propagation algorithm with a well defined objective function. [sent-4, score-0.684]
3 The experimental results show that our proposed bootstrapping algorithm achieves state of the art performance or better on several different natural language data sets. [sent-5, score-0.207]
4 1 Introduction In this paper, we are concerned with a case of semisupervised learning that is close to unsupervised learning, in that the labelled and unlabelled data points are from the same domain and only a small set of seed rules is used to derive the labelled points. [sent-6, score-0.614]
5 In contrast, typical semi-supervised learning deals with a large number of labelled points, and a domain adaptation task with unlabelled points from the new domain. [sent-8, score-0.254]
6 In this paper we focus on a self-training style bootstrapping algorithm, the Yarowsky algorithm (Yarowsky, 1995). [sent-10, score-0.207]
7 Variants of this algorithm have been formalized as optimizing an objective function in previous work by Abney (2004) and Haffari and Sarkar (2007), but it is not clear that any perform as well as the Yarowsky algorithm itself. [sent-11, score-0.269]
8 To our knowledge, this is the first theoretically motivated self-training bootstrapping algorithm which performs as well as the Yarowsky algorithm. [sent-19, score-0.237]
9 In the bootstrapping setting the learner is given an initial partial labelling where only a few examples are Y(0) Yx(0) = ⊥ for most x). [sent-24, score-0.414]
10 Abney (2004) defines three probability distributions in his analysis of bootstrapping: θfj is the parameter for feature f with label j, taken to be normalized so that θf is a distribution over labels. [sent-27, score-0.154]
11 φx is the labelling distribution representing the current Y ; it is a point distribution for labelled examples and uniform for unlabelled examples. [sent-28, score-0.635]
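To make Abney's distributions concrete, here is a minimal sketch (not from the paper) of θ and φ computed from a partial labelling; the label set, the smoothing constant eps, and the toy data are assumptions.

```python
from collections import defaultdict

LABELS = ["person", "organization"]   # hypothetical label set
BOTTOM = None                         # stands for the unlabelled value ⊥

def theta_from_labelling(examples, labelling, eps=0.1):
    """theta_fj: per-feature label scores from smoothed counts of labelled
    examples, normalized so that theta_f is a distribution over labels."""
    counts = defaultdict(lambda: defaultdict(float))
    for x, feats in examples.items():
        j = labelling.get(x, BOTTOM)
        if j is BOTTOM:
            continue
        for f in feats:
            counts[f][j] += 1.0
    theta = {}
    for f, c in counts.items():
        z = sum(c[j] + eps for j in LABELS)
        theta[f] = {j: (c[j] + eps) / z for j in LABELS}
    return theta

def phi_x(x, labelling):
    """phi_x: a point distribution for labelled x, uniform for unlabelled x."""
    j = labelling.get(x, BOTTOM)
    if j is BOTTOM:
        return {l: 1.0 / len(LABELS) for l in LABELS}
    return {l: (1.0 if l == j else 0.0) for l in LABELS}

# toy usage (assumed data): two examples, one seed-labelled
examples = {"x1": ["Mr.", "said"], "x2": ["Inc.", "said"]}
labelling = {"x1": "person"}
print(theta_from_labelling(examples, labelling))
print(phi_x("x2", labelling))
```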
12 Lafferty (2003)’s method of graph propagation is predominantly transductive, and the non-transductive version is closely related to Abney (2004). [sent-39, score-0.415]
13 The basic Yarowsky algorithm is shown in algorithm 1. [sent-46, score-0.176]
14 Note that at any point some training examples may be left unlabelled by Y(t) . [sent-47, score-0.189]
15 We add the seed DL to the new DL after applying the cautious pruning. [sent-66, score-0.324]
16 At the final iteration Yarowsky-cautious uses the current labelling to train a DL without a threshold or cautiousness, and this DL is used for testing. [sent-68, score-0.337]
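The following is a compact sketch of the cautious decision-list loop as described above; the helper names, the smoothed DL scoring, and the exact pruning details are assumptions rather than the authors' implementation. The key design point is that the rule set is rebuilt each iteration, keeping only the top n rules per label above the threshold plus the seed DL, with n growing over iterations.

```python
from collections import defaultdict

def train_dl(examples, labelling, labels, eps=0.1):
    """Train a decision list: a smoothed score for each (feature, label) rule."""
    counts = defaultdict(lambda: defaultdict(float))
    for x, feats in examples.items():
        j = labelling.get(x)
        if j is None:
            continue
        for f in feats:
            counts[f][j] += 1.0
    rules = {}
    for f, c in counts.items():
        z = sum(c[l] + eps for l in labels)
        for l in labels:
            rules[(f, l)] = (c[l] + eps) / z
    return rules

def label_with_dl(examples, rules, labels):
    """DL decision: label each example by its single highest-scoring matching rule."""
    out = {}
    for x, feats in examples.items():
        scored = [(rules[(f, j)], j) for f in feats for j in labels if (f, j) in rules]
        if scored:
            out[x] = max(scored)[1]
    return out

def yarowsky_cautious(examples, seed_rules, labels, thresh=0.95, n0=5, dn=5, iters=20):
    labelling = label_with_dl(examples, seed_rules, labels)
    n = n0
    for _ in range(iters):
        dl = train_dl(examples, labelling, labels)
        kept = {}
        for j in labels:   # cautious pruning: top n rules per label, above the threshold
            cand = sorted(((s, r) for r, s in dl.items()
                           if r[1] == j and s > thresh), reverse=True)[:n]
            kept.update({r: s for s, r in cand})
        kept.update(seed_rules)          # add the seed DL back after pruning
        labelling = label_with_dl(examples, kept, labels)
        n += dn                          # raise the cautiousness limit
    # final retraining: no threshold or cautiousness, used for testing
    return train_dl(examples, labelling, labels)
```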
17 At each step it trains a DL and then produces a new labelling for the other DL. [sent-73, score-0.24]
18 Each DL uses thresholding and cautiousness as we describe for Yarowsky-cautious. [sent-74, score-0.228]
19 At the end the DLs are combined, the result is used to label the data, and a retraining step is done from this single labelling. [sent-75, score-0.227]
20 Besides various changes in the specifics of how the labelling is produced, this algorithm has two differences versus Yarowsky. [sent-78, score-0.3]
21 Based on similar parts of DL-CoTrain we assume that the top n selection is per label rather than in total, that the thresholding value is unsmoothed, and that there is a retraining step. [sent-85, score-0.228]
22 Haffari and Sarkar (2007) provide an objective function for this algorithm using a generalized definition of cross-entropy in terms of Bregman distance, which motivates our objective in section 4. [sent-87, score-0.31]
23 It is the same as Yarowsky except that we use the sum definition when labelling: for example x we choose the label j with the highest (sum) πx(j), but set Yx = ⊥ if the sum is zero. [sent-94, score-0.192]
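A small sketch of this sum-based labelling rule, with the θ scores represented as a dictionary per feature (an assumed data layout):

```python
def label_by_sum(feats, theta, labels):
    """Yarowsky-sum labelling: score label j by the unnormalized sum of theta_fj
    over the example's features; return None (⊥) if the best sum is zero."""
    sums = {j: sum(theta.get(f, {}).get(j, 0.0) for f in feats) for j in labels}
    best = max(labels, key=lambda j: sums[j])
    return best if sums[best] > 0.0 else None
```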
24 6 Bipartite graph algorithms Haffari and Sarkar (2007) suggest a bipartite graph framework for semi-supervised learning based on their analysis of Y-1/DL-1-VS and objective (2). [sent-98, score-0.518]
25 The graph has vertices X ∪ F and edges {(x, f) : x ∈ X, f ∈ Fx}, as in the graph shown in figure 1(a). [sent-99, score-0.334]
26 Each vertex represents a distribution over labels, and in this view Yarowsky can be seen as alternately updating the example distributions based on the feature distributions and vice versa. [sent-100, score-0.158]
27 Each can be one of two choices: average(S) is the normalized average of the distributions of S, while majority(S) is a uniform distribution if all labels are supported by equal numbers of distributions of S, and otherwise a point distribution with mass on the best supported label. [sent-103, score-0.244]
28 In our implementation we label training data (for the convergence check) with the φ distributions from the graph. [sent-106, score-0.166]
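A sketch of the two update choices, average(S) and majority(S), for distributions stored as plain dictionaries; the tie handling in majority is our reading of the description and is an assumption.

```python
def average(dists, labels):
    """Normalized average of a set of label distributions."""
    tot = {j: sum(d[j] for d in dists) for j in labels}
    z = sum(tot.values()) or 1.0
    return {j: tot[j] / z for j in labels}

def majority(dists, labels):
    """Point distribution on the best-supported label; uniform when support is tied
    (any tie is treated as the 'all labels equally supported' case, an assumption)."""
    support = {j: sum(1 for d in dists if max(d, key=d.get) == j) for j in labels}
    top = max(support.values())
    winners = [j for j in labels if support[j] == top]
    if len(winners) > 1:
        return {j: 1.0 / len(labels) for j in labels}
    return {j: (1.0 if j == winners[0] else 0.0) for j in labels}
```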
29 Unlike the algorithms described above, it is for domain adaptation with large amounts of labelled data rather than bootstrapping with a small number of seeds. [sent-112, score-0.322]
30 This algorithm is structurally similar to Yarowsky in that it begins from an initial partial labelling and repeatedly trains a classifier on the labelling and then relabels the data. [sent-113, score-0.544]
31 It differs significantly from Yarowsky in two other ways: First, instead of only training a CRF it also uses a step of graph propagation between distributions over the n-grams in the data. [sent-116, score-0.491]
32 Second, it does the propagation on distributions over n-gram types rather than over n-gram tokens (instances in the data). [sent-117, score-0.316]
33 They argue that using propagation over types allows the algorithm to enforce constraints and find similarities that self-training cannot. [sent-118, score-0.325]
34 We are not concerned here with the details of this algorithm, but it motivates our work firstly in providing the graph propagation which we will describe in more detail in section 4, and secondly in providing an algorithmic structure that we use for our algorithm in section 5. [sent-119, score-0.501]
35 They do not specify tuning details, but to get comparable accuracy we found it was necessary to do smoothing and to include weights λ1 and λ2 on the expected counts of seed-labelled and initially unlabelled examples respectively (Nigam et al. [sent-123, score-0.298]
36 4 Graph propagation The graph propagation of Subramanya et al. [sent-125, score-0.621]
37 We propose four methods for using this propagation with Yarowsky. [sent-132, score-0.237]
38 The distributions and graph structures are shown in table 2. [sent-134, score-0.226]
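To make the propagation step concrete, here is a minimal sketch of a squared-loss graph propagation update in the spirit of Subramanya et al. (2010): each vertex distribution is pulled toward the weighted average of its neighbours plus a uniform distribution. The exact objective, the parameter names mu and nu, and their default values are assumptions and may differ from what the paper uses.

```python
def propagate(q, edges, labels, mu=1.0, nu=0.01, iters=10):
    """Jacobi-style updates pulling each vertex distribution toward the weighted
    average of its neighbours and a uniform distribution U (the nu term).
    q: {vertex: {label: prob}}; edges: {(u, v): weight}, treated as undirected."""
    U = 1.0 / len(labels)
    nbrs = {}
    for (u, v), w in edges.items():
        nbrs.setdefault(u, []).append((v, w))
        nbrs.setdefault(v, []).append((u, w))
    for _ in range(iters):
        new_q = {}
        for u in q:
            num = {j: nu * U for j in labels}
            den = nu
            for v, w in nbrs.get(u, []):
                for j in labels:
                    num[j] += mu * w * q[v][j]
                den += mu * w
            new_q[u] = {j: num[j] / den for j in labels}
        q = new_q
    return q
```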
39 The graph structure of φ-θ is the bipartite graph of Haffari and Sarkar (2007). [sent-137, score-0.37]
40 In fact, for φ-θ the propagation objective (3) and Haffari and Sarkar (2007)’s Y-1/DL-1-VS objective (2) are identical up to constant coefficients and an extra constant term. [sent-138, score-0.501]
41 Since wuv = wvu and Bt2(qu, qv) = Bt2(qv, qu), this can be folded into the constant. Third, after expanding (2) there is a term |Fx| inside the sum for Ht2(φx) which is not present in (3). [sent-142, score-0.197]
42 The bipartite graph of the first two methods differs from the structure used by Subramanya et al. [sent-150, score-0.223]
43 (2010) in that it does propagation between two different kinds of distributions instead of only one kind. [sent-151, score-0.237]
44 The θT-only method therefore uses the feature-only graph but for the distribution uses a type-level version of θ defined by θTfj = (1/|Xf|) Σx∈Xf πx(j). [sent-158, score-0.207]
45 5 Novel Yarowsky-prop algorithm We call our graph propagation based algorithm Yarowsky-prop. [sent-159, score-0.56]
46 It is shown with θT-only propagation in algorithm 3. [sent-160, score-0.325]
47 It is based on the Yarowsky algorithm, with the following changes: an added step to calculate θT (line 4), an added step to calculate θP (line 5), the use of θP rather than the DL to update the labelling (line 6), and the use of the sum definition of π. [sent-161, score-0.359]
48 This algorithm is adapted to the other propagation methods described in section 4 by changing the type of propagation on line 5. [sent-167, score-0.614]
49 1: let θfj be the scores of the seed rules // crf train 2: for iteration t to maximum or convergence do 3: let πx(j) = (1/|Fx|) Σf∈Fx θfj // posterior decode [sent-169, score-0.358]
50 4: let θTfj = (1/|Xf|) Σx∈Xf πx(j) // token to type 5: propagate θT to get θP // graph propagate 6: label the data with θP // viterbi decode 7: train a new DL θfj // crf train 8: end for In θ-only, propagation is done on θ, using the graph of figure 1(b). [sent-170, score-0.379]
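A Python transcription of the loop above, hedged as a sketch: train_dl and propagate_theta are caller-supplied stand-ins (assumed to return and accept distributions keyed as {feature: {label: score}}), and the uniform test at the labelling step follows the rule described below.

```python
def yarowsky_prop(examples, seed_theta, labels, train_dl, propagate_theta, iters=20):
    """Sketch of algorithm 3 (theta^T-only variant). examples: {x: [features]};
    seed_theta: {feature: {label: score}} from the seed rules."""
    theta, theta_P = dict(seed_theta), {}
    uniform = 1.0 / len(labels)
    for _ in range(iters):                                     # line 2
        # line 3: posterior decode, pi_x(j) = (1/|Fx|) * sum_{f in Fx} theta_fj
        pi = {x: {j: sum(theta.get(f, {}).get(j, 0.0) for f in feats) / max(len(feats), 1)
                  for j in labels}
              for x, feats in examples.items()}
        # line 4: token to type, theta^T_fj = (1/|Xf|) * sum_{x in Xf} pi_x(j)
        xf = {}
        for x, feats in examples.items():
            for f in feats:
                xf.setdefault(f, []).append(x)
        theta_T = {f: {j: sum(pi[x][j] for x in xs) / len(xs) for j in labels}
                   for f, xs in xf.items()}
        theta_P = propagate_theta(theta_T)                     # line 5: graph propagate
        labelling = {}                                         # line 6: label with theta^P
        for x, feats in examples.items():
            useful = [f for f in feats if f in theta_P
                      and any(abs(theta_P[f][j] - uniform) > 1e-9 for j in labels)]
            score = {j: sum(theta_P[f][j] for f in useful) for j in labels}
            best = max(labels, key=lambda j: score[j])
            if score[best] > 0.0:
                labelling[x] = best
        theta = train_dl(examples, labelling, labels)          # line 7: retrain the DL
    return theta, theta_P
```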
51 In φ-θ and π-θ propagation is done on the respective bipartite graph (figure 1(a) or the equivalent with π). [sent-171, score-0.491]
52 For the bipartite graph methods φ-θ and π-θ only the propagated θ values on the feature nodes are used for θP (the distributions on the example nodes are ignored after the propagation itself). [sent-173, score-0.619]
53 When labelling an example x (at line 6 and also on testing data) we use argmaxj Σf∈Fx: θPf≠U θPfj, but set Yx = ⊥ if the sum is zero. [sent-177, score-0.319]
54 Ignoring uniform θfP values is intended to provide an equivalent to the DL behaviour of using evidence only from rules that are in the list. [sent-178, score-0.174]
55 At the labelling step on line 6 we label only examples which the pre-propagated θ would also assign a label (using the same rules described above for θP). [sent-181, score-0.514]
56 This choice is intended to provide an equivalent to the Yarowsky-cautious behaviour of limiting the number of labelled examples; most θfP are non-uniform, so without it most examples become labelled early. [sent-182, score-0.478]
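A sketch of this labelling rule together with the cautiousness constraint; the tolerance used for the uniform test and the function signature are assumptions.

```python
def label_example(feats, theta_P, labels, theta_pre=None):
    """Label an example by argmax_j of the sum of theta^P_fj over features whose
    propagated distribution is not uniform; return None (⊥) if that sum is zero.
    In the cautious variant (theta_pre given), only assign a label when the
    pre-propagation theta would also assign one."""
    uniform = 1.0 / len(labels)
    def decide(theta):
        useful = [f for f in feats if f in theta
                  and any(abs(theta[f][j] - uniform) > 1e-9 for j in labels)]
        sums = {j: sum(theta[f][j] for f in useful) for j in labels}
        best = max(labels, key=lambda j: sums[j])
        return best if sums[best] > 0.0 else None
    if theta_pre is not None and decide(theta_pre) is None:
        return None
    return decide(theta_P)
```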
57 (2010) by comparing algorithm 3 here with their algorithm 1. [sent-184, score-0.176]
58 Figure 2: A DL from iteration 5 of Yarowsky on the named entity task. [sent-225, score-0.246]
59 The test data additionally contains some noise examples which are not in the three named entity categories. [sent-236, score-0.204]
60 We use the ‘drug’, ‘land’, and ‘sentence’ tasks, and the seed rules from their best seed selection: ‘alcohol’/‘medical’, ‘acres’/‘court’, and ‘reads’/‘served’ respectively (they do not provide seeds for their other three tasks). [sent-247, score-0.259]
61 We use a threshold of 0.95 and cautiousness parameters n0 = ∆n = 5 as in Collins and Singer (1999), along with fixed values for the propagation parameters µ and ν. [sent-253, score-0.436]
62 Initial experiments with different propagation parameters suggested that as long as ν was kept at its chosen value, changing µ had relatively little effect on the accuracy. [sent-257, score-0.237]
63 We did not find any propagation parameter settings that outperformed this choice. [sent-258, score-0.237]
64 For the Yarowsky-prop algorithms we perform a single iteration of the propagation update for each iteration of the algorithm. [sent-259, score-0.522]
65 The named entity test set contains some examples that are neither person, organization, nor location. [sent-264, score-0.204]
66 Collins and Singer (1999) define noise accuracy as accuracy that includes such instances, and clean accuracy as accuracy calculated across only the examples which are one of the known labels. [sent-265, score-0.435]
67 We report only clean accuracy in this paper; noise accuracy tracks clean accuracy but is a little lower. [sent-266, score-0.329]
68 We also report (clean) non-seeded accuracy, which we define to be clean accuracy over only examples which are not assigned a label by the seed rules. [sent-268, score-0.362]
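A sketch of the three accuracy variants as defined here; gold labels outside the known label set stand in for the noise examples, and all names are assumptions.

```python
def accuracies(pred, gold, known_labels, seeded):
    """noise accuracy: over all test examples; clean accuracy: only over examples
    whose gold label is one of the known labels; non-seeded accuracy: clean
    accuracy restricted to examples not labelled by the seed rules."""
    def acc(xs):
        xs = list(xs)
        return sum(pred.get(x) == gold[x] for x in xs) / len(xs) if xs else 0.0
    noise = acc(gold)
    clean_set = [x for x in gold if gold[x] in known_labels]
    clean = acc(clean_set)
    non_seeded = acc(x for x in clean_set if x not in seeded)
    return noise, clean, non_seeded
```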
69 We test Yarowsky, Yarowsky-cautious, Yarowsky-sum, DL-CoTrain, HS-bipartite in all four forms, and Yarowsky-prop, cautious and non-cautious, in all four forms. [sent-270, score-0.218]
70 For each algorithm except EM we perform a final retraining step. [sent-271, score-0.269]
71 Figure 3: Named entity test set examples where Yarowsky-prop θ-only is correct and no other tested algorithms are correct. [sent-272, score-0.191]
72 The seed DL accuracy is also included for reference. [sent-279, score-0.181]
73 It numerically outperforms DL-CoTrain on the named entity task, is not (statistically) significantly worse on the drug and land tasks, and is significantly better on the sentence task. [sent-281, score-0.241]
74 It also numerically outperforms Yarowsky-cautious on the named entity task and is significantly better on the drug task. [sent-282, score-0.203]
75 It is significantly worse on the land task, where most algorithms converge at labelling all examples with the first sense. [sent-283, score-0.425]
76 Figure 3 shows (all) three examples from the named entity test set where Yarowsky-prop-cautious θ-only is correct but none of the other Yarowsky variants are. [sent-285, score-0.204]
77 Figure 4 shows the test set non-seeded accuracies as a function of the iteration for many of the algorithms. The software is included with the paper submission and will be maintained at https://github. [sent-292, score-0.158]
78 Table 3: Test set percent accuracy and non-seeded test set percent accuracy (respectively) for the algorithms on all tasks (named entity, drug, land, and sentence). [sent-294, score-0.273]
79 The Yarowsky-prop non-cautious algorithms quickly converge to the final accuracy and are not shown. [sent-300, score-0.167]
80 While the other algorithms (figure 4(a)) make a large accuracy improvement in the final retraining step, the Yarowsky-prop algorithms (figure 4(b)) reach comparable accuracies earlier and gain much less from retraining. [sent-301, score-0.404]
81 In table 3, only the cautious algorithms are able to reach the 90% accuracy range. [sent-306, score-0.381]
82 To evaluate the effects of cautiousness we examine the Yarowsky-prop θ-only algorithm on the named entity task in more detail. [sent-307, score-0.408]
83 This algorithm has two classifiers which are trained in conjunction: the DL and the propagated θP. [sent-308, score-0.168]
84 Figure 5 shows the training set coverage (of the labelling on line 6 of algorithm 3) and the test set accuracy of both classifiers, for the cautious and non-cautious versions. [sent-309, score-0.682]
85 The DL and θP converge to similar accuracies within a few more iterations, and the retraining step increases accuracy by less than one percent. [sent-311, score-0.326]
86 On the other hand, the cautious version gradually increases the coverage over the iterations. [sent-312, score-0.286]
87 The DL accuracy follows the coverage closely (similar to the behaviour of Yarowsky- cautious, not shown here), while the propagated classifier accuracy jumps quickly to near 90% and then increases only gradually. [sent-313, score-0.334]
88 Although the DL prior to retraining achieves a roughly similar accuracy in both versions, only the cautious version is able to reach the 90% accuracy range in the propagated classifier and retraining. [sent-314, score-0.697]
89 The cautious version avoids this by making only safe rule selection and labelling choices. [sent-316, score-0.461]
90 Like the non-propagated DL algorithms, the DL component of Yarowsky-prop has much lower accuracy than the propagated classifier prior to the retraining step. [sent-319, score-0.34]
91 Figure 4: Non-seeded test accuracy versus iteration for various algorithms on named entity; panel (a) shows the Collins & Singer algorithms (plus the sum form) and panel (b) the cautious Yarowsky propagation variants. [sent-321, score-0.888]
92 The results for the Yarowsky-prop algorithms are for the propagated classifier θP, except for the final DL retraining iteration. [sent-322, score-0.32]
93 5 Objective function The propagation method φ-θ was motivated by optimizing the equivalent objectives (2) and (3) at each iteration. [sent-324, score-0.268]
94 Figure 6 shows the graph propagation objective (3) along with accuracy for Yarowsky-prop φ-θ without cautiousness. [sent-325, score-0.552]
95 Conversely, the cautious version (not shown here) does not clearly minimize the objective, since cautiousness limits the effect of the propagation. [sent-327, score-0.448]
96 7 Conclusions Our novel algorithm achieves accuracy comparable to Yarowsky-cautious, but is better theoretically motivated by combining ideas from Haffari and Sarkar (2007) and Subramanya et al. [sent-328, score-0.193]
97 As future work, we would like to apply our algorithm to other tasks. Figure 5: Internal train set coverage and non-seeded test accuracy (same scale) for Yarowsky-prop θ-only on named entity; panel (a) is the non-cautious and panel (b) the cautious version. [sent-331, score-0.18]
98 Figure 6: Non-seeded test accuracy (left axis), coverage (left axis, same scale), and objective value (right axis) for Yarowsky-prop φ-θ. [sent-332, score-0.205]
99 We omit the first iteration (where the DL contains only the seed rules) and start the plot at iteration 2 where there is a complete DL. [sent-334, score-0.356]
100 We also believe that our method for adapting Collins and Singer (1999)’s cautiousness to Yarowsky-prop can be applied to similar algorithms with other underlying classifiers, even to structured output models such as conditional random fields. [sent-336, score-0.254]
wordName wordTfidf (topN-words)
[('dl', 0.408), ('yarowsky', 0.33), ('propagation', 0.237), ('cautious', 0.218), ('labelling', 0.212), ('cautiousness', 0.199), ('singer', 0.176), ('haffari', 0.174), ('subramanya', 0.167), ('retraining', 0.153), ('labelled', 0.148), ('graph', 0.147), ('fx', 0.142), ('iteration', 0.125), ('bootstrapping', 0.119), ('collins', 0.113), ('sarkar', 0.11), ('seed', 0.106), ('unlabelled', 0.106), ('fj', 0.102), ('objective', 0.093), ('algorithm', 0.088), ('examples', 0.083), ('propagated', 0.08), ('distributions', 0.079), ('bregman', 0.079), ('fxx', 0.079), ('yx', 0.079), ('abney', 0.077), ('bipartite', 0.076), ('accuracy', 0.075), ('named', 0.068), ('xf', 0.063), ('blum', 0.056), ('sum', 0.055), ('algorithms', 0.055), ('entity', 0.053), ('pi', 0.053), ('line', 0.052), ('clean', 0.052), ('karakos', 0.052), ('qv', 0.052), ('constant', 0.05), ('drug', 0.047), ('rules', 0.047), ('label', 0.046), ('qu', 0.045), ('convergence', 0.041), ('vertices', 0.04), ('coboost', 0.04), ('damianos', 0.04), ('ftj', 0.04), ('por', 0.04), ('wuv', 0.04), ('yarowskyprop', 0.04), ('crf', 0.039), ('eisner', 0.039), ('land', 0.038), ('axis', 0.038), ('currently', 0.038), ('pf', 0.037), ('converge', 0.037), ('coverage', 0.037), ('definition', 0.036), ('haghighi', 0.036), ('canada', 0.036), ('behaviour', 0.035), ('agichtein', 0.035), ('ghahramani', 0.035), ('snowball', 0.035), ('dls', 0.035), ('numerically', 0.035), ('rithms', 0.035), ('percent', 0.034), ('smoothing', 0.034), ('intended', 0.033), ('lafferty', 0.033), ('reach', 0.033), ('notation', 0.033), ('accuracies', 0.033), ('em', 0.032), ('classifier', 0.032), ('enidf', 0.032), ('gholamreza', 0.032), ('rle', 0.032), ('equivalent', 0.031), ('version', 0.031), ('qi', 0.03), ('theoretically', 0.03), ('semisupervised', 0.03), ('fp', 0.029), ('px', 0.029), ('sums', 0.029), ('thresholding', 0.029), ('spelling', 0.029), ('concerned', 0.029), ('distribution', 0.029), ('uniform', 0.028), ('step', 0.028), ('extra', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 42 acl-2012-Bootstrapping via Graph Propagation
Author: Max Whitney ; Anoop Sarkar
Abstract: Bootstrapping a classifier from a small set of seed rules can be viewed as the propagation of labels between examples via features shared between them. This paper introduces a novel variant of the Yarowsky algorithm based on this view. It is a bootstrapping learning method which uses a graph propagation algorithm with a well defined objective function. The experimental results show that our proposed bootstrapping algorithm achieves state of the art performance or better on several different natural language data sets.
2 0.11313251 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction
Author: Seokhwan Kim ; Gary Geunbae Lee
Abstract: Although researchers have conducted extensive studies on relation extraction in the last decade, supervised approaches are still limited because they require large amounts of training data to achieve high performances. To build a relation extractor without significant annotation effort, we can exploit cross-lingual annotation projection, which leverages parallel corpora as external resources for supervision. This paper proposes a novel graph-based projection approach and demonstrates the merits of it by using a Korean relation extraction system based on projected dataset from an English-Korean parallel corpus.
3 0.095420077 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
Author: Shujie Liu ; Chi-Ho Li ; Mu Li ; Ming Zhou
Abstract: In this paper, we address the issue for learning better translation consensus in machine translation (MT) research, and explore the search of translation consensus from similar, rather than the same, source sentences or their spans. Unlike previous work on this topic, we formulate the problem as structured labeling over a much smaller graph, and we propose a novel structured label propagation for the task. We convert such graph-based translation consensus from similar source strings into useful features both for n-best output reranking and for decoding algorithm. Experimental results show that, our method can significantly improve machine translation performance on both IWSLT and NIST data, compared with a state-of-the-art baseline.
4 0.086921774 104 acl-2012-Graph-based Semi-Supervised Learning Algorithms for NLP
Author: Amar Subramanya ; Partha Pratim Talukdar
Abstract: While labeled data is expensive to prepare, ever increasing amounts of unlabeled linguistic data are becoming widely available. In order to adapt to this phenomenon, several semi-supervised learning (SSL) algorithms, which learn from labeled as well as unlabeled data, have been developed. In a separate line of work, researchers have started to realize that graphs provide a natural way to represent data in a variety of domains. Graph-based SSL algorithms, which bring together these two lines of work, have been shown to outperform the state-of-the-art in many applications in speech processing, computer vision and NLP. In particular, recent NLP research has successfully used graph-based SSL algorithms for PoS tagging (Subramanya et al., 2010), semantic parsing (Das and Smith, 2011), knowledge acquisition (Talukdar et al., 2008), sentiment analysis (Goldberg and Zhu, 2006) and text categorization (Subramanya and Bilmes, 2008). Recognizing this promising and emerging area of research, this tutorial focuses on graph-based SSL algorithms (e.g., label propagation methods). The tutorial is intended to be a sequel to the ACL 2008 SSL tutorial, focusing exclusively on graph-based SSL methods and recent advances in this area, which were beyond the scope of the previous tutorial. The tutorial is divided in two parts. In the first part, we will motivate the need for graph-based SSL methods, introduce some standard graph-based SSL algorithms, and discuss connections between these approaches. We will also discuss how linguistic data can be encoded as graphs and show how graph-based algorithms can be scaled to large amounts of data (e.g., web-scale data). Part 2 of the tutorial will focus on how graph-based methods can be used to solve several critical NLP tasks, including basic problems such as PoS tagging, semantic parsing, and more downstream tasks such as text categorization, information acquisition, and sentiment analysis. We will conclude the tutorial with some exciting avenues for future work. Familiarity with semi-supervised learning and graph-based methods will not be assumed, and the necessary background will be provided. Examples from NLP tasks will be used throughout the tutorial to convey the necessary concepts. At the end of this tutorial, the attendee will walk away with the following: • An in-depth knowledge of the current state-of-the-art graph-based SSL algorithms, and the ability to implement them. • The ability to decide on the suitability of graph-based SSL methods for a problem. • Familiarity with different NLP tasks where graph-based SSL methods have been successfully applied. In addition to the above goals, we hope that this tutorial will better prepare the attendee to conduct exciting research at the intersection of NLP and other emerging areas with natural graph-structured data (e.g., Computation Social Science). Please visit http://graph-ssl.wikidot.com/ for details. References Dipanjan Das and Noah A. Smith. 2011. Semi-supervised frame-semantic parsing for unknown predicates. In Proceedings of the ACL: Human Language Technologies. Andrew B. Goldberg and Xiaojin Zhu. 2006. Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization. In Proceedings of the Workshop on Graph Based Methods for NLP. Amarnag Subramanya and Jeff Bilmes. 2008. Soft-supervised text classification. In EMNLP.
Amarnag Subramanya, Slav Petrov, and Fernando Pereira. 2010. Graph-based semi-supervised learning of structured tagging models. In EMNLP. Partha Pratim Talukdar, Joseph Reisinger, Marius Pasca, Deepak Ravichandran, Rahul Bhagat, and Fernando Pereira. 2008. Weakly supervised acquisition of labeled class instances using graph random walks. In EMNLP.
5 0.081100836 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
Author: Arianna Bisazza ; Marcello Federico
Abstract: This paper presents a novel method to suggest long word reorderings to a phrase-based SMT decoder. We address language pairs where long reordering concentrates on few patterns, and use fuzzy chunk-based rules to predict likely reorderings for these phenomena. Then we use reordered n-gram LMs to rank the resulting permutations and select the n-best for translation. Finally we encode these reorderings by modifying selected entries of the distortion cost matrix, on a per-sentence basis. In this way, we expand the search space by a much finer degree than if we simply raised the distortion limit. The proposed techniques are tested on Arabic-English and German-English using well-known SMT benchmarks.
6 0.080341443 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining
7 0.080211885 60 acl-2012-Coupling Label Propagation and Constraints for Temporal Fact Extraction
8 0.076509148 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
9 0.070353553 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons
10 0.064500622 153 acl-2012-Named Entity Disambiguation in Streaming Data
11 0.062646285 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
12 0.061207451 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
13 0.060795806 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
14 0.05813729 121 acl-2012-Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
15 0.057766501 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling
16 0.057638653 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation
17 0.056753334 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
18 0.056591429 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
19 0.055921528 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
20 0.0551771 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition
topicId topicWeight
[(0, -0.179), (1, 0.036), (2, -0.024), (3, 0.018), (4, 0.013), (5, 0.046), (6, 0.009), (7, 0.028), (8, 0.023), (9, -0.043), (10, 0.053), (11, -0.019), (12, -0.047), (13, -0.041), (14, 0.018), (15, 0.018), (16, -0.01), (17, 0.041), (18, 0.101), (19, 0.018), (20, -0.038), (21, -0.074), (22, -0.032), (23, -0.033), (24, -0.029), (25, 0.111), (26, 0.011), (27, -0.093), (28, -0.076), (29, 0.182), (30, -0.051), (31, 0.015), (32, 0.085), (33, 0.022), (34, -0.021), (35, -0.13), (36, 0.119), (37, 0.109), (38, -0.08), (39, 0.042), (40, 0.013), (41, 0.075), (42, 0.071), (43, 0.052), (44, 0.082), (45, 0.008), (46, -0.08), (47, -0.029), (48, -0.062), (49, -0.129)]
simIndex simValue paperId paperTitle
same-paper 1 0.92887449 42 acl-2012-Bootstrapping via Graph Propagation
Author: Max Whitney ; Anoop Sarkar
Abstract: Bootstrapping a classifier from a small set of seed rules can be viewed as the propagation of labels between examples via features shared between them. This paper introduces a novel variant of the Yarowsky algorithm based on this view. It is a bootstrapping learning method which uses a graph propagation algorithm with a well defined objective function. The experimental results show that our proposed bootstrapping algorithm achieves state of the art performance or better on several different natural language data sets.
2 0.54708242 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
Author: Jonathan Berant ; Ido Dagan ; Meni Adler ; Jacob Goldberger
Abstract: Learning entailment rules is fundamental in many semantic-inference applications and has been an active field of research in recent years. In this paper we address the problem of learning transitive graphs that describe entailment rules between predicates (termed entailment graphs). We first identify that entailment graphs exhibit a “tree-like” property and are very similar to a novel type of graph termed forest-reducible graph. We utilize this property to develop an iterative efficient approximation algorithm for learning the graph edges, where each iteration takes linear time. We compare our approximation algorithm to a recently-proposed state-of-the-art exact algorithm and show that it is more efficient and scalable both theoretically and empirically, while its output quality is close to that given by the optimal solution of the exact algorithm.
3 0.53405702 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction
Author: Seokhwan Kim ; Gary Geunbae Lee
Abstract: Although researchers have conducted extensive studies on relation extraction in the last decade, supervised approaches are still limited because they require large amounts of training data to achieve high performances. To build a relation extractor without significant annotation effort, we can exploit cross-lingual annotation projection, which leverages parallel corpora as external resources for supervision. This paper proposes a novel graph-based projection approach and demonstrates the merits of it by using a Korean relation extraction system based on projected dataset from an English-Korean parallel corpus.
4 0.50071794 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu
Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.
5 0.48871556 104 acl-2012-Graph-based Semi-Supervised Learning Algorithms for NLP
Author: Amar Subramanya ; Partha Pratim Talukdar
Abstract: While labeled data is expensive to prepare, ever increasing amounts of unlabeled linguistic data are becoming widely available. In order to adapt to this phenomenon, several semi-supervised learning (SSL) algorithms, which learn from labeled as well as unlabeled data, have been developed. In a separate line of work, researchers have started to realize that graphs provide a natural way to represent data in a variety of domains. Graph-based SSL algorithms, which bring together these two lines of work, have been shown to outperform the state-of-the-art in many applications in speech processing, computer vision and NLP. In particular, recent NLP research has successfully used graph-based SSL algorithms for PoS tagging (Subramanya et al., 2010), semantic parsing (Das and Smith, 2011), knowledge acquisition (Talukdar et al., 2008), sentiment analysis (Goldberg and Zhu, 2006) and text categorization (Subramanya and Bilmes, 2008). Recognizing this promising and emerging area of research, this tutorial focuses on graph-based SSL algorithms (e.g., label propagation methods). The tutorial is intended to be a sequel to the ACL 2008 SSL tutorial, focusing exclusively on graph-based SSL methods and recent advances in this area, which were beyond the scope of the previous tutorial. The tutorial is divided in two parts. In the first part, we will motivate the need for graph-based SSL methods, introduce some standard graph-based SSL algorithms, and discuss connections between these approaches. We will also discuss how linguistic data can be encoded as graphs and show how graph-based algorithms can be scaled to large amounts of data (e.g., web-scale data). Part 2 of the tutorial will focus on how graph-based methods can be used to solve several critical NLP tasks, including basic problems such as PoS tagging, semantic parsing, and more downstream tasks such as text categorization, information acquisition, and sentiment analysis. We will conclude the tutorial with some exciting avenues for future work. Familiarity with semi-supervised learning and graph-based methods will not be assumed, and the necessary background will be provided. Examples from NLP tasks will be used throughout the tutorial to convey the necessary concepts. At the end of this tutorial, the attendee will walk away with the following: • An in-depth knowledge of the current state-of-the-art graph-based SSL algorithms, and the ability to implement them. • The ability to decide on the suitability of graph-based SSL methods for a problem. • Familiarity with different NLP tasks where graph-based SSL methods have been successfully applied. In addition to the above goals, we hope that this tutorial will better prepare the attendee to conduct exciting research at the intersection of NLP and other emerging areas with natural graph-structured data (e.g., Computation Social Science). Please visit http://graph-ssl.wikidot.com/ for details. References Dipanjan Das and Noah A. Smith. 2011. Semi-supervised frame-semantic parsing for unknown predicates. In Proceedings of the ACL: Human Language Technologies. Andrew B. Goldberg and Xiaojin Zhu. 2006. Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization. In Proceedings of the Workshop on Graph Based Methods for NLP. Amarnag Subramanya and Jeff Bilmes. 2008. Soft-supervised text classification. In EMNLP.
Amarnag Subramanya, Slav Petrov, and Fernando Pereira. 2010. Graph-based semi-supervised learning of structured tagging models. In EMNLP. Partha Pratim Talukdar, Joseph Reisinger, Marius Pasca, Deepak Ravichandran, Rahul Bhagat, and Fernando Pereira. 2008. Weakly supervised acquisition of labeled class instances using graph random walks. In EMNLP.
6 0.45829371 121 acl-2012-Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
7 0.44750619 60 acl-2012-Coupling Label Propagation and Constraints for Temporal Fact Extraction
8 0.42067608 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
9 0.42061016 181 acl-2012-Spectral Learning of Latent-Variable PCFGs
10 0.4137997 120 acl-2012-Information-theoretic Multi-view Domain Adaptation
11 0.41205868 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs
12 0.39386657 153 acl-2012-Named Entity Disambiguation in Streaming Data
13 0.39253628 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining
14 0.39058116 83 acl-2012-Error Mining on Dependency Trees
15 0.38996372 74 acl-2012-Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
16 0.38829732 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition
17 0.37288859 117 acl-2012-Improving Word Representations via Global Context and Multiple Word Prototypes
18 0.36981702 163 acl-2012-Prediction of Learning Curves in Machine Translation
19 0.35727656 112 acl-2012-Humor as Circuits in Semantic Networks
20 0.35153857 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets
topicId topicWeight
[(26, 0.042), (28, 0.029), (30, 0.015), (37, 0.473), (39, 0.044), (74, 0.024), (82, 0.018), (84, 0.021), (85, 0.03), (90, 0.106), (92, 0.043), (94, 0.017), (99, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.87032926 42 acl-2012-Bootstrapping via Graph Propagation
Author: Max Whitney ; Anoop Sarkar
Abstract: Bootstrapping a classifier from a small set of seed rules can be viewed as the propagation of labels between examples via features shared between them. This paper introduces a novel variant of the Yarowsky algorithm based on this view. It is a bootstrapping learning method which uses a graph propagation algorithm with a well defined objective function. The experimental results show that our proposed bootstrapping algorithm achieves state of the art performance or better on several different natural language data sets.
2 0.86394954 114 acl-2012-IRIS: a Chat-oriented Dialogue System based on the Vector Space Model
Author: Rafael E. Banchs ; Haizhou Li
Abstract: This system demonstration paper presents IRIS (Informal Response Interactive System), a chat-oriented dialogue system based on the vector space model framework. The system belongs to the class of example-based dialogue systems and builds its chat capabilities on a dual search strategy over a large collection of dialogue samples. Additional strategies allowing for system adaptation and learning implemented over the same vector model space framework are also described and discussed.
3 0.84691107 64 acl-2012-Crosslingual Induction of Semantic Roles
Author: Ivan Titov ; Alexandre Klementiev
Abstract: We argue that multilingual parallel data provides a valuable source of indirect supervision for induction of shallow semantic representations. Specifically, we consider unsupervised induction of semantic roles from sentences annotated with automatically-predicted syntactic dependency representations and use a stateof-the-art generative Bayesian non-parametric model. At inference time, instead of only seeking the model which explains the monolingual data available for each language, we regularize the objective by introducing a soft constraint penalizing for disagreement in argument labeling on aligned sentences. We propose a simple approximate learning algorithm for our set-up which results in efficient inference. When applied to German-English parallel data, our method obtains a substantial improvement over a model trained without using the agreement signal, when both are tested on non-parallel sentences.
Author: Zhaopeng Tu ; Yifan He ; Jennifer Foster ; Josef van Genabith ; Qun Liu ; Shouxun Lin
Abstract: Convolution kernels support the modeling of complex syntactic information in machine-learning tasks. However, such models are highly sensitive to the type and size of syntactic structure used. It is therefore an important challenge to automatically identify high impact sub-structures relevant to a given task. In this paper we present a systematic study investigating (combinations of) sequence and convolution kernels using different types of substructures in document-level sentiment classification. We show that minimal sub-structures extracted from constituency and dependency trees guided by a polarity lexicon show 1.45 point absolute improvement in accuracy over a bag-of-words classifier on a widely used sentiment corpus.
5 0.8093456 71 acl-2012-Dependency Hashing for n-best CCG Parsing
Author: Dominick Ng ; James R. Curran
Abstract: Optimising for one grammatical representation, but evaluating over a different one is a particular challenge for parsers and n-best CCG parsing. We find that this mismatch causes many n-best CCG parses to be semantically equivalent, and describe a hashing technique that eliminates this problem, improving oracle n-best F-score by 0.7% and reranking accuracy by 0.4%. We also present a comprehensive analysis of errors made by the C&C; CCG parser, providing the first breakdown of the impact of implementation decisions, such as supertagging, on parsing accuracy.
6 0.5329628 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
7 0.53022248 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
8 0.49574003 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT
9 0.48607099 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
10 0.48312545 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
11 0.47093534 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
12 0.46837848 183 acl-2012-State-of-the-Art Kernels for Natural Language Processing
13 0.46405774 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models
14 0.46165749 184 acl-2012-String Re-writing Kernel
15 0.45968053 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
16 0.45809424 121 acl-2012-Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
17 0.45651859 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
18 0.45385361 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
19 0.45269939 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
20 0.4525618 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool