acl acl2010 acl2010-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Prettenhofer ; Benno Stein
Abstract: We present a new approach to cross-language text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce task-specific, cross-lingual word correspondences. We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of interlanguage correspondence modeling. We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. The results are convincing; they demonstrate both the robustness and the competitiveness of the presented ideas.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a new approach to cross-language text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. [sent-4, score-0.471]
2 The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce task-specific, cross-lingual word correspondences. [sent-5, score-0.194]
3 We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of interlanguage correspondence modeling. [sent-6, score-0.23]
4 We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. [sent-7, score-0.278]
5 1 Introduction This paper deals with cross-language text classification problems. [sent-9, score-0.162]
6 Stated precisely: We are given a text classification task γ in a target language T for which no labeled documents are available. [sent-11, score-0.453]
7 γ may be a spam filtering task, a topic categorization task, or a sentiment classification task. [sent-12, score-0.256]
8 In addition, we are given labeled documents for the identical task in a different source language S. [sent-13, score-0.255]
9 Such problems are addressed by constructing a classifier fS with training documents written in S and by applying fS to unlabeled documents written in T . [sent-15, score-0.262]
10 Different approaches are current practice: machine translation of unlabeled documents from T to S, dictionary-based translation of unlabeled documents from T to S, or language-independent concept modeling by means of comparable corpora. [sent-17, score-0.336]
11 Here we propose a different approach to crosslanguage text classification which adopts ideas from the field of multi-task learning (Ando and Zhang, 2005a). [sent-19, score-0.245]
12 Our approach builds upon structural correspondence learning, SCL, a recently proposed theory for domain adaptation in the field of natural language processing (Blitzer et al. [sent-20, score-0.294]
13 In our context a pivot is a pair of words, {wS, wT}, from the source language S and the target language T, which possess similar semantics. [sent-23, score-0.34]
14 Testing unlabeled documents from S and T yields two equivalence classes across the two languages: one class contains the documents where either wS or wT occurs; the other class contains the documents where neither wS nor wT occurs. [sent-25, score-0.459]
15 Ideally, a pivot splits the set of unlabeled documents with respect to the semantics that is associated with {wS, wT}. [sent-26, score-0.48]
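To make the two equivalence classes concrete, here is a minimal sketch of the partition induced by one pivot; the term-frequency matrix, vocabulary indices, and the example pair below are illustrative and not taken from the paper's implementation.

    import numpy as np

    def pivot_partition(X, vocab_index, pivot):
        # Split documents into the two equivalence classes induced by a pivot
        # {w_S, w_T}: documents containing either pivot word vs. documents
        # containing neither.
        w_s, w_t = pivot
        cols = [vocab_index[w_s], vocab_index[w_t]]
        occurs = (X[:, cols] > 0).any(axis=1)
        return np.where(occurs)[0], np.where(~occurs)[0]

    # Illustrative term-frequency matrix over a joint vocabulary VS ∪ VT.
    vocab_index = {"excellent": 0, "exzellent": 1, "langweilig": 2}
    X_u = np.array([[2, 0, 0],   # English document containing "excellent"
                    [0, 1, 0],   # German document containing "exzellent"
                    [0, 0, 3]])  # German document containing neither pivot word
    in_class, out_class = pivot_partition(X_u, vocab_index, ("excellent", "exzellent"))
    print(in_class, out_class)   # -> [0 1] [2]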
16 As we will see, a small number of pivots can capture a sufficiently large part of the correspondences between S and T in order to (1) construct a cross-lingual representation and (2) learn a classifier fST for the task γ that operates on this representation. [sent-29, score-0.391]
17 The approach exploits the words' pragmatics since it considers, during the pivot selection step, task-specific characteristics of language use. [sent-31, score-0.202]
18 Third, an in-depth analysis with respect to important hyperparameters such as the ratio of labeled and unlabeled documents, the number of pivots, and the optimum dimensionality of the cross-lingual representation. [sent-42, score-0.305]
19 In this connection we compile extensive corpora in the languages English, German, French, and Japanese, and for different sentiment classification tasks. [sent-43, score-0.298]
20 Section 4 describes our main contribution, a new approach to cross-language text classification based on structural correspondence learning. [sent-46, score-0.335]
21 Section 5 presents experimental results in the context of cross-language sentiment classification. [sent-47, score-0.14]
22 Traditional approaches to cross-language text classification and CLIR use linguistic resources such as bilingual dictionaries or parallel corpora to induce correspondences between two languages (Lavrenko et al. [sent-51, score-0.429]
23 (1997) is considered as seminal work in CLIR: they propose a method which induces semantic correspondences between two languages by performing latent semantic analysis, LSA, on a parallel corpus. [sent-55, score-0.151]
24 Gliozzo and Strapparava (2005) circumvent the dependence on a parallel corpus by using so-called multilingual domain models, which can be acquired from comparable corpora in an unsupervised manner. [sent-58, score-0.13]
25 Recent work in cross-language text classification focuses on the use of automatic machine translation technology. [sent-60, score-0.22]
26 Most of these methods involve two steps: (1) translation of the documents into the source or the target language, and (2) dimensionality reduction or semi-supervised learning to reduce the noise introduced by the machine translation. [sent-61, score-0.393]
27 Domain Adaptation Domain adaptation refers to the problem of adapting a statistical classifier trained on data from one (or more) source domains (e. [sent-65, score-0.199]
28 In the basic domain adaptation setting we are given labeled data from the source domain and unlabeled data from the target domain, and the goal is to train a classifier for the target domain. [sent-70, score-0.677]
29 Beyond this setting one can further distinguish whether a small amount of labeled data from the target domain is available (Daume, 2007; Finkel and Manning, 2009) or not (Blitzer et al. [sent-71, score-0.202]
30 Note that cross-language text classification can be cast as an unsupervised domain adaptation problem by considering each language as a separate domain. [sent-74, score-0.283]
31 Blitzer et al. (2006) propose an effective algorithm for unsupervised domain adaptation, called structural correspondence learning. [sent-76, score-0.226]
32 SCL then models the correlation between the pivots and all other features by training linear classifiers on the unlabeled data from both domains. [sent-78, score-0.451]
33 Ando and Zhang (2005b) present a semi-supervised learning method based on this paradigm, which generates related tasks from unlabeled data. [sent-82, score-0.136]
34 Quattoni et al. (2007) apply structural learning to image classification in settings where little labeled data is given. [sent-84, score-0.257]
35 Without loss of generality we restrict ourselves to binary classification problems and linear classifiers. [sent-90, score-0.216]
36 When choosing the hinge loss function for L, one obtains the popular Support Vector Machine classifier (Zhang, 2004). [sent-98, score-0.136]
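As a rough illustration of such a base learner (not the paper's exact setup), the following sketch fits a regularized linear model with the hinge loss by stochastic gradient descent; scikit-learn's SGDClassifier and the synthetic data are stand-ins for the paper's own optimization and corpora.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Synthetic bag-of-words counts; labels y are in {-1, +1}.
    rng = np.random.default_rng(0)
    X = rng.poisson(0.3, size=(200, 50)).astype(float)
    y = np.where(X[:, 0] + X[:, 1] > X[:, 2], 1, -1)

    # Hinge loss + L2 penalty corresponds to a linear SVM:
    # w = argmin_w sum_i L(y_i, w^T x_i) + (lambda/2) ||w||^2.
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                        fit_intercept=False, max_iter=1000, tol=1e-3)
    clf.fit(X, y)

    def f(x):
        # The induced decision function f(x) = sign(w^T x).
        return int(np.sign(clf.decision_function(x.reshape(1, -1))[0]))

    print(f(X[0]))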
37 Standard text classification distinguishes between labeled (training) documents and unlabeled (test) documents. [sent-99, score-0.502]
38 Cross-language text classification poses an extra constraint in that training documents and test documents are written in different languages. [sent-100, score-0.344]
39 Here, the language of the training documents is referred to as source language S, and the language of the test documents is referred to as target language T . [sent-101, score-0.317]
40 Let VS denote the vocabulary of the source language and VT the vocabulary of the target language, with VS ∩ VT = ∅. [sent-103, score-0.282]
41 , documents from the training set and the test set. [sent-106, score-0.142]
42 Thus, a linear classifier fS trained on DS associates non-zero weights only with words from VS, which in turn means that fS cannot be used to classify documents written in T . [sent-110, score-0.343]
43 One way to overcome this “feature barrier” is to find a cross-lingual representation for documents written in S and T , which enables the transfer of classification knowledge between the two languages. [sent-112, score-0.218]
44 In the following, we will use θ to denote a map that associates the original |V|-dimensional representation of a document d written in S or T with its cross-lingual representation. [sent-114, score-0.196]
45 Once such a mapping is found, the cross-language text classification problem reduces to a standard classification problem in the cross-lingual space. [sent-115, score-0.33]
46 Note that the existing methods for cross-language text classification can be characterized by the way θ is constructed. [sent-116, score-0.162]
47 4 Cross-Language Structural Correspondence Learning We now present a novel method for learning a map θ by exploiting relations from unlabeled documents written in S and T . [sent-123, score-0.318]
48 We refer to this classification task as the target task. [sent-125, score-0.132]
49 An example of the target task is the determination of sentiment polarity, either positive or negative, of book reviews written in German (T ) given a set of training reviews written in English (S). [sent-126, score-0.422]
50 CL-SCL requires a word translation oracle (e.g., a domain expert) to map words in the source vocabulary VS to their corresponding translations in the target vocabulary VT. [sent-131, score-0.282]
51 For simplicity and without loss of applicability we assume here that the word translation oracle maps each word in VS to exactly one word in VT. [sent-132, score-0.162]
52 Considering our sentiment classification example, the word pair {excellentS, exzellentT} satisfies both conditions: (1) the words are strong indicators of positive sentiment, [sent-138, score-0.256]
53 Figure 1: The document sets underlying CL-SCL (the figure shows words in VS and VT, the document sets DS, DS,u, and DT,u with term-frequency vectors x = (x1, ..., x|V|), and class labels y: positive, negative, or no value). [sent-144, score-0.168]
54 and (2) the words occur frequently in book reviews from both languages. [sent-146, score-0.155]
55 Note that the support of wS and wT can be determined from the unlabeled data Du. [sent-147, score-0.136]
56 We use the following heuristic to form an ordered set P of pivots: First, we choose a subset VP from the source vocabulary VS, with |VP| ≪ |VS|,
57 which contains those words with the highest mutual information with respect to the class label of the target task in DS. [sent-150, score-0.154]
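A possible sketch of this candidate selection step, assuming the labeled source documents are given as a term-count matrix X_S with labels y_S; the dictionary-like translate object and the support function are hypothetical stand-ins for the word translation oracle and for counting a word's occurrences in the unlabeled data, and all numeric arguments are illustrative.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def select_pivots(X_S, y_S, source_vocab, translate, m, support, phi):
        # Rank source words by mutual information with the class label in D_S,
        # translate the candidates with the oracle, and keep the first m pairs
        # whose support in the unlabeled data is at least phi.
        mi = mutual_info_classif(X_S, y_S, discrete_features=True)
        pivots = []
        for j in np.argsort(mi)[::-1]:            # highest mutual information first
            w_s = source_vocab[j]
            w_t = translate.get(w_s)              # word translation oracle (1-to-1 here)
            if w_t is None:
                continue
            if support(w_s) >= phi and support(w_t) >= phi:
                pivots.append((w_s, w_t))
            if len(pivots) == m:
                break
        return pivots

    # Example call (all arguments are placeholders):
    # pivots = select_pivots(X_S, y_S, source_vocab,
    #                        translate={"excellent": "exzellent", "boring": "langweilig"},
    #                        m=100, support=count_in_unlabeled_data, phi=30)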
58 In the second step, CL-SCL models the correlations between each pivot {wS, wT} ∈ P and all other words w ∈ V \ {wS, wT}. [sent-152, score-0.746]
59 IN(x, pl) returns +1 if one of the components of x associated with the words in pl is non-zero and -1 otherwise. [sent-155, score-0.144]
60 Note that each training set Dl contains documents from both languages. [sent-157, score-0.142]
61 Thus, for a pivot pl = {wS, wT} the vector wl captures both the correlation between wS and VS \ {wS} and the correlation between wT and VT \ {wT}. [sent-158, score-0.492]
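Put together, this second step could be sketched as follows, assuming dense term-count matrices: the unlabeled documents from both languages are labeled by the indicator IN(x, pl), the pivot's own columns are masked before fitting so that the predictor must rely on co-occurring words, and SGDClassifier again stands in for the paper's SGD-based optimization (the concrete loss and regularization settings here are illustrative).

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def train_pivot_predictors(X_u, vocab_index, pivots, alpha=1e-4):
        # Build the |V| x m matrix W: column l is the weight vector w_l of a
        # linear classifier that predicts, for unlabeled documents from both
        # languages, whether pivot p_l = {w_S, w_T} occurs (label IN(x, p_l)).
        n_features = X_u.shape[1]
        W = np.zeros((n_features, len(pivots)))
        for l, (w_s, w_t) in enumerate(pivots):
            cols = [vocab_index[w_s], vocab_index[w_t]]
            y = np.where((X_u[:, cols] > 0).any(axis=1), 1, -1)  # IN(x, p_l)
            if len(np.unique(y)) < 2:
                continue                      # skip pivots that do not split D_u
            X_masked = X_u.copy()
            X_masked[:, cols] = 0             # mask the pivot words themselves
            clf = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha,
                                fit_intercept=False, max_iter=1000, tol=1e-3)
            clf.fit(X_masked, y)
            W[:, l] = clf.coef_.ravel()       # column w_l of W
        return W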
62 In the third step, CL-SCL identifies correlations across pivots by computing the singular value decomposition of the |V| × m-dimensional parameter matrix W = (w1, ..., wm).
63 Recall that W encodes the correlation structure between pivot and non-pivot words in the form of multiple linear classifiers. [sent-166, score-0.288]
64 Choosing the columns of U associated with the largest singular values yields those substructures that capture most of the correlation in W. [sent-168, score-0.155]
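A sketch of this third step, using numpy's SVD in place of a dedicated sparse SVD package; the helper name is illustrative.

    import numpy as np

    def compute_theta(W, k):
        # Thin SVD of the |V| x m matrix W; the left singular vectors that
        # belong to the k largest singular values, transposed, give the
        # k x |V| projection theta.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return U[:, :k].T

    # theta @ x maps a |V|-dimensional document vector x to its
    # k-dimensional cross-lingual representation.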
65 The vector v∗ that minimizes the regularized training error for DS in the projected space is defined as follows: v∗ = argmin_{v ∈ R^k} Σ_{(x,y) ∈ DS} L(y, v^T θx) + (λ/2)‖v‖² (2) The resulting classifier fST, which will operate in the cross-lingual setting, is defined as follows: fST(x) = sign(v∗^T θx) [sent-171, score-0.143]
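Equation 2 and the resulting classifier fST could be realized, for instance, as in the sketch below; SGDClassifier is once more a stand-in for the SGD optimization, and lam plays the role of λ.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def train_cross_lingual_classifier(X_S, y_S, theta, lam=1e-4):
        # Fit v* on the labeled source documents projected by theta (Eq. 2)
        # and return f_ST(x) = sign(v*^T theta x).
        Z = X_S @ theta.T                 # project D_S into the k-dimensional subspace
        clf = SGDClassifier(loss="hinge", penalty="l2", alpha=lam,
                            fit_intercept=False, max_iter=1000, tol=1e-3)
        clf.fit(Z, y_S)

        def f_ST(x):
            z = theta @ x                 # cross-lingual representation of x
            return int(np.sign(clf.decision_function(z.reshape(1, -1))[0]))

        return f_ST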
66 4.1 An Alternative View of CL-SCL An alternative view of cross-language structural correspondence learning is provided by the framework of structural learning (Ando and Zhang, 2005a). [sent-172, score-0.252]
67 The basic idea of structural learning is to constrain the learning of the target task by means of related auxiliary prediction tasks. [Algorithm 1, CL-SCL. Input: labeled source data DS and unlabeled data Du = DS,u ∪ DT,u; parameters m, k, λ, and φ. Output: a k × |V|-dimensional matrix θ.] [sent-173, score-0.13]
68 In our context these auxiliary tasks are represented by the pivot predictors, i.e., the columns wl of W. [sent-187, score-0.202]
69 Each column vector wl can be considered as a linear classifier which performs well in both languages. [sent-190, score-0.228]
70 The subspace is used to constrain the learning of the target task by restricting the weight vector w to lie in the subspace defined by θT. [sent-195, score-0.288]
71 5 Experiments We evaluate CL-SCL for the task of cross-language sentiment classification using English as source language and German, French, and Japanese as target languages. [sent-198, score-0.477]
72 5.1 Dataset and Preprocessing We compiled a new dataset for cross-language sentiment classification by crawling product reviews from Amazon. [sent-202, score-0.372]
73 In total, roughly 4 million reviews were crawled in the three languages German, French, and Japanese. [sent-207, score-0.158]
74 Following Blitzer et al. (2007), a review with >3 (<3) stars is labeled as positive (negative); other reviews are discarded. [sent-212, score-0.178]
75 For each language the labeled reviews are grouped according to their category label, and we restrict our experiments to three categories: books, dvds, and music. [sent-213, score-0.178]
76 Since most of the crawled reviews are positive (80%), we decide to balance the number of positive and negative reviews. [sent-214, score-0.151]
77 In this study, we are interested in whether the cross-lingual representation induced by CL-SCL captures the difference between positive and negative reviews; by balancing the reviews we ensure that the imbalance does not affect the learned model. [sent-215, score-0.184]
78 Balancing is achieved by deleting reviews from the majority class uniformly at random for each language-specific category. [sent-216, score-0.149]
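A small sketch of this balancing step, assuming binary labels in {-1, +1} per language-specific category; the random seed and the helper name are illustrative.

    import numpy as np

    def balance(labels, seed=0):
        # Keep all reviews of the minority class and an equally large random
        # subset of the majority class; returns the indices of kept reviews.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        pos = np.where(labels == 1)[0]
        neg = np.where(labels == -1)[0]
        n = min(len(pos), len(neg))
        keep = np.concatenate([rng.choice(pos, n, replace=False),
                               rng.choice(neg, n, replace=False)])
        return np.sort(keep)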
79 The resulting sets are split into three disjoint, balanced sets, containing training documents, test documents, and unlabeled documents; the respective set sizes are 2,000, 2,000, and 9,000-50,000. [sent-217, score-0.136]
80 For each of the nine target-language-category combinations a text classification task is created by taking the training set of the product category in S and the test set of the same product category in T . [sent-219, score-0.162]
81 For the pivot prediction task, λ is set to a small value.
82 4 We investigated an alternative approach to obtain a sparse W by directly enforcing sparse pivot predictors wl through L1-regularization (Tsuruoka et al. [sent-238, score-0.274]
83 Table 1 (fragment). German rows: books, dvd, music with |DS,u| = 50,000, 30,000, 25,000 and |DT,u| = 50,000, 50,000, 50,000; the remaining columns give accuracy (µ, σ) and difference ∆ for the upper bound, CL-MT, and CL-SCL. [sent-252, score-0.204]
84 French rows: books, dvd, music with |DS,u| = 50,000, 30,000, 25,000 and |DT,u| = 32,000, 9,000, 16,000. [sent-258, score-0.204]
85 Japanese rows: books, dvd, music with |DS,u| = 50,000, 30,000, 25,000 and |DT,u| = 50,000, 50,000, 50,000. [sent-264, score-0.204]
86 For each task, the number of unlabeled documents from S and T is given. [sent-332, score-0.136]
87 Accuracy scores (mean and standard deviation σ of 10 repetitions) on the test set of the target language T are reported. [sent-333, score-0.251]
88 The resulting accuracy scores are referred to as the upper bound; this informs us about the expected performance on the target task if training data in the target language is available. [sent-340, score-0.27]
89 Statistical machine translation technology offers a straightforward solution to the problem of cross-language text classification and has been used in a number of cross-language sentiment classification studies (Hiroshi et al. [sent-342, score-0.476]
90 Our baseline CL-MT works as follows: (1) learn a linear classifier on the training data, (2) translate the test documents into the source language (again, we use Google Translate), and (3) predict [sent-345, score-0.317]
91 the sentiment polarity of the translated test documents. [sent-346, score-0.14]
92 Note that the baseline CL-MT does not make use of unlabeled documents. [sent-347, score-0.136]
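For completeness, the three baseline steps could be sketched as follows; translate_to_source and vectorize are hypothetical placeholders for the machine translation service (Google Translate in the paper) and for the shared bag-of-words feature extraction.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def cl_mt_baseline(X_S, y_S, target_docs, translate_to_source, vectorize):
        # (1) learn a linear classifier on the labeled source-language data,
        # (2) machine-translate the target-language test documents into the
        #     source language, (3) predict their sentiment polarity.
        clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                            max_iter=1000, tol=1e-3)
        clf.fit(X_S, y_S)
        translated = [translate_to_source(d) for d in target_docs]
        X_test = vectorize(translated)
        return clf.predict(X_test)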
93 5.4 Performance Results and Sensitivity Table 1 contrasts the classification performance of CL-SCL with the upper bound and with the baseline. [sent-349, score-0.233]
94 The average accuracy is about 82%, which is consistent with prior work on monolingual sentiment analysis (Pang et al. [sent-351, score-0.14]
95 The performance of CL-MT, however, differs considerably between the two European languages and Japanese: for Japanese, the average difference between the upper bound and CL-MT (9. [sent-354, score-0.159]
96 Figure 2: Influence of unlabeled data and hyperparameters on the performance of CL-SCL. [sent-359, score-0.221]
97 The rows show the performance of CL-SCL as a function of (1) the ratio between labeled and unlabeled documents, (2) the number of pivots m, and (3) the dimensionality of the cross-lingual representation k. [sent-360, score-0.518]
98 and the minimum support φ of a pivot in DS,u and DT,u. [sent-361, score-0.202]
99 Although CL-MT outperforms CL-SCL on most tasks for German and French, the difference in accuracy can be considered small (< 1%); only for French book and music reviews is the difference about 2%. [sent-364, score-0.225]
100 The results indicate that if the difference between the upper bound and CL-MT is large, CL-SCL can circumvent the loss in accuracy. [sent-366, score-0.215]
wordName wordTfidf (topN-words)
[('ws', 0.441), ('wt', 0.24), ('pivots', 0.229), ('pivot', 0.202), ('ds', 0.164), ('scl', 0.159), ('pl', 0.144), ('documents', 0.142), ('sentiment', 0.14), ('vt', 0.137), ('unlabeled', 0.136), ('vs', 0.13), ('dl', 0.123), ('classification', 0.116), ('reviews', 0.116), ('sgd', 0.104), ('blitzer', 0.098), ('correspondence', 0.094), ('ando', 0.092), ('japanese', 0.09), ('french', 0.088), ('target', 0.087), ('dt', 0.086), ('crosslanguage', 0.083), ('svd', 0.081), ('du', 0.081), ('classifier', 0.08), ('structural', 0.079), ('dsl', 0.078), ('dvd', 0.078), ('wtx', 0.078), ('correspondences', 0.074), ('vocabulary', 0.072), ('fs', 0.072), ('wl', 0.072), ('music', 0.07), ('subspace', 0.068), ('adaptation', 0.068), ('mask', 0.062), ('clir', 0.062), ('labeled', 0.062), ('correlations', 0.061), ('german', 0.06), ('upper', 0.059), ('vp', 0.059), ('bound', 0.058), ('translation', 0.058), ('loss', 0.056), ('books', 0.056), ('fst', 0.055), ('dimensionality', 0.055), ('domain', 0.053), ('hyperparameters', 0.052), ('beautifuls', 0.052), ('borings', 0.052), ('etfhfeic', 0.052), ('langweiligt', 0.052), ('olsson', 0.052), ('quattoni', 0.052), ('svdlibc', 0.052), ('tuhael', 0.052), ('weimar', 0.052), ('wofh', 0.052), ('source', 0.051), ('zhang', 0.051), ('tv', 0.049), ('oracle', 0.048), ('text', 0.046), ('gthee', 0.045), ('cca', 0.045), ('hve', 0.045), ('strapparava', 0.045), ('linear', 0.044), ('correlation', 0.042), ('languages', 0.042), ('columns', 0.042), ('gliozzo', 0.042), ('circumvent', 0.042), ('competitiveness', 0.042), ('dee', 0.042), ('written', 0.04), ('book', 0.039), ('singular', 0.037), ('bel', 0.037), ('associates', 0.037), ('referred', 0.037), ('representation', 0.036), ('crawled', 0.035), ('parallel', 0.035), ('label', 0.034), ('fer', 0.034), ('substructures', 0.034), ('class', 0.033), ('bilingual', 0.033), ('sch', 0.033), ('constrain', 0.033), ('ste', 0.032), ('balancing', 0.032), ('vector', 0.032), ('minimizes', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999863 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning
Author: Peter Prettenhofer ; Benno Stein
Abstract: We present a new approach to crosslanguage text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce taskspecific, cross-lingual word correspondences. We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of interlanguage correspondence modeling. We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. The results are convincing; they demonstrate both the robustness and the competitiveness of the presented ideas.
2 0.34153864 80 acl-2010-Cross Lingual Adaptation: An Experiment on Sentiment Classifications
Author: Bin Wei ; Christopher Pal
Abstract: In this paper, we study the problem of using an annotated corpus in English for the same natural language processing task in another language. While various machine translation systems are available, automated translation is still far from perfect. To minimize the noise introduced by translations, we propose to use only key ‘reliable” parts from the translations and apply structural correspondence learning (SCL) to find a low dimensional representation shared by the two languages. We perform experiments on an EnglishChinese sentiment classification task and compare our results with a previous cotraining approach. To alleviate the problem of data sparseness, we create extra pseudo-examples for SCL by making queries to a search engine. Experiments on real-world on-line review data demonstrate the two techniques can effectively improvetheperformancecomparedtoprevious work.
3 0.1502146 209 acl-2010-Sentiment Learning on Product Reviews via Sentiment Ontology Tree
Author: Wei Wei ; Jon Atle Gulla
Abstract: Existing works on sentiment analysis on product reviews suffer from the following limitations: (1) The knowledge of hierarchical relationships of products attributes is not fully utilized. (2) Reviews or sentences mentioning several attributes associated with complicated sentiments are not dealt with very well. In this paper, we propose a novel HL-SOT approach to labeling a product’s attributes and their associated sentiments in product reviews by a Hierarchical Learning (HL) process with a defined Sentiment Ontology Tree (SOT). The empirical analysis against a humanlabeled data set demonstrates promising and reasonable performance of the proposed HL-SOT approach. While this paper is mainly on sentiment analysis on reviews of one product, our proposed HLSOT approach is easily generalized to labeling a mix of reviews of more than one products.
4 0.13267003 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
Author: Daphna Shezaf ; Ari Rappoport
Abstract: Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inef- ficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.
5 0.12326564 210 acl-2010-Sentiment Translation through Lexicon Induction
Author: Christian Scheible
Abstract: The translation of sentiment information is a task from which sentiment analysis systems can benefit. We present a novel, graph-based approach using SimRank, a well-established vertex similarity algorithm to transfer sentiment information between a source language and a target language graph. We evaluate this method in comparison with SO-PMI.
6 0.11107446 18 acl-2010-A Study of Information Retrieval Weighting Schemes for Sentiment Analysis
7 0.09732569 123 acl-2010-Generating Focused Topic-Specific Sentiment Lexicons
8 0.092485487 157 acl-2010-Last but Definitely Not Least: On the Role of the Last Sentence in Automatic Polarity-Classification
9 0.091964394 25 acl-2010-Adapting Self-Training for Semantic Role Labeling
10 0.089410901 188 acl-2010-Optimizing Informativeness and Readability for Sentiment Summarization
11 0.084979698 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment
12 0.078925245 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems
13 0.078466542 22 acl-2010-A Unified Graph Model for Sentence-Based Opinion Retrieval
14 0.076317176 89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion
15 0.075774811 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification
16 0.074720874 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
17 0.072533481 122 acl-2010-Generating Fine-Grained Reviews of Songs from Album Reviews
18 0.071045592 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
19 0.070862792 161 acl-2010-Learning Better Data Representation Using Inference-Driven Metric Learning
20 0.068265423 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing
topicId topicWeight
[(0, -0.224), (1, 0.026), (2, -0.123), (3, 0.147), (4, -0.032), (5, -0.017), (6, 0.008), (7, 0.001), (8, 0.011), (9, 0.099), (10, -0.032), (11, 0.108), (12, 0.071), (13, -0.088), (14, -0.12), (15, -0.04), (16, 0.237), (17, -0.18), (18, 0.06), (19, -0.063), (20, -0.022), (21, -0.031), (22, -0.022), (23, -0.172), (24, -0.058), (25, 0.042), (26, 0.093), (27, -0.026), (28, -0.079), (29, -0.042), (30, 0.0), (31, 0.001), (32, 0.074), (33, 0.025), (34, -0.002), (35, 0.219), (36, -0.039), (37, -0.07), (38, 0.143), (39, 0.075), (40, 0.004), (41, -0.074), (42, -0.02), (43, -0.158), (44, -0.054), (45, -0.09), (46, -0.039), (47, -0.138), (48, 0.117), (49, 0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.93481594 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning
Author: Peter Prettenhofer ; Benno Stein
Abstract: We present a new approach to crosslanguage text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce taskspecific, cross-lingual word correspondences. We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of interlanguage correspondence modeling. We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. The results are convincing; they demonstrate both the robustness and the competitiveness of the presented ideas.
2 0.90751171 80 acl-2010-Cross Lingual Adaptation: An Experiment on Sentiment Classifications
Author: Bin Wei ; Christopher Pal
Abstract: In this paper, we study the problem of using an annotated corpus in English for the same natural language processing task in another language. While various machine translation systems are available, automated translation is still far from perfect. To minimize the noise introduced by translations, we propose to use only key ‘reliable” parts from the translations and apply structural correspondence learning (SCL) to find a low dimensional representation shared by the two languages. We perform experiments on an EnglishChinese sentiment classification task and compare our results with a previous cotraining approach. To alleviate the problem of data sparseness, we create extra pseudo-examples for SCL by making queries to a search engine. Experiments on real-world on-line review data demonstrate the two techniques can effectively improvetheperformancecomparedtoprevious work.
3 0.58656806 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
Author: Daphna Shezaf ; Ari Rappoport
Abstract: Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inef- ficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.
4 0.51274949 157 acl-2010-Last but Definitely Not Least: On the Role of the Last Sentence in Automatic Polarity-Classification
Author: Israela Becker ; Vered Aharonson
Abstract: Two psycholinguistic and psychophysical experiments show that in order to efficiently extract polarity of written texts such as customerreviews on the Internet, one should concentrate computational efforts on messages in the final position of the text.
5 0.48162997 122 acl-2010-Generating Fine-Grained Reviews of Songs from Album Reviews
Author: Swati Tata ; Barbara Di Eugenio
Abstract: Music Recommendation Systems often recommend individual songs, as opposed to entire albums. The challenge is to generate reviews for each song, since only full album reviews are available on-line. We developed a summarizer that combines information extraction and generation techniques to produce summaries of reviews of individual songs. We present an intrinsic evaluation of the extraction components, and of the informativeness of the summaries; and a user study of the impact of the song review summaries on users’ decision making processes. Users were able to make quicker and more informed decisions when presented with the summary as compared to the full album review.
6 0.46893537 209 acl-2010-Sentiment Learning on Product Reviews via Sentiment Ontology Tree
7 0.46373186 18 acl-2010-A Study of Information Retrieval Weighting Schemes for Sentiment Analysis
8 0.44255638 212 acl-2010-Simple Semi-Supervised Training of Part-Of-Speech Taggers
9 0.41483828 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems
10 0.40966725 161 acl-2010-Learning Better Data Representation Using Inference-Driven Metric Learning
11 0.3900474 42 acl-2010-Automatically Generating Annotator Rationales to Improve Sentiment Classification
12 0.3893244 151 acl-2010-Intelligent Selection of Language Model Training Data
13 0.38571522 25 acl-2010-Adapting Self-Training for Semantic Role Labeling
14 0.38468581 256 acl-2010-Vocabulary Choice as an Indicator of Perspective
15 0.38014168 210 acl-2010-Sentiment Translation through Lexicon Induction
16 0.35532486 253 acl-2010-Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing
17 0.34570149 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms
18 0.341178 26 acl-2010-All Words Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision
19 0.33571681 63 acl-2010-Comparable Entity Mining from Comparative Questions
20 0.33170396 123 acl-2010-Generating Focused Topic-Specific Sentiment Lexicons
topicId topicWeight
[(8, 0.272), (14, 0.018), (25, 0.055), (33, 0.032), (39, 0.012), (42, 0.024), (44, 0.017), (49, 0.029), (59, 0.104), (71, 0.047), (73, 0.045), (76, 0.01), (78, 0.019), (80, 0.011), (83, 0.072), (84, 0.026), (98, 0.124)]
simIndex simValue paperId paperTitle
same-paper 1 0.77935195 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning
Author: Peter Prettenhofer ; Benno Stein
Abstract: We present a new approach to crosslanguage text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce taskspecific, cross-lingual word correspondences. We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of interlanguage correspondence modeling. We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. The results are convincing; they demonstrate both the robustness and the competitiveness of the presented ideas.
2 0.69103283 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People
Author: Nancy Ide ; Collin Baker ; Christiane Fellbaum ; Rebecca Passonneau
Abstract: The Manually Annotated Sub-Corpus (MASC) project provides data and annotations to serve as the base for a communitywide annotation effort of a subset of the American National Corpus. The MASC infrastructure enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or ported to any of a variety of other formats. MASC includes data from a much wider variety of genres than existing multiply-annotated corpora of English, and the project is committed to a fully open model of distribution, without restriction, for all data and annotations produced or contributed. As such, MASC is the first large-scale, open, communitybased effort to create much needed language resources for NLP. This paper describes the MASC project, its corpus and annotations, and serves as a call for contributions of data and annotations from the language processing community.
3 0.6887871 262 acl-2010-Word Alignment with Synonym Regularization
Author: Hiroyuki Shindo ; Akinori Fujino ; Masaaki Nagata
Abstract: We present a novel framework for word alignment that incorporates synonym knowledge collected from monolingual linguistic resources in a bilingual probabilistic model. Synonym information is helpful for word alignment because we can expect a synonym to correspond to the same word in a different language. We design a generative model for word alignment that uses synonym information as a regularization term. The experimental results show that our proposed method significantly improves word alignment quality.
4 0.56983399 80 acl-2010-Cross Lingual Adaptation: An Experiment on Sentiment Classifications
Author: Bin Wei ; Christopher Pal
Abstract: In this paper, we study the problem of using an annotated corpus in English for the same natural language processing task in another language. While various machine translation systems are available, automated translation is still far from perfect. To minimize the noise introduced by translations, we propose to use only key ‘reliable” parts from the translations and apply structural correspondence learning (SCL) to find a low dimensional representation shared by the two languages. We perform experiments on an EnglishChinese sentiment classification task and compare our results with a previous cotraining approach. To alleviate the problem of data sparseness, we create extra pseudo-examples for SCL by making queries to a search engine. Experiments on real-world on-line review data demonstrate the two techniques can effectively improvetheperformancecomparedtoprevious work.
5 0.56688339 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation
Author: Xianpei Han ; Jun Zhao
Abstract: Name ambiguity problem has raised urgent demands for efficient, high-quality named entity disambiguation methods. In recent years, the increasing availability of large-scale, rich semantic knowledge sources (such as Wikipedia and WordNet) creates new opportunities to enhance the named entity disambiguation by developing algorithms which can exploit these knowledge sources at best. The problem is that these knowledge sources are heterogeneous and most of the semantic knowledge within them is embedded in complex structures, such as graphs and networks. This paper proposes a knowledge-based method, called Structural Semantic Relatedness (SSR), which can enhance the named entity disambiguation by capturing and leveraging the structural semantic knowledge in multiple knowledge sources. Empirical results show that, in comparison with the classical BOW based methods and social network based methods, our method can significantly improve the disambiguation performance by respectively 8.7% and 14.7%. 1
6 0.56633455 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar
7 0.5642432 185 acl-2010-Open Information Extraction Using Wikipedia
8 0.56305438 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition
9 0.56124353 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
10 0.56090403 169 acl-2010-Learning to Translate with Source and Target Syntax
11 0.56072348 14 acl-2010-A Risk Minimization Framework for Extractive Speech Summarization
12 0.56027889 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
13 0.55991232 162 acl-2010-Learning Common Grammar from Multilingual Corpus
14 0.55900574 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans
15 0.55897719 161 acl-2010-Learning Better Data Representation Using Inference-Driven Metric Learning
16 0.55869591 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results
17 0.55771911 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation
18 0.55663204 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery
19 0.55554438 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data
20 0.55529845 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web