emnlp emnlp2010 emnlp2010-84 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church
Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼ 1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.
Reference: text
sentIndex sentText sentNum sentScore
1 Document clustering and classification work surprisingly well on pseudoterms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions. [sent-9, score-0.26]
2 460 This approach identifies long, faithfully repeated patterns in the acoustic signal. [sent-16, score-0.517]
3 These acoustic repetitions often correspond to terms useful for information retrieval tasks. [sent-17, score-0.515]
4 Critically, this method does not require a phonetically interpretable acoustic model or knowledge of the target language. [sent-18, score-0.445]
5 By analyzing a large untranscribed corpus of speech, this discovery procedure identifies a vast number of repeated regions that are subsequently grouped using a simple graph-based clustering method. [sent-19, score-0.524]
6 Given a large collection of text, NLP tools can classify documents by category (classification) and organize documents into similar groups for a high level view of the collection (clustering). [sent-24, score-0.307]
7 This means that unlike with text, where many tools can be applied to new languages and domains with minimal effort, the equivalent tools for speech corpora often require a significant investment. [sent-35, score-0.312]
8 Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline based on gold standard manual transcriptions. [sent-41, score-0.33]
9 This semi-supervised paradigm was relaxed even further with the pursuit of self organizing units (SOUs), phone-like units for which acoustic models are trained with completely unsupervised methods (Garcia and Gish, 2006). [sent-44, score-0.412]
10 Even though the move away from phonetic acoustic models improves the universality of the architecture, small amounts of orthographic transcription are still required to connect the SOUs with the lexicon. [sent-45, score-0.604]
11 The segmental dynamic time warping (S-DTW) algorithm (Park and Glass, 2008) was the first truly zero resource effort, designed to discover portions of the lexicon directly by searching for repeated acoustic patterns in the speech signal. [sent-46, score-0.737]
12 This work implicitly defined a new direction for speech processing research: unsupervised spoken term discovery, the entry point of our speech corpora analysis system. [sent-47, score-0.519]
13 As mentioned above, the application of NLP methods to speech corpora has traditionally relied on high resource ASR systems to provide automatic word or phonetic transcripts. [sent-50, score-0.412]
14 , 2007), for which the recognized words or phone n-grams are used to characterize the documents. [sent-52, score-0.271]
15 Early efforts to perform automatic topic segmentation of speech input without the aid of ASR systems have been promising (Malioutov et al. [sent-54, score-0.27]
1Typically, each frame represents a 25 or 30 ms window of speech sampled every 10 ms. Figure 1: An example of a dotplot for the string “text processing vs. [sent-67, score-0.426]
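The 25 ms window / 10 ms hop framing described in the footnote can be sketched as follows. This is a minimal illustration; the helper name `frame_signal` and the 8 kHz sample rate are assumptions for the example, not details from the paper.

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, win_ms=25, hop_ms=10):
    """Slice a 1-D audio signal into overlapping analysis frames:
    a win_ms window of speech sampled every hop_ms, as in the footnote."""
    win = int(sample_rate * win_ms / 1000)   # samples per window
    hop = int(sample_rate * hop_ms / 1000)   # samples per hop
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

frames = frame_signal(np.zeros(8000))  # 1 s of 8 kHz audio -> 98 frames of 200 samples
print(frames.shape)
```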
17 1 Acoustic Dotplots When applied to text, the dotplot construct is remarkably simple: given character strings s1 and s2, the dotplot is a Boolean similarity matrix K(s1, s2) defined as Kij(s1, s2) = δ(s1[i], s2[j]). [sent-71, score-0.496]
18 ” The boxed diagonal line segment arises from the repeat of the word “processing,” while the main diagonal line trivially arises from self-similarity. [sent-75, score-0.291]
19 Thus, the search for line segments in K off the main diagonal provides a simple algorithmic means to identify repeated terms of possible interest, albeit sometimes partial, in a collection of text documents. [sent-76, score-0.274]
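The Boolean text dotplot and the off-diagonal line-segment search described above can be sketched directly. The `min_len` threshold and the helper names are hypothetical choices for the illustration, not parameters from the paper.

```python
import numpy as np

def text_dotplot(s1, s2):
    """Boolean dotplot: K[i, j] = 1 iff s1[i] == s2[j]."""
    a = np.frombuffer(s1.encode(), dtype=np.uint8)
    b = np.frombuffer(s2.encode(), dtype=np.uint8)
    return a[:, None] == b[None, :]

def diagonal_runs(K, min_len=4, skip_main=False):
    """Find runs of matches of at least min_len along each diagonal;
    skip_main drops the trivial self-similarity diagonal."""
    n, m = K.shape
    runs = []
    for d in range(-(n - 1), m):
        diag = np.diagonal(K, offset=d)
        start = None
        for idx, hit in enumerate(list(diag) + [False]):  # sentinel ends runs
            if hit and start is None:
                start = idx
            elif not hit and start is not None:
                if idx - start >= min_len and not (skip_main and d == 0):
                    runs.append((d, start, idx - start))
                start = None
    return runs

s = "text processing vs. speech processing"
K = text_dotplot(s, s)
# Long off-diagonal runs correspond to the repeated word "processing".
print([r for r in diagonal_runs(K, min_len=8, skip_main=True) if r[0] > 0])
```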
20 Given frame sequences x1, . . . , xN and y1, . . . , yM for two spoken documents, the acoustic dotplot is the real-valued N × M cosine similarity matrix K(x, y) defined as Kij(x, y) = ⟨xi, yj⟩ / (‖xi‖ ‖yj‖). Figure 2: An example of an acoustic dotplot for 8 seconds of speech (posteriorgrams) plotted against itself. [sent-88, score-1.711]
21 (1) Even though the application to speech is a distinctly noisier endeavor, sequences of frames repeated between the two audio clips will still produce approximate diagonal lines in the visualization of the matrix. [sent-93, score-0.433]
22 The search for matched regions thus reduces to the robust search for diagonal line segments in K, which can be efficiently performed with standard image processing techniques. [sent-94, score-0.312]
23 The choice of κ determines an approximate threshold on the duration of the matched regions discovered. [sent-96, score-0.299]
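The cosine-similarity dotplot and a duration-thresholded diagonal score can be sketched as below. The `diagonal_score` helper is a stand-in for the paper's image-processing line-segment filter; the role of `kappa` as an approximate minimum match duration follows the text, but the exact filter is not reproduced here.

```python
import numpy as np

def acoustic_dotplot(X, Y):
    """Real-valued cosine-similarity dotplot between two posteriorgram
    sequences X (N x D) and Y (M x D): K[i, j] = cos(x_i, y_j)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def diagonal_score(K, i, j, kappa):
    """Average similarity along a length-kappa diagonal starting at (i, j);
    thresholding this score approximates a minimum match duration."""
    L = min(kappa, K.shape[0] - i, K.shape[1] - j)
    return float(np.mean([K[i + t, j + t] for t in range(L)]))

rng = np.random.default_rng(0)
X = rng.random((50, 40))          # 50 frames, 40-dimensional features
K = acoustic_dotplot(X, X)
# Self-similarity along the main diagonal is exactly 1.
print(K.shape, round(diagonal_score(K, 0, 0, 10), 3))
```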
24 2 Posteriorgram Representation The acoustic dotplot technique can operate on any vector time series representation of the speech signal, including a standard spectrogram. [sent-101, score-0.838]
25 Phonetic posteriorgrams are a suitable choice, as each frame is represented as the posterior probability distribution over a set of speech sounds given the speech observed at the particular point in time, which is largely speaker-independent by construction. [sent-105, score-0.451]
26 Figure 3 shows an example posteriorgram for the utterance “I had to do that,” computed with a multi-layer perceptron (MLP)-based English phonetic acoustic model (see Section 5 for details). [sent-106, score-0.761]
27 Each row of the figure represents the posterior probability of the given phone as a function of time through the utterance and each column represents the posterior distribution over the phone set at that particular point in time. [sent-107, score-0.585]
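A posteriorgram as described (rows = phones, columns = per-frame posterior distributions) can be illustrated by softmax-normalizing per-frame classifier scores. This is an illustrative stand-in for the MLP acoustic model, not its actual implementation; the 42-phone set size is an arbitrary choice for the example.

```python
import numpy as np

def posteriorgram(activations):
    """Convert per-frame classifier scores (T x P: frames x phone classes)
    into a posteriorgram via a softmax over the phone set, so each frame
    becomes a posterior distribution over speech sounds."""
    z = activations - activations.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
P = posteriorgram(rng.normal(size=(100, 42)))   # 100 frames, 42 phones
# Each frame (column of the figure) sums to 1, as a distribution should.
print(P.shape, bool(np.allclose(P.sum(axis=1), 1.0)))
```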
28 The construction of speaker independent acoustic models typically requires a significant amount of transcribed speech. [sent-108, score-0.514]
29 Our proposed strategy is to employ a speaker independent acoustic model trained in a high resource language or domain to interpret multi-speaker data in the zero resource target setting. [sent-109, score-0.557]
30 (2007) and extended in Hazen and Margolis (2008), where the authors use Hungarian phonetic trigram features to characterize English spoken documents for a topic classification task. [sent-112, score-0.534]
31 3While in this paper our acoustic model is based on our evaluation corpus, this is not a requirement of our approach. [sent-113, score-0.412]
32 Future work will investigate performance of other acoustic models. [sent-114, score-0.412]
33 Because our approach operates on sequences of phonetic posterior distribution vectors (as opposed to reducing the speech to a one-best phonetic token sequence), the phone set used need not be matched to the target language. [sent-115, score-0.914]
34 With this approach, a speaker-independent model trained on the phone set of a reference language may be used to perform speaker independent term discovery in any other. [sent-116, score-0.431]
35 In addition to speaker independence, the use of phonetic posteriorgrams introduces representational sparsity that permits efficient dotplot computation and storage. [sent-117, score-0.596]
36 Using a grid of approximately 100 cores, we were able to perform the O(n²) dotplot computation and line segment search for 60+ hours of speech (corresponding to a 500 terapixel dotplot) in approximately 5 hours. [sent-121, score-0.549]
37 Figure 2 displays the posteriorgram dotplot for 8 seconds of speech against itself (i. [sent-122, score-0.594]
38 The large black boxes in the image result from long stretches of silence or filled pauses; fortunately, these are easily filtered with speech activity detection or simple measures of posteriorgram stability. [sent-127, score-0.333]
39 4 Creating Pseudo-Terms Spoken documents will be represented as bags of pseudo-terms, where pseudo-terms are computed from acoustic repetitions described in the previous section. [sent-128, score-0.595]
40 Let M be a set of matched regions m, each consisting of a pair of speech intervals: ([t1(i), t2(i)], [t1(j), t2(j)]) indicates that the speech from t1(i) to t2(i) is an acoustic match to the speech from t1(j) to t2(j). [sent-129, score-1.129]
41 We call the resulting clusters pseudo-terms since each cluster is a placeholder for a term (word or phrase) spoken in the collection. [sent-137, score-0.299]
42 To perform this pseudo-term clustering we represented matched regions as vertices in a graph with edges representing similarities between these regions. [sent-139, score-0.403]
43 The first represents repeated speech at distinct points in the corpus as determined by the match list M. [sent-144, score-0.283]
44 In particular, we expect improved clustering by introducing weights that reflect acoustic similarity between match intervals, rather than relying solely upon the term discovery algorithm to make a hard decision. [sent-152, score-0.731]
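The graph clustering step above, with edges from the match list M and from temporal overlap, reduces to connected components when edges are unweighted hard decisions. Here is a minimal union-find sketch; the `(utterance, start, end)` region tuples and helper names are assumptions for the illustration, and the weighted variant discussed in the text is not implemented.

```python
class UnionFind:
    """Minimal union-find for grouping matched regions into pseudo-terms."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def overlap(r1, r2):
    """True if two (utterance, start, end) intervals overlap in time."""
    return r1[0] == r2[0] and r1[1] < r2[2] and r2[1] < r1[2]

def pseudo_terms(regions, matches):
    """Connected components over two edge types: acoustic matches from M,
    and temporal overlap between regions. Components become pseudo-terms."""
    uf = UnionFind(len(regions))
    for i, j in matches:                      # edge type 1: match list M
        uf.union(i, j)
    for i in range(len(regions)):             # edge type 2: overlapping spans
        for j in range(i + 1, len(regions)):
            if overlap(regions[i], regions[j]):
                uf.union(i, j)
    clusters = {}
    for i in range(len(regions)):
        clusters.setdefault(uf.find(i), []).append(i)
    return list(clusters.values())

regions = [("utt1", 0.0, 1.0), ("utt2", 3.0, 4.0), ("utt1", 0.5, 1.4), ("utt3", 2.0, 3.1)]
matches = [(0, 1), (2, 3)]
# Regions 0 and 2 overlap in utt1, so the matches chain everything together.
print(sorted(map(sorted, pseudo_terms(regions, matches))))
```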
45 Table 1 contains several examples of pseudoterms and the matched regions included in each group. [sent-163, score-0.297]
46 The development data set was created by selecting the six most commonly prompted topics (recycling, capital punishment, drug testing, family finance, job benefits, car buying) and randomly selecting 60 sides of conversations evenly across the topics (total 360 conversation sides). [sent-173, score-0.457]
47 Note that each participant contributed at most one conversation side per topic, so these 360 conversation sides represent 360 distinct speakers. [sent-176, score-0.3]
48 For the tuning data set, we selected an additional 60 sides of conversations evenly across the same six topics used for development, for a total of 360 conversations and 37. [sent-178, score-0.451]
49 We selected this data by sampling 100 conversation sides from the next six most popular conversation topics (family life, news media, public education, exercise/fitness, pets, taxes), yielding 600 conversation sides containing 61. [sent-183, score-0.626]
50 To provide the requisite speaker independent acoustic model, we compute English phone posteriorgrams using the multi-stream multi-layer perceptron-based architecture of Thomas et al. [sent-196, score-0.839]
51 While this is admittedly a large amount of supervision, it is important to emphasize our zero resource term discovery algorithm does not rely on the phonetic interpretability of this reference acoustic model. [sent-198, score-0.745]
52 4 4The generalization of the speaker independence of acoustic models across languages is not well understood. [sent-201, score-0.473]
53 Unsupervised clustering algorithms sort examples into groups, where each group contains documents that are similar. [sent-204, score-0.3]
54 For example, clustering methods can be used on search results to provide quick insight into the coverage of the returned documents (Zeng et al. [sent-206, score-0.3]
55 In the case of clustering conversations in our collection, we would normally obtain a transcript of the conversation and then extract a bag of words representation for clustering. [sent-209, score-0.641]
56 The resulting clusters may represent topics, such as the six topics used in our Switchboard data. [sent-210, score-0.301]
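The bag-of-words (or bag-of-pseudo-terms) document representation described here can be sketched as a simple count over a document's matched regions. The toy `region_to_term` mapping and region indices are hypothetical data for the example.

```python
from collections import Counter

def bag_of_pseudo_terms(doc_regions, region_to_term):
    """Represent a spoken document as a bag of pseudo-terms: count how often
    each pseudo-term cluster appears among the document's matched regions.
    region_to_term maps a region index to its pseudo-term cluster id."""
    return Counter(region_to_term[r] for r in doc_regions)

# Hypothetical toy data: two documents sharing pseudo-term 0.
region_to_term = {0: 0, 1: 0, 2: 1, 3: 2}
doc_a = bag_of_pseudo_terms([0, 2], region_to_term)
doc_b = bag_of_pseudo_terms([1, 3], region_to_term)
print(doc_a, doc_b)
```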
57 We would like to know if similar clustering results can be obtained without the use of a manual or automatic transcript. [sent-212, score-0.29]
58 In our case, we substitute the pseudo-terms discovered in a conversation for the transcript; this depends, to some extent, on the phonetic similarity of the target and reference language. [sent-213, score-0.311]
59 Unsupervised learning of speaker independent acoustic models remains an important area of future research. [sent-214, score-0.473]
60 In our experiments, we use the six topic labels provided by Switchboard as the clustering labels. [sent-217, score-0.33]
61 While optimal purity can be obtained by putting each document in its own cluster, we fix the number of clusters in all experiments so purity numbers are comparable. [sent-230, score-0.43]
62 The purity of a cluster is defined as the largest percentage of examples in a cluster that have the same topic label. [sent-231, score-0.388]
63 Purity of the entire clustering is the average purity of each cluster: purity(C, L) = (1/N) Σ_{c_i ∈ C} max_{l_j ∈ L} |c_i ∩ l_j| (3), where C is the clustering, L is the reference labeling, and N is the number of examples. [sent-232, score-0.352]
64 Specifically, entropy(C, L) is given by: entropy(C, L) = −Σ_{c_i ∈ C} (N_i/N) Σ_{l_j ∈ L} P(c_i, l_j) log2 P(c_i, l_j) (4), where N_i is the number of instances in cluster i, P(c_i, l_j) is the probability of seeing label l_j in cluster c_i, and the other variables are defined as above. [sent-236, score-0.263]
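The purity and entropy metrics in Eqs. (3) and (4) are straightforward to compute from cluster memberships and reference labels; here is a minimal sketch with toy labels (the topic names are illustrative).

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Eq. (3): (1/N) * sum over clusters of the largest same-label count."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(labels[x] for x in c).values()) for c in clusters) / n

def entropy(clusters, labels):
    """Eq. (4): -sum_i (N_i/N) sum_j P(c_i, l_j) log2 P(c_i, l_j),
    where P(c_i, l_j) is the fraction of cluster i carrying label l_j."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(labels[x] for x in c)
        h = -sum((v / len(c)) * math.log2(v / len(c)) for v in counts.values())
        total += (len(c) / n) * h
    return total

labels = {0: "recycling", 1: "recycling", 2: "taxes", 3: "taxes"}
# A perfect clustering has purity 1.0 and entropy 0.0.
print(purity([[0, 1], [2, 3]], labels), entropy([[0, 1], [2, 3]], labels))
```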
65 B-Cubed measures clustering effectiveness from the perspective of a user inspecting the clustering results (Bagga and Baldwin, 1998). [sent-237, score-0.44]
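B-Cubed can be sketched per-item: each item's precision is the fraction of its cluster sharing its label, and its recall is the fraction of its label class placed in its cluster, both averaged over items. This follows the standard Bagga and Baldwin formulation; the toy assignments below are illustrative.

```python
from collections import Counter

def b_cubed(assign, labels):
    """B-Cubed precision/recall over item->cluster and item->label maps."""
    label_sizes = Counter(labels.values())
    cluster_sizes = Counter(assign.values())
    # count of items sharing each (cluster, label) pair
    cl = Counter((assign[x], labels[x]) for x in assign)
    p = r = 0.0
    for x in assign:
        same = cl[(assign[x], labels[x])]
        p += same / cluster_sizes[assign[x]]   # item precision
        r += same / label_sizes[labels[x]]     # item recall
    n = len(assign)
    return p / n, r / n

assign = {0: "A", 1: "A", 2: "B", 3: "B"}
labels = {0: "t1", 1: "t1", 2: "t2", 3: "t2"}
# A perfect clustering gives B-Cubed precision and recall of 1.0.
print(b_cubed(assign, labels))
```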
66 2 Clustering Algorithms We considered several clustering algorithms: repeated bisection, globally optimal repeated bisection, and agglomerative clustering (see Karypis (2003) for implementation details). [sent-245, score-0.76]
67 Each bisection algorithm is run 10 times and the optimal clustering is selected according to a provided criterion function (no true labels needed). [sent-246, score-0.427]
68 We used the Cluto clustering library for all clustering experiments (Karypis, 2003). [sent-251, score-0.44]
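The repeated-bisection strategy can be sketched as follows: start with one cluster and repeatedly bisect the largest one until k clusters remain. This is a toy stand-in for Cluto's implementation, using plain 2-means for each split and splitting by cluster size; Cluto's actual I2 objective and globally optimal variant are not reproduced here.

```python
import numpy as np

def two_means(X, iters=20, seed=0):
    """Plain 2-means split (Lloyd's algorithm) of the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = d.argmin(axis=1)
        for c in (0, 1):
            if (z == c).any():
                centers[c] = X[z == c].mean(axis=0)
    return z

def repeated_bisection(X, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters remain."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len)
        big = clusters.pop()                   # bisect the largest cluster
        z = two_means(X[big], seed=seed)
        lo, hi = big[z == 0], big[z == 1]
        if len(lo) == 0 or len(hi) == 0:       # degenerate split: halve by index
            lo, hi = big[: len(big) // 2], big[len(big) // 2 :]
        clusters += [lo, hi]
    return clusters

rng = np.random.default_rng(2)
# Three well-separated toy "topics" of 20 points each.
X = np.vstack([rng.normal(m, 0.1, size=(20, 5)) for m in (0.0, 5.0, 10.0)])
parts = repeated_bisection(X, 3)
print(sorted(len(p) for p in parts))
```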
69 In the following section, we report results for the optimal clustering configuration based on experiments on the development data. [sent-252, score-0.34]
70 This baseline is based on a vanilla phone recognizer on top of the same MLP-based acoustic model (see Section 5 and the references therein for details) used to discover the pseudo-terms. [sent-257, score-0.737]
71 In particular, the phone posteriorgrams were transformed to frame-level monophone state likelihoods (through division by the frame-level priors). [sent-258, score-0.366]
72 These state likelihoods were then used along with frame-level phone transition probabilities to Viterbi decode each conversation side. [sent-259, score-0.39]
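The Viterbi decoding step above, combining frame-level state likelihoods with transition probabilities, can be sketched as follows. This is a generic Viterbi decoder with a toy 2-phone example, not the paper's exact recognizer.

```python
import numpy as np

def viterbi(log_lik, log_trans, log_prior):
    """Most likely state (phone) sequence for a T x S log-likelihood matrix,
    given S x S log transition probabilities and S initial log priors."""
    T, S = log_lik.shape
    delta = np.full((T, S), -np.inf)   # best path log-score ending in each state
    back = np.zeros((T, S), dtype=int) # backpointers
    delta[0] = log_prior + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # S x S: prev -> cur
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-phone example with strong self-transitions.
log_lik = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]))
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_prior = np.log(np.array([0.5, 0.5]))
print(viterbi(log_lik, log_trans, log_prior))
```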
73 It is important to emphasize that the reliability of phone recognizers depends on the phone set matching the application language. [sent-260, score-0.542]
74 Using the English acoustic model in this manner on another language will significantly degrade the performance numbers reported below. [sent-261, score-0.412]
75 Representative results on development data with various parameter settings for this clustering configuration appear in Table 3. [sent-273, score-0.337]
76 using globally optimal repeated bisection and the I2 criterion. [sent-289, score-0.329]
77 The best results over the manual word transcript baselines and for each match duration (κ) are highlighted in bold. [sent-290, score-0.267]
78 Pseudo-term results are better than the phonetic baseline and almost as good as the transcript baseline. [sent-291, score-0.383]
79 Compared with the phone trigram features determined by the phone recognizer output, the pseudoterms perform significantly better. [sent-294, score-0.71]
80 Note that these two automatic approaches were built using the identical MLP-based phonetic acoustic model. [sent-295, score-0.604]
81 We sought to select the optimal parameter settings for running on the evaluation data using the development data and the held out tuning data. [sent-296, score-0.362]
82 We choose settings for κ, τ and the clustering parameters that independently maximize the performance averaged over all runs on development data. [sent-298, score-0.337]
83 We then selected the single run corresponding to these parameter settings and checked the re- sult on the held out tuning data. [sent-299, score-0.275]
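The model-selection protocol described above can be sketched: pick the setting maximizing average development-set performance, then check (not re-tune) that single setting on the held-out tuning set. The grid values for κ and τ below are hypothetical, and `dev_score`/`tune_score` stand in for the clustering-quality evaluation.

```python
import itertools

def select_parameters(grid, dev_score, tune_score):
    """Pick the grid point with the best development score, then report its
    held-out tuning score as a sanity check on the selection."""
    best = max(grid, key=dev_score)
    return best, tune_score(best)

grid = list(itertools.product([0.6, 0.8, 1.0],      # kappa: match duration (hypothetical)
                              [0.85, 0.9, 0.95]))   # tau: similarity threshold (hypothetical)
dev = {g: i for i, g in enumerate(grid)}            # toy scores, increasing with index
best, validated = select_parameters(grid, dev.get, dev.get)
print(best, validated)
```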
The optimal parameters (globally optimal repeated bisection clustering with the I2 criterion) [sent-311, score-0.499]
were selected using the development data and validated on tuning data. [sent-313, score-0.257]
86 Note that the clusters produced by each manual transcript test were identical in this case. [sent-314, score-0.298]
87 (globally optimal repeated bisection clustering with the I2 criterion). [sent-315, score-0.499]
88 Results on held out tuning and evaluation data for this setting compared to the manual word transcripts and phone recognizer output are shown in Tables 4 and 5. [sent-322, score-0.63]
89 While the manual transcript baseline is better than our pseudo-term representations, the results are quite competitive. [sent-324, score-0.261]
90 Notice also that the pseudoterm performance remains significantly higher than the phone recognizer baseline on both sets. [sent-326, score-0.363]
91 We then select the opti- mal parameter settings and validate this selection on the held out tuning data, before generating the final representations for the evaluation once the optimal parameters have been selected. [sent-340, score-0.302]
92 The best pseudo-term and manual transcript results for each algorithm are bolded. [sent-358, score-0.261]
93 Pseudo-term results are better than the phonetic baseline and almost as good as the transcript baseline. [sent-360, score-0.383]
94 The performance for pseudo-terms and phone trigrams is roughly comparable, though we expect pseudo-terms to be more robust across languages. [sent-362, score-0.426]
95 Results on held out tuning and evaluation data for this setting compared to the manual transcripts are shown in Tables 7 and 8. [sent-368, score-0.305]
96 Pseudo-term results are very close to the transcript baseline and often better than the phonetic baseline. [sent-382, score-0.383]
97 Pseudo-term results are very close to the transcript baseline and often better than the phonetic baseline. [sent-393, score-0.383]
By clustering repeated acoustic patterns into word-like units in the speech, we perform unsupervised topic clustering as well as supervised classification of spoken documents with performance approaching that achieved with the manual word transcripts, and generally matching or exceeding that achieved with a phonetic recognizer. [sent-394, score-0.783]
99 Our study identified several opportunities and challenges in the development of NLP tools for spoken documents that rely on little or no linguistic resources such as dictionaries and training corpora. [sent-395, score-0.33]
100 Topic identification from audio recordings using word and phone recognition lattices. [sent-468, score-0.366]
wordName wordTfidf (topN-words)
[('acoustic', 0.412), ('phone', 0.271), ('dotplot', 0.248), ('clustering', 0.22), ('phonetic', 0.192), ('transcript', 0.191), ('speech', 0.178), ('switchboard', 0.152), ('hazen', 0.133), ('purity', 0.132), ('spoken', 0.123), ('asr', 0.12), ('conversation', 0.119), ('posteriorgram', 0.114), ('pseudoterms', 0.114), ('bisection', 0.114), ('tuning', 0.108), ('repeated', 0.105), ('repetitions', 0.103), ('regions', 0.102), ('cluster', 0.099), ('jansen', 0.098), ('posteriorgrams', 0.095), ('diagonal', 0.093), ('hours', 0.087), ('matched', 0.081), ('documents', 0.08), ('held', 0.077), ('duration', 0.076), ('manual', 0.07), ('document', 0.069), ('conversations', 0.068), ('tools', 0.067), ('lj', 0.065), ('dredze', 0.063), ('sides', 0.062), ('speaker', 0.061), ('topics', 0.06), ('development', 0.06), ('optimal', 0.06), ('phoneme', 0.059), ('discovery', 0.059), ('topic', 0.058), ('settings', 0.057), ('crammer', 0.057), ('audio', 0.057), ('recognizer', 0.054), ('seconds', 0.054), ('six', 0.052), ('intervals', 0.052), ('validated', 0.051), ('globally', 0.05), ('transcripts', 0.05), ('aren', 0.049), ('cluto', 0.049), ('lvcsr', 0.049), ('telephone', 0.048), ('cw', 0.045), ('prompted', 0.044), ('sec', 0.044), ('utterance', 0.043), ('bag', 0.043), ('church', 0.043), ('resource', 0.042), ('transcribed', 0.041), ('boxes', 0.041), ('trigrams', 0.041), ('term', 0.04), ('classification', 0.04), ('collection', 0.04), ('threshold', 0.04), ('amig', 0.038), ('bagga', 0.038), ('coppersmith', 0.038), ('dotplots', 0.038), ('eij', 0.038), ('garcia', 0.038), ('kij', 0.038), ('novotney', 0.038), ('pseudoterm', 0.038), ('sous', 0.038), ('sweiltehcte', 0.038), ('untranscribed', 0.038), ('mechanical', 0.038), ('koby', 0.038), ('recognition', 0.038), ('clusters', 0.037), ('user', 0.037), ('cosine', 0.036), ('mira', 0.036), ('line', 0.036), ('efforts', 0.034), ('entropy', 0.034), ('selected', 0.033), ('boxed', 0.033), ('phonetically', 0.033), ('malioutov', 0.033), ('zeng', 0.033), ('clauset', 0.033), 
('karypis', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 84 emnlp-2010-NLP on Spoken Documents Without ASR
Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church
Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼ 1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.
2 0.1127759 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation
Author: Chen Zhang ; Joyce Chai
Abstract: While a significant amount of research has been devoted to textual entailment, automated entailment from conversational scripts has received less attention. To address this limitation, this paper investigates the problem of conversation entailment: automated inference of hypotheses from conversation scripts. We examine two levels of semantic representations: a basic representation based on syntactic parsing from conversation utterances and an augmented representation taking into consideration of conversation structures. For each of these levels, we further explore two ways of capturing long distance relations between language constituents: implicit modeling based on the length of distance and explicit modeling based on actual patterns of relations. Our empirical findings have shown that the augmented representation with conversation structures is important, which achieves the best performance when combined with explicit modeling of long distance relations.
3 0.10724336 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech
Author: Vladimir Eidelman ; Zhongqiang Huang ; Mary Harper
Abstract: This paper examines tagging models for spontaneous English speech transcripts. We analyze the performance of state-of-the-art tagging models, either generative or discriminative, left-to-right or bidirectional, with or without latent annotations, together with the use of ToBI break indexes and several methods for segmenting the speech transcripts (i.e., conversation side, speaker turn, or humanannotated sentence). Based on these studies, we observe that: (1) bidirectional models tend to achieve better accuracy levels than left-toright models, (2) generative models seem to perform somewhat better than discriminative models on this task, and (3) prosody improves tagging performance of models on conversation sides, but has much less impact on smaller segments. We conclude that, although the use of break indexes can indeed significantly im- prove performance over baseline models without them on conversation sides, tagging accuracy improves more by using smaller segments, for which the impact of the break indexes is marginal.
4 0.10639788 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
Author: Longhua Qian ; Guodong Zhou
Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clusteringbased stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clusteringbased stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.
5 0.10459021 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
Author: Roberto Navigli ; Giuseppe Crisafulli
Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graphbased clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.
6 0.10419531 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
7 0.093331322 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
8 0.084287204 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails
9 0.084134817 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?
10 0.082639158 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
11 0.080145769 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
12 0.077455968 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models
13 0.077179261 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
14 0.076947697 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications
15 0.075605303 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
16 0.072285376 109 emnlp-2010-Translingual Document Representations from Discriminative Projections
17 0.069634773 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective
18 0.065546528 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
19 0.063784815 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
20 0.061298087 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
topicId topicWeight
[(0, 0.217), (1, 0.149), (2, -0.14), (3, 0.006), (4, -0.033), (5, 0.015), (6, -0.083), (7, 0.009), (8, -0.027), (9, 0.112), (10, -0.062), (11, -0.186), (12, -0.156), (13, 0.148), (14, 0.018), (15, -0.191), (16, -0.036), (17, -0.08), (18, 0.008), (19, 0.01), (20, -0.068), (21, 0.059), (22, -0.036), (23, 0.046), (24, 0.148), (25, 0.157), (26, -0.044), (27, 0.018), (28, -0.049), (29, -0.125), (30, 0.077), (31, -0.135), (32, 0.021), (33, 0.033), (34, -0.038), (35, 0.068), (36, -0.04), (37, 0.07), (38, 0.078), (39, 0.178), (40, 0.01), (41, 0.123), (42, -0.151), (43, -0.009), (44, -0.067), (45, 0.029), (46, -0.076), (47, -0.026), (48, 0.047), (49, -0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.94724023 84 emnlp-2010-NLP on Spoken Documents Without ASR
Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church
Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼ 1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.
2 0.49304417 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
Author: Joseph Reisinger ; Raymond Mooney
Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.
3 0.46521845 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
Author: Longhua Qian ; Guodong Zhou
Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clusteringbased stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clusteringbased stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.
4 0.45669439 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation
Author: Chen Zhang ; Joyce Chai
Abstract: While a significant amount of research has been devoted to textual entailment, automated entailment from conversational scripts has received less attention. To address this limitation, this paper investigates the problem of conversation entailment: automated inference of hypotheses from conversation scripts. We examine two levels of semantic representations: a basic representation based on syntactic parsing from conversation utterances and an augmented representation that takes conversation structures into consideration. For each of these levels, we further explore two ways of capturing long distance relations between language constituents: implicit modeling based on the length of distance and explicit modeling based on actual patterns of relations. Our empirical findings show that the augmented representation with conversation structures is important and achieves the best performance when combined with explicit modeling of long distance relations.
5 0.44358617 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech
Author: Vladimir Eidelman ; Zhongqiang Huang ; Mary Harper
Abstract: This paper examines tagging models for spontaneous English speech transcripts. We analyze the performance of state-of-the-art tagging models, either generative or discriminative, left-to-right or bidirectional, with or without latent annotations, together with the use of ToBI break indexes and several methods for segmenting the speech transcripts (i.e., conversation side, speaker turn, or human-annotated sentence). Based on these studies, we observe that: (1) bidirectional models tend to achieve better accuracy levels than left-to-right models, (2) generative models seem to perform somewhat better than discriminative models on this task, and (3) prosody improves tagging performance of models on conversation sides, but has much less impact on smaller segments. We conclude that, although the use of break indexes can indeed significantly improve performance over baseline models without them on conversation sides, tagging accuracy improves more by using smaller segments, for which the impact of the break indexes is marginal.
6 0.40702808 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails
7 0.38662866 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?
8 0.38630849 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
9 0.3770847 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
10 0.37650874 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
11 0.36680695 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules
12 0.35462961 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
13 0.32819116 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models
14 0.32230905 109 emnlp-2010-Translingual Document Representations from Discriminative Projections
15 0.31184348 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
16 0.3084603 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
17 0.27443716 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective
18 0.27082828 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
19 0.26006061 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval
20 0.25929949 4 emnlp-2010-A Game-Theoretic Approach to Generating Spatial Descriptions
topicId topicWeight
[(3, 0.021), (10, 0.017), (12, 0.053), (29, 0.104), (30, 0.026), (32, 0.027), (52, 0.023), (56, 0.074), (62, 0.018), (66, 0.12), (72, 0.057), (76, 0.031), (87, 0.033), (89, 0.024), (99, 0.282)]
simIndex simValue paperId paperTitle
same-paper 1 0.74862438 84 emnlp-2010-NLP on Spoken Documents Without ASR
Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church
Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.
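The clustering step the abstract mentions — grouping acoustically repeated regions into pseudo-terms and representing each document as a bag of pseudo-terms — can be sketched with a union-find over matched segment pairs. The acoustic matcher that produces the pairs is assumed upstream, and `pseudo_term_features` is a hypothetical name for this sketch:

```python
from collections import Counter

def pseudo_term_features(matches, segment_doc):
    """`matches` is a list of (segment_a, segment_b) pairs the acoustic
    matcher judged to be repetitions of the same ~1 s pattern; connected
    components of that match graph become pseudo-terms, and each document
    is represented as a bag of pseudo-term counts."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:          # union each matched pair
        parent[find(a)] = find(b)

    term_id, docs = {}, {}
    for seg, doc in segment_doc.items():
        root = find(seg)          # component root identifies the pseudo-term
        tid = term_id.setdefault(root, len(term_id))
        docs.setdefault(doc, Counter())[tid] += 1
    return docs  # doc -> Counter over pseudo-term ids
```

The resulting bag-of-pseudo-terms vectors can then be fed to any standard document clustering or classification pipeline, just as word counts would be.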
2 0.54972023 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
Author: Sankaranarayanan Ananthakrishnan ; Rohit Prasad ; David Stallard ; Prem Natarajan
Abstract: Production of parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense incurred in producing such resources, attaining a given performance benchmark with the smallest possible training corpus by choosing informative, nonredundant source sentences from an available candidate pool for manual translation. We present a novel, discriminative sample selection strategy that preferentially selects batches of candidate sentences with constructs that lead to erroneous translations on a held-out development set. The proposed strategy supports a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks demonstrate the superiority of the proposed approach to a number of competing techniques, such as random selection, dissimilarity-based selection, as well as a recently proposed semi-supervised active learning strategy.
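A minimal sketch of the selection idea, assuming weights for error-associated n-grams have already been estimated from the held-out development set. The function names, the bigram feature choice, and the zero-out diversity rule are illustrative simplifications, not the paper's discriminative model:

```python
def ngrams(tokens, n=2):
    """Set of word n-grams in a tokenized sentence (bigrams by default)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def select_batch(candidates, error_weights, batch_size):
    """Greedy batch selection: score each candidate by the total weight of
    error-associated n-grams it contains, then zero out the weight of
    n-grams already covered so later picks are diverse, not redundant."""
    weights = dict(error_weights)
    batch, pool = [], list(candidates)
    for _ in range(min(batch_size, len(pool))):
        best = max(pool, key=lambda s: sum(weights.get(g, 0.0) for g in ngrams(s)))
        batch.append(best)
        pool.remove(best)
        for g in ngrams(best):   # built-in diversity: covered constructs
            weights[g] = 0.0     # no longer attract further selections
    return batch
```

Each selected batch would then be sent for manual translation and added to the training corpus before re-scoring.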
3 0.5496332 86 emnlp-2010-Non-Isomorphic Forest Pair Translation
Author: Hui Zhang ; Min Zhang ; Haizhou Li ; Eng Siong Chng
Abstract: This paper studies two issues, non-isomorphic structure translation and target syntactic structure usage, for statistical machine translation in the context of forest-based tree to tree sequence translation. For the first issue, we propose a novel non-isomorphic translation framework to capture more non-isomorphic structure mappings than traditional tree-based and tree-sequence-based translation methods. For the second issue, we propose a parallel space searching method to generate hypothesis using tree-to-string model and evaluate its syntactic goodness using tree-to-tree/tree sequence model. This not only reduces the search complexity by merging spurious-ambiguity translation paths and solves the data sparseness issue in training, but also serves as a syntax-based target language model for better grammatical generation. Experiment results on the benchmark data show our proposed two solutions are very effective, achieving significant performance improvement over baselines when applying to different translation models.
4 0.54945183 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu
Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.
5 0.54944283 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
Author: Samidh Chatterjee ; Nicola Cancedda
Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic improvement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.
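The sampling idea can be sketched as random walks over a toy lattice. Uniform edge choice here stands in for sampling with the decoder's edge scores, and the dict-based lattice format is an assumption of this sketch:

```python
import random

def sample_lattice_paths(lattice, start, end, n, seed=0):
    """Grow the translation pool by sampling: `lattice` maps a node to a
    list of (next_node, word) edges. Each random walk from `start` to
    `end` yields one translation hypothesis; duplicates are collapsed."""
    rng = random.Random(seed)
    pool = set()
    for _ in range(n):
        node, words = start, []
        while node != end:
            node, word = rng.choice(lattice[node])  # uniform; a real decoder
            words.append(word)                      # would use edge scores
        pool.add(tuple(words))
    return pool  # deduplicated hypotheses to add to the MERT translation pool
```

Because each walk only touches one path, the cost per sample is linear in sentence length, which is how this approach avoids the complexity of operating on the whole lattice.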
6 0.54456598 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
7 0.54402149 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning
8 0.54178679 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification
9 0.54062134 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment
10 0.54011589 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
11 0.53992671 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields
12 0.53981948 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
13 0.53915602 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
14 0.53888202 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
15 0.5386017 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
16 0.53817785 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task
17 0.53810143 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
18 0.53742445 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
19 0.5362072 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media
20 0.53607911 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields