emnlp emnlp2011 emnlp2011-23 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Duangmanee Putthividhya ; Junling Hu
Abstract: We present a named entity recognition (NER) system for extracting product attributes and values from listing titles. Information extraction from short listing titles presents a unique challenge, given the lack of informative context and grammatical structure. In this work, we combine supervised NER with bootstrapping to expand the seed list, and output normalized results. Focusing on listings from eBay’s clothing and shoes categories, our bootstrapped NER system is able to identify new brands corresponding to spelling variants and typographical errors of the known brands, as well as novel brands. Among the top 300 new brands predicted, our system achieves 90.33% precision. To output normalized attribute values, we explore several string comparison algorithms and find n-gram substring matching to work well in practice.
Reference: text
sentIndex sentText sentNum sentScore
1 2065 Hamilton Ave, San Jose, CA 95125, dputthividhya@ebay.com [sent-2, score-0.256]
2 Abstract We present a named entity recognition (NER) system for extracting product attributes and values from listing titles. [sent-3, score-0.504]
3 Information extraction from short listing titles presents a unique challenge, given the lack of informative context and grammatical structure. [sent-4, score-0.275]
4 In this work, we combine supervised NER with bootstrapping to expand the seed list, and output normalized results. [sent-5, score-0.291]
5 Focusing on listings from eBay’s clothing and shoes categories, our bootstrapped NER system is able to identify new brands corresponding to spelling variants and typographical errors of the known brands, as well as novel brands. [sent-6, score-1.237]
6 Among the top 300 new brands predicted, our system achieves 90.33% precision. [sent-7, score-0.412]
7 To output normalized attribute values, we explore several string comparison algorithms and find n-gram substring matching to work well in practice. [sent-9, score-0.382]
8 1 Introduction The traditional named entity recognition (NER) task has expanded beyond identifying people, locations, and organizations to book titles, email addresses, phone numbers, and protein names (Nadeau and Sekine 2007). [sent-10, score-0.272]
9 In this paper, we focus on mining short product listing titles, which poses unique challenges. [sent-15, score-0.23]
10 2065 Hamilton Ave, San Jose, CA 95125, juhu@ebay.com [sent-17, score-0.256]
11 Short listings are typical in classified ads, where each seller is given limited space (in terms of words) to describe the product. [sent-18, score-0.356]
12 On eBay, product listing titles cannot exceed 55 characters in length. [sent-19, score-0.311]
13 Extracting product attributes from such short titles faces the following challenges: loss of grammatical structure in short listings, where many nouns are piled together. [sent-21, score-0.571]
14 It can be argued that the use of short listings simplifies the problem of attribute extraction, since short listings can be easily annotated and one can apply a supervised learning approach to extract product attributes. [sent-24, score-1.061]
15 However, as the size of the data grows, obtaining a labeled training set on the scale of millions of listings becomes very expensive. [sent-25, score-0.356]
16 We formulate the product attribute extraction problem as a named entity recognition (NER) task and investigate supervised and semi-supervised approaches to this problem. [sent-27, score-0.601]
17 In addition, we have investigated attribute discovery and normalization to standardized values. [sent-28, score-0.266]
18 We use listings from eBay’s clothing and shoes categories and develop an attribute extraction system for 4 attribute types. [sent-29, score-1.241]
19 We have 105,335 listings from the men’s clothing category and 72,628 listings from the women’s clothing category. [sent-30, score-1.33]
20 In the first part of this work, we outline a supervised learning approach to attribute value extraction where we train a sequential classifier and evaluate the extraction performance on a set of hand-labeled listings. [sent-33, score-0.439]
21 In the second part of our work, to grow our seed list of attributes, we present a bootstrapped algorithm for attribute value discovery and normalization, homing in on one particular attribute (brand). [sent-36, score-0.678]
22 The goal is, given an initial list of unambiguous brands, to grow the seed dictionary by discovering context patterns that are often associated with such an attribute type. [sent-37, score-0.466]
23 For example, the word camel, which is both a brand and a color, will not be part of this initial seed list used to create the training set. [sent-41, score-0.657]
24 A classifier is then trained to learn context patterns surrounding the known brands from the training set, and is used to discover new brands from the test set. [sent-42, score-0.921]
25 Finally, for known attribute values, we normalize the results by matching them to words in our dictionary. [sent-43, score-0.268]
26 Normalizing the variants of a known brand to a single normalized output value is an important aspect of a successful information extraction system. [sent-44, score-0.544]
27 The main contribution of this work is a product attribute extraction system that addresses the unique problems of information extraction from short listing titles. [sent-46, score-0.54]
28 We combine supervised NER with bootstrapping to expand the seed list, and investigate several methods to normalize the extracted results. [sent-47, score-0.253]
29 2 Related Work Recent work on product attribute extraction by (Brody and Elhadad 2010) applies a Latent Dirichlet Allocation (LDA) model to identify different aspects of products from user reviews. [sent-49, score-0.346]
30 Our work is most closely related to (Ghani 2006), where a set of product attributes of interest is predefined and a supervised learning method is applied to extract the correct attribute values for each class. [sent-56, score-0.409]
31 Then, this auto-labeled training set is used to train a classifier to identify new attribute values from a separate set of unlabeled data. [sent-61, score-0.297]
32 Thirdly, newly discovered product attribute values are added back to our seed list. [sent-62, score-0.551]
33 Thus our original classifier for product attribute extraction can be improved through an expanded seed list. [sent-63, score-0.561]
34 Our seed list expansion algorithm indeed bears some similarity to the work of (Nadeau et al. 2006) and (Nadeau 2007). [sent-71, score-0.216]
35 In (Nadeau et al. 2006), automobile brands are learned automatically from web page context. [sent-72, score-0.434]
36 First, a small set of 196 seed brands is extracted together with their associated web page contexts from popular news feeds. [sent-73, score-0.564]
37 In our case, user-generated short product listings may have many nouns concatenated together without forming a phrase or obeying correct grammatical rules. [sent-82, score-0.43]
38 2009), where instances of known entity relations (or seed list in our paper) are matched to sentences in a set of Wikipedia articles, and a learning algorithm is trained from the surrounding features of the entities. [sent-84, score-0.278]
39 In our case, we apply our learned model to a new test set, and discover new brand names from the listings. [sent-86, score-0.492]
40 3 Corpus The data used in all analysis in this paper is obtained from eBay’s clothing and shoes category. [sent-92, score-0.379]
41 Clothing and shoes have been important revenue-generating categories on the eBay site, and a successful attribute extraction system will serve as an invaluable tool for gathering important business and marketing intelligence. [sent-93, score-0.393]
42 For these categories, the attributes that we are interested in are brand (B), garment type/style (G), size (S), and color (C). [sent-94, score-0.666]
43 We gather 105,335 listings from the men’s clothing category and 72,628 listings from the women’s clothing category, constituting a dataset of 1,380,337 word tokens. [sent-95, score-1.3]
44 A few examples of listings from eBay’s clothing and shoes categories are shown in Fig 1. [sent-98, score-0.735]
45 When designing an attribute extraction system to distinguish between the 4 attribute types, we must take into account the fact that individual words alone without considering context are ambiguous, as each word can belong to multiple attribute types. [sent-99, score-0.74]
46 To give concrete examples, inc is a brand name of women’s apparel, but many sellers use it as an acronym for inch (brand vs. size). [sent-100, score-0.56]
47 The word blazer can be a brand entity or it can be a garment type (brand vs. garment type). [sent-102, score-0.66]
48 In addition, like other real-world user-generated texts, eBay listings are littered with site-specific acronyms. [sent-104, score-0.356]
49 4 Supervised Named Entity Recognition In the first part of this work, we adopt a supervised named entity recognition (NER) framework for the attribute extraction problem from eBay listing titles. [sent-110, score-0.683]
50 The goal is to correctly extract, from each listing, attribute values corresponding to the 4 attribute types. [sent-111, score-0.468]
51 We generate our training data as follows. (Figure 1: Example listings and their corresponding labels from the clothing and shoes category.) [sent-113, score-0.735]
52 Given 4 dictionaries of seed values for the 4 attribute types, we match n-gram tokens to the seed values in the dictionaries and create an initial round of labeled training data, which must then be manually inspected for correctness. [sent-117, score-0.538]
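To make this matching step concrete, here is a minimal sketch of dictionary-based auto-labeling. The greedy longest-match-first strategy and the toy seed entries are our assumptions; the paper does not spell out its exact matching procedure.

```python
# Sketch: auto-label a listing title by matching n-grams against seed
# dictionaries, preferring longer matches. Seed entries are toy examples.
SEEDS = {
    "B": {"ralph lauren", "calvin klein"},   # brand
    "G": {"jacket", "polo shirt"},           # garment type
    "S": {"xl", "medium"},                   # size
    "C": {"black", "navy"},                  # color
}

def auto_label(title, max_n=3):
    tokens = title.lower().split()
    labels = ["NA"] * len(tokens)
    for n in range(max_n, 0, -1):            # longest match first
        for i in range(len(tokens) - n + 1):
            if any(l != "NA" for l in labels[i:i + n]):
                continue                     # already covered by a longer match
            ngram = " ".join(tokens[i:i + n])
            for tag, entries in SEEDS.items():
                if ngram in entries:
                    labels[i:i + n] = [tag] * n
                    break
    return list(zip(tokens, labels))

print(auto_label("Ralph Lauren polo shirt XL black NWT"))
# [('ralph', 'B'), ('lauren', 'B'), ('polo', 'G'), ('shirt', 'G'),
#  ('xl', 'S'), ('black', 'C'), ('nwt', 'NA')]
```

The resulting labels form the initial round of training data that is then manually inspected, as described above.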
53 In this work, we tagged and manually verified 1,000 listings randomly sampled from the 105,335 listings in the men’s clothing category, resulting in a total of 7,921 labeled tokens with a 1,521-word vocabulary. [sent-118, score-0.991]
54 1 Classifiers One of the most popular generative-model-based classifiers for named entity recognition tasks is the Hidden Markov Model (HMM), which explicitly captures temporal statistics in the data by modeling state (label/tag) transitions over time. [sent-122, score-0.288]
55 More recently, Conditional Random Fields (CRFs) (Feng and McCallum 2004; McCallum 2003) have been proposed for sequence labeling problems and have been established by many as the state-of-the-art model for the supervised named entity recognition task. [sent-126, score-0.255]
56 In our task, sequences of word tokens from listing titles are our observations. [sent-133, score-0.237]
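As a rough illustration of this formulation, the sketch below trains a CRF on tokenized titles. The sklearn-crfsuite library and the feature template are our choices for illustration, not the paper's toolkit or feature set.

```python
# Sketch: simple lexical features per token, then CRF training.
# Assumption: sklearn-crfsuite stands in for the paper's CRF implementation.
import sklearn_crfsuite

def token_features(tokens, i):
    w = tokens[i]
    return {
        "lower": w.lower(),
        "is_digit": w.isdigit(),
        "prefix3": w.lower()[:3],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One labeled title (toy data; the paper trained on 1,000 tagged listings).
titles = [["Ralph", "Lauren", "polo", "XL", "black"]]
tags = [["B-beg", "B-in", "G-beg", "S-beg", "C-beg"]]

X = [[token_features(t, i) for i in range(len(t))] for t in titles]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, tags)
print(crf.predict(X))
```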
57 This is mainly due to the fact that eBay listing titles are not complete sentences and the output from running a POS tagger through such data can indeed be unreliable. [sent-168, score-0.276]
58 In addition, we find that morphological features are less predictive of entity names in eBay listing titles than they are in formal documents. [sent-170, score-0.387]
59 Indeed, CRF has been established by many as the state-of-the-art supervised named entity recognition system for traditional NER tasks (Feng and McCallum 2004; McCallum 2003), for NER in biomedical texts (Settles 2004), and in various languages besides English, such as Bengali (Ekbal et al. [sent-192, score-0.255]
60 Given 1,000 manually tagged listings from the clothing and shoes category on eBay, we adopt a 90-10 split and use 90% of the data for training and 10% for testing. [sent-198, score-0.765]
61 Table 2: Classification accuracy (%) on 9-class NER on the men’s clothing dataset, comparing SVM, MaxEnt, supervised HMM, and CRF. [sent-203, score-0.35]
62 Each token is manually assigned to one of the 5 tags: brand (B), size (S), color (C), garment type (G), and none of the above (NA). [sent-204, score-0.606]
63 In order to more accurately capture the boundary of multi-token attribute values, we further sub-divide each tag into 2 classes using -beg and -in sub-tags. [sent-205, score-0.255]
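A minimal sketch of this sub-tagging (the function name is ours) appears below; 4 base tags with -beg/-in sub-tags plus NA yields the 9 classes evaluated in Table 2.

```python
# Sketch: split each base tag into -beg/-in sub-tags so the boundaries of
# multi-token attribute values are recoverable (analogous to BIO encoding).
def to_sub_tags(base_tags):
    out = []
    for i, tag in enumerate(base_tags):
        if tag == "NA":
            out.append("NA")
        elif i > 0 and base_tags[i - 1] == tag:
            out.append(tag + "-in")      # continuation of a span
        else:
            out.append(tag + "-beg")     # first token of a span
    return out

print(to_sub_tags(["B", "B", "G", "G", "S", "NA"]))
# ['B-beg', 'B-in', 'G-beg', 'G-in', 'S-beg', 'NA']
```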
64 1 Growing Seed Dictionary In this work, we focus on the problem of how to grow the seed dictionary and discover new brand names from eBay listing data. [sent-235, score-0.88]
65 However, especially with a small training set size, we often find that the trained model puts too much weight on the dictionary membership feature and new attribute values are not properly detected. [sent-243, score-0.338]
66 In this section, instead of using the seed list of known attribute values as a feature into a classifier, we use the seed values to automatically generate labeled training data. [sent-244, score-0.572]
67 For the specific case of brand discovery, this initial list used to generate training data must contain only names that are unambiguously brands. [sent-245, score-0.492]
68 The training/test data is generated by matching n-gram tokens in listing titles to all the entries in the initial brand seed dictionary. [sent-247, score-0.868]
69 The listings with at least one non-NA tag are put in the training set, and listings that contain only NA tags are in the test set. [sent-250, score-0.712]
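This partitioning rule can be sketched as follows, over listings auto-labeled as in the matching step above:

```python
# Sketch: listings that matched at least one seed entry become training
# data; listings with only NA tags form the test pool for brand discovery.
def partition(labeled_listings):
    train, test = [], []
    for tagged_tokens in labeled_listings:
        if any(tag != "NA" for _, tag in tagged_tokens):
            train.append(tagged_tokens)
        else:
            test.append(tagged_tokens)
    return train, test
```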
70 The partitioning is done, as described above, in such a way that known brands in the seed list do not exist in the test set. (Table 3: Discovered attribute values, rank-ordered by their confidence scores.) [sent-254, score-0.832]
71 (Middle) Discovered brands from Men’s clothing category, with 3,499 seed values used. [sent-257, score-0.843]
72 (Right) Discovered garment types (styles) from Men’s clothing category, learned from 203 seed values. [sent-258, score-0.565]
73 During the test phase, the classifier predicts the most likely brand attribute from each listing, where we are only interested in the predictions with confidence scores exceeding a set threshold. [sent-263, score-0.731]
74 We rank-order the predicted brands by their confidence scores (probabilities), and the top 300 unique brands are selected. [sent-264, score-0.847]
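The selection step can be sketched as below; the threshold value and the (brand, score) record format are our assumptions.

```python
# Sketch: keep brand predictions whose confidence exceeds a threshold,
# then take the top-k unique brands by score. Threshold is illustrative.
def top_k_brands(predictions, threshold=0.8, k=300):
    seen, ranked = set(), []
    for brand, score in sorted(predictions, key=lambda p: -p[1]):
        if score < threshold or len(ranked) == k:
            break
        if brand not in seen:
            seen.add(brand)
            ranked.append((brand, score))
    return ranked
```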
75 We manually verified the 300 predicted brands and found that 90. [sent-265, score-0.435]
76 33% of the predicted brands are indeed names of designers or women’s apparel stores (true positives), resulting in a precision score of 90.33%. [sent-266, score-0.587]
77 Indeed, the precision score presented above is obtained using an exact matching criterion where partial extraction of a brand is regarded as a miss. [sent-268, score-0.517]
78 The left column of Table 3 shows examples of newly discovered brands from Women’s clothing category. [sent-271, score-0.782]
79 Many of these newly discovered brands are indeed misspelled versions of the known brands in the seed dictionary. [sent-272, score-1.14]
80 The middle column of Table 3 shows a set of Men’s clothing brands learned automatically from a similar experiment conducted on a set of 105,335 listings from the Men’s clothing category. [sent-273, score-1.326]
81 Table 4: NER accuracy on 2 test sets as the seed dictionary for brands grows. [sent-278, score-0.618]
82 Results shown here are obtained on the same Men’s clothing category dataset as used for the supervised NER results in Table 2. [sent-279, score-0.35]
83 From an initial set of 3,499 known brand seeds, we partition the dataset into a training set of 67,307 listings and a test set of 38,028 listings (for later reference, we refer to this test set as set A). [sent-280, score-1.18]
84 We carry out a similar experiment to grow the seed dictionary for garment type, and are able to identify the top 60 new garment types. [sent-283, score-0.5]
85 Examples of the newly discovered garment types are shown in Table 3 (right column), where abbreviated forms of garment types such as jkt (short for jacket) and pjs (short for pajamas) are also discovered through our algorithm. [sent-285, score-0.42]
86 By adding these newly discovered attributes back to the dictionary, we can now re-evaluate our supervised NER system from section 4 with the grown seed list. [sent-286, score-0.344]
87 To this end, we construct 2 test sets from the same 105,335 listings of the Men’s clothing category as used in Section 4. [sent-287, score-0.665]
88 Test set 1 is a set of 500 listings randomly sampled from the 38,028-listing subset known not to contain any brands in the original brand seed dictionary (set A). [sent-288, score-1.442]
89 Since this dataset is known not to contain any brands from the original brand seed dictionary, the addition of 200 new brands solely accounts for the accuracy boost. [sent-290, score-1.444]
90 Test set 2 is constructed slightly differently, by randomly sampling 500 listings from the entire 105,335 listings of the Men’s clothing category. [sent-291, score-0.991]
91 Normalizing the variants of a known brand to a single normalized output value is an important aspect of our attribute extraction algorithm, as these variants account for over 20% of listings in the eBay clothing and shoes category. [sent-295, score-1.513]
92 In this work, since the attribute values are often only partially extracted, we rely on fuzzy string matching for normalization. [sent-300, score-0.234]
93 In our experiment, we find the optimal n for brands to be 3 and 4. [sent-310, score-0.412]
94 Table 5 shows a few examples of normalized outputs as a result of finding the best match for the extracted brand names from among a set of predefined normalized values. [sent-311, score-0.568]
95 When the best matching score falls below a threshold, we declare no match is found and classify the extracted brand as a new brand. [sent-312, score-0.479]
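A minimal sketch of this normalization step, assuming a Dice-style overlap over character n-grams (n = 3, within the 3-4 range reported as optimal) and an illustrative 0.5 threshold:

```python
# Sketch: normalize an extracted brand by character n-gram overlap with a
# dictionary of canonical brands. The Dice-style score and 0.5 threshold
# are assumptions; the paper found n = 3 and 4 to work best.
def char_ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_score(a, b, n=3):
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def normalize(extracted, canonical_brands, threshold=0.5):
    best = max(canonical_brands, key=lambda c: ngram_score(extracted, c))
    if ngram_score(extracted, best) >= threshold:
        return best
    return None  # no match above threshold: treat as a newly discovered brand

print(normalize("ralp lauren", {"ralph lauren", "calvin klein"}))  # 'ralph lauren'
```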
96 In our experiments with brand normalization, over 50% of the matches from the Jaro-Winkler distance are, however, identified as being incorrect. [sent-317, score-0.434]
97 Focusing on the clothing and shoes categories on eBay’s site, we presented a bootstrapped algorithm that can identify new brand names corresponding to (1) spelling variants or typographical errors of the known brands in the seed list and (2) novel brands or designers. [sent-321, score-1.959]
98 Our attribute extractor correctly discovers new brands with over 90% precision on multiple corpora of listings. [sent-322, score-0.646]
99 To output normalized attribute values, we explore several fuzzy string comparison algorithms and find n-gram substring matching to work well in practice. [sent-323, score-0.382]
100 Settles, B. (2004), Biomedical named entity recognition using conditional random fields and rich feature sets, in Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland. [sent-534, score-0.266]
wordName wordTfidf (topN-words)
[('brand', 0.434), ('brands', 0.412), ('listings', 0.356), ('clothing', 0.279), ('ebay', 0.256), ('attribute', 0.234), ('listing', 0.156), ('seed', 0.152), ('ner', 0.14), ('garment', 0.134), ('shoes', 0.1), ('entity', 0.092), ('nadeau', 0.089), ('titles', 0.081), ('crf', 0.081), ('hmm', 0.075), ('product', 0.074), ('svm', 0.072), ('men', 0.072), ('ghani', 0.067), ('maxent', 0.067), ('classifier', 0.063), ('named', 0.063), ('discovered', 0.061), ('women', 0.06), ('attributes', 0.06), ('bootstrapping', 0.06), ('recognition', 0.059), ('names', 0.058), ('dictionary', 0.054), ('membership', 0.05), ('acronym', 0.048), ('viterbi', 0.046), ('matching', 0.045), ('kickers', 0.045), ('loom', 0.045), ('probst', 0.045), ('sellers', 0.045), ('supervised', 0.041), ('substring', 0.041), ('temporal', 0.039), ('indeed', 0.039), ('extraction', 0.038), ('char', 0.038), ('chieu', 0.038), ('fruit', 0.038), ('normalized', 0.038), ('color', 0.038), ('jones', 0.037), ('entropy', 0.036), ('classifiers', 0.035), ('known', 0.034), ('apparel', 0.033), ('camel', 0.033), ('gruhl', 0.033), ('kanaris', 0.033), ('pakhomov', 0.033), ('bootstrapped', 0.032), ('normalization', 0.032), ('category', 0.03), ('brody', 0.03), ('newly', 0.03), ('yves', 0.029), ('ratnaparkhi', 0.029), ('mccallum', 0.028), ('exclusive', 0.027), ('fields', 0.027), ('acronyms', 0.026), ('laurent', 0.026), ('minkov', 0.026), ('grow', 0.026), ('substrings', 0.026), ('conditional', 0.025), ('elhadad', 0.025), ('title', 0.025), ('sequential', 0.025), ('expansion', 0.025), ('kondrak', 0.024), ('typographical', 0.024), ('string', 0.024), ('predicted', 0.023), ('informal', 0.022), ('identity', 0.022), ('character', 0.022), ('automobile', 0.022), ('bengali', 0.022), ('calvin', 0.022), ('designers', 0.022), ('ekbal', 0.022), ('fano', 0.022), ('faruqui', 0.022), ('halacsy', 0.022), ('hunpos', 0.022), ('invariants', 0.022), ('krema', 0.022), ('krishnan', 0.022), ('lauren', 0.022), ('separating', 0.021), ('gathering', 0.021), ('tag', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
Author: Duangmanee Putthividhya ; Junling Hu
Abstract: We present a named entity recognition (NER) system for extracting product attributes and values from listing titles. Information extraction from short listing titles presents a unique challenge, given the lack of informative context and grammatical structure. In this work, we combine supervised NER with bootstrapping to expand the seed list, and output normalized results. Focusing on listings from eBay’s clothing and shoes categories, our bootstrapped NER system is able to identify new brands corresponding to spelling variants and typographical errors of the known brands, as well as novel brands. Among the top 300 new brands predicted, our system achieves 90.33% precision. To output normalized attribute values, we explore several string comparison algorithms and find n-gram substring matching to work well in practice.
2 0.16351198 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
Author: Matthias Hartung ; Anette Frank
Abstract: This paper introduces an attribute selection task as a way to characterize the inherent meaning of property-denoting adjectives in adjective-noun phrases, such as e.g. hot in hot summer denoting the attribute TEMPERATURE, rather than TASTE. We formulate this task in a vector space model that represents adjectives and nouns as vectors in a semantic space defined over possible attributes. The vectors incorporate latent semantic information obtained from two variants of LDA topic models. Our LDA models outperform previous approaches on a small set of 10 attributes with considerable gains on sparse representations, which highlights the strong smoothing power of LDA models. For the first time, we extend the attribute selection task to a new data set with more than 200 classes. We observe that large-scale attribute selection is a hard problem, but a subset of attributes performs robustly on the large scale as well. Again, the LDA models outperform the VSM baseline.
3 0.092187747 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni
Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp
4 0.080482766 17 emnlp-2011-Active Learning with Amazon Mechanical Turk
Author: Florian Laws ; Christian Scheible ; Hinrich Schutze
Abstract: Supervised classification needs large amounts of annotated training data that is expensive to create. Two approaches that reduce the cost of annotation are active learning and crowdsourcing. However, these two approaches have not been combined successfully to date. We evaluate the utility of active learning in crowdsourcing on two tasks, named entity recognition and sentiment detection, and show that active learning outperforms random selection of annotation examples in a noisy crowdsourcing scenario.
5 0.073695533 128 emnlp-2011-Structured Relation Discovery using Generative Models
Author: Limin Yao ; Aria Haghighi ; Sebastian Riedel ; Andrew McCallum
Abstract: We explore unsupervised approaches to relation extraction between two named entities; for instance, the semantic bornIn relation between a person and location entity. Concretely, we propose a series of generative probabilistic models, broadly similar to topic models, each of which generates a corpus of observed triples of entity mention pairs and the surface syntactic dependency path between them. The output of each model is a clustering of observed relation tuples and their associated textual expressions to underlying semantic relation types. Our proposed models exploit entity type constraints within a relation as well as features on the dependency path between entity mentions. We examine the effectiveness of our approach via multiple evaluations and demonstrate 12% error reduction in precision over a state-of-the-art weakly supervised baseline.
6 0.072025917 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
7 0.060837489 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
8 0.059461202 96 emnlp-2011-Multilayer Sequence Labeling
9 0.057431303 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use
10 0.054250624 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
11 0.052119806 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
12 0.051188339 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents
13 0.049867641 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases
14 0.048276134 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
15 0.045640793 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week
16 0.045616131 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus
17 0.043818191 39 emnlp-2011-Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
18 0.041242909 113 emnlp-2011-Relation Acquisition using Word Classes and Partial Patterns
19 0.041059822 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
20 0.040475771 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
topicId topicWeight
[(0, 0.154), (1, -0.102), (2, -0.062), (3, -0.012), (4, -0.041), (5, 0.014), (6, -0.002), (7, -0.029), (8, -0.065), (9, 0.023), (10, 0.028), (11, 0.022), (12, -0.092), (13, 0.1), (14, -0.147), (15, -0.065), (16, 0.1), (17, -0.016), (18, 0.011), (19, 0.028), (20, 0.05), (21, 0.085), (22, 0.031), (23, -0.187), (24, 0.015), (25, -0.021), (26, 0.232), (27, -0.02), (28, -0.137), (29, -0.028), (30, -0.138), (31, 0.268), (32, 0.092), (33, 0.075), (34, 0.167), (35, -0.032), (36, 0.099), (37, -0.159), (38, 0.097), (39, -0.061), (40, 0.045), (41, -0.092), (42, -0.076), (43, -0.205), (44, 0.021), (45, 0.03), (46, 0.105), (47, -0.044), (48, 0.061), (49, -0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.92602068 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
Author: Duangmanee Putthividhya ; Junling Hu
Abstract: We present a named entity recognition (NER) system for extracting product attributes and values from listing titles. Information extraction from short listing titles presents a unique challenge, given the lack of informative context and grammatical structure. In this work, we combine supervised NER with bootstrapping to expand the seed list, and output normalized results. Focusing on listings from eBay’s clothing and shoes categories, our bootstrapped NER system is able to identify new brands corresponding to spelling variants and typographical errors of the known brands, as well as novel brands. Among the top 300 new brands predicted, our system achieves 90.33% precision. To output normalized attribute values, we explore several string comparison algorithms and find n-gram substring matching to work well in practice.
2 0.56220019 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
Author: Matthias Hartung ; Anette Frank
Abstract: This paper introduces an attribute selection task as a way to characterize the inherent meaning of property-denoting adjectives in adjective-noun phrases, such as e.g. hot in hot summer denoting the attribute TEMPERATURE, rather than TASTE. We formulate this task in a vector space model that represents adjectives and nouns as vectors in a semantic space defined over possible attributes. The vectors incorporate latent semantic information obtained from two variants of LDA topic models. Our LDA models outperform previous approaches on a small set of 10 attributes with considerable gains on sparse representations, which highlights the strong smoothing power of LDA models. For the first time, we extend the attribute selection task to a new data set with more than 200 classes. We observe that large-scale attribute selection is a hard problem, but a subset of attributes performs robustly on the large scale as well. Again, the LDA models outperform the VSM baseline.
3 0.51646221 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use
Author: Jette Viethen ; Robert Dale ; Markus Guhe
Abstract: Traditional computational approaches to referring expression generation operate in a deliberate manner, choosing the attributes to be included on the basis of their ability to distinguish the intended referent from its distractors. However, work in psycholinguistics suggests that speakers align their referring expressions with those used previously in the discourse, implying less deliberate choice and more subconscious reuse. This raises the question as to which is a more accurate characterisation of what people do. Using a corpus of dialogues containing 16,358 referring expressions, we explore this question via the generation of subsequent references in shared visual scenes. We use a machine learning approach to referring expression generation and demonstrate that incorporating features that correspond to the computational tradition does not match human referring behaviour as well as using features corresponding to the process of alignment. The results support the view that the traditional model of referring expression generation that is widely assumed in work on natural language generation may not in fact be correct; our analysis may also help explain the oft-observed redundancy found in human-produced referring expressions.
4 0.34786987 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni
Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp
5 0.34444627 96 emnlp-2011-Multilayer Sequence Labeling
Author: Ai Azuma ; Yuji Matsumoto
Abstract: In this paper, we describe a novel approach to cascaded learning and inference on sequences. We propose a weakly joint learning model on cascaded inference on sequences, called multilayer sequence labeling. In this model, inference on sequences is modeled as cascaded decision. However, the decision on a sequence labeling sequel to other decisions utilizes the features on the preceding results as marginalized by the probabilistic models on them. It is not novel itself, but our idea central to this paper is that the probabilistic models on succeeding labeling are viewed as indirectly depending on the probabilistic models on preceding analyses. We also propose two types of efficient dynamic programming which are required in the gradient-based optimization of an objective function. One of the dynamic programming algorithms resembles the back-propagation algorithm for multilayer feed-forward neural networks. The other is a generalized version of the forward-backward algorithm. We also report experiments of cascaded part-of-speech tagging and chunking of English sentences and show effectiveness of the proposed method.
6 0.32885921 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents
7 0.30313647 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
8 0.28692013 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
9 0.27549681 17 emnlp-2011-Active Learning with Amazon Mechanical Turk
10 0.27422717 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics
11 0.27258348 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
12 0.23858051 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
13 0.23524266 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues
14 0.21220019 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
15 0.21147394 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
16 0.20487213 129 emnlp-2011-Structured Sparsity in Structured Prediction
17 0.19966634 128 emnlp-2011-Structured Relation Discovery using Generative Models
18 0.19936019 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
19 0.19631833 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition
20 0.19313055 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases
topicId topicWeight
[(15, 0.013), (16, 0.309), (23, 0.112), (36, 0.033), (37, 0.047), (45, 0.07), (52, 0.011), (53, 0.017), (54, 0.02), (57, 0.031), (62, 0.028), (64, 0.02), (66, 0.033), (69, 0.017), (79, 0.044), (82, 0.023), (87, 0.011), (90, 0.014), (96, 0.037), (98, 0.021)]
simIndex simValue paperId paperTitle
1 0.77730924 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article
Author: Dani Yogatama ; Michael Heilman ; Brendan O'Connor ; Chris Dyer ; Bryan R. Routledge ; Noah A. Smith
Abstract: We consider the problem of predicting measurable responses to scientific articles based primarily on their text content. Specifically, we consider papers in two fields (economics and computational linguistics) and make predictions about downloads and within-community citations. Our approach is based on generalized linear models, allowing interpretability; a novel extension that captures first-order temporal effects is also presented. We demonstrate that text features significantly improve accuracy of predictions over metadata features like authors, topical categories, and publication venues.
same-paper 2 0.69922835 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
Author: Duangmanee Putthividhya ; Junling Hu
Abstract: We present a named entity recognition (NER) system for extracting product attributes and values from listing titles. Information extraction from short listing titles presents a unique challenge, given the lack of informative context and grammatical structure. In this work, we combine supervised NER with bootstrapping to expand the seed list, and output normalized results. Focusing on listings from eBay’s clothing and shoes categories, our bootstrapped NER system is able to identify new brands corresponding to spelling variants and typographical errors of the known brands, as well as novel brands. Among the top 300 new brands predicted, our system achieves 90.33% precision. To output normalized attribute values, we explore several string comparison algorithms and find n-gram substring matching to work well in practice.
3 0.45729807 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
4 0.45616972 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman
Abstract: In this paper we present a fully unsupervised syntactic class induction system formulated as a Bayesian multinomial mixture model, where each word type is constrained to belong to a single class. By using a mixture model rather than a sequence model (e.g., HMM), we are able to easily add multiple kinds of features, including those at both the type level (morphology features) and token level (context and alignment features, the latter from parallel corpora). Using only context features, our system yields results comparable to the state of the art, far better than a similar model without the one-class-per-type constraint. Using the additional features provides added benefit, and our final system outperforms the best published results on most of the 25 corpora tested.
5 0.45554599 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
Author: Marco Dinarelli ; Sophie Rosset
Abstract: Reranking models have been successfully applied to many tasks of Natural Language Processing. However, there are two aspects of this approach that need a deeper investigation: (i) Assessment of hypotheses generated for reranking at classification phase: baseline models generate a list of hypotheses and these are used for reranking without any assessment; (ii) Detection of cases where reranking models provide a worst result: the best hypothesis provided by the reranking model is assumed to be always the best result. In some cases the reranking model provides an incorrect hypothesis while the baseline best hypothesis is correct, especially when baseline models are accurate. In this paper we propose solutions for these two aspects: (i) a semantic inconsistency metric to select possibly more correct n-best hypotheses, from a large set generated by an SLU baseline model. The selected hypotheses are reranked applying a state-of-the-art model based on Partial Tree Kernels, which encode SLU hypotheses in Support Vector Machines with complex structured features; (ii) finally, we apply a decision strategy, based on confidence values, to select the final hypothesis between the first ranked hypothesis provided by the baseline SLU model and the first ranked hypothesis provided by the re-ranker. We show the effectiveness of these solutions presenting comparative results obtained reranking hypotheses generated by a very accurate Conditional Random Field model. We evaluate our approach on the French MEDIA corpus. The results show significant improvements with respect to current state-of-the-art and previous re-ranking models.
6 0.45380849 136 emnlp-2011-Training a Parser for Machine Translation Reordering
7 0.45362946 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
8 0.45290577 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
9 0.45044646 128 emnlp-2011-Structured Relation Discovery using Generative Models
10 0.44980517 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models
11 0.44871739 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
12 0.44841948 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
13 0.44737783 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
14 0.44618604 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
15 0.44600376 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
16 0.44600177 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
17 0.44553363 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
18 0.44396925 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
19 0.4436326 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference
20 0.44216439 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics